# DESCRIPTION ``` Package: rnaends Title: RNA-ends sequencing specialized package featuring pre-processing, mapping, quantification and analysis of 5'-end or 3'-end RNAs sequencing data Version: 1.0.0 Authors@R: c( person("Roland", "Barriot", , "roland.barriot@gmail.com", role = c("aut", "cre"), comment = c(ORCID = "0000-0003-0298-2696")), person("Tomas", "Caetano", role = "aut", comment = c(ORCID = "0009-0008-4678-2796")), person("Gwennaele", "Fichant", role = "ctb", comment = c(ORCID = "0000-0001-5480-0239")), person("Peter", "Redder", role = "ctb", comment = c(ORCID = "0000-0002-3619-632X")) ) Description: The rnaends R package focuses on exact 5'-end and/or 3'-end quantification of RNAs via targeted RNA-Seq. License: GPL (>= 3) Imports: tidyverse (>= 2.0.0), ggupset (>= 0.4), DSS (>= 2.48.0), Rbowtie (>= 1.42.0), Rsamtools (>= 2.16.0), Rsubread (>= 2.16.0), ShortRead (>= 1.58.0), survcomp (>= 1.52.0), VGAM (>= 1.1-9) Remotes: bioc/Biostrings, bioc/DSS, bioc/Rbowtie, bioc/Rsamtools, bioc/Rsubread, bioc/ShortRead, bioc/survcomp Suggests: knitr (>= 1.49), rmarkdown (>= 2.29), testthat (>= 3.0.0) Encoding: UTF-8 Roxygen: list(markdown = TRUE) RoxygenNote: 7.3.2 LazyData: true VignetteBuilder: knitr Depends: R (>= 4.4.2) ``` # `add_read_feature` Add a feature to the `read_features` object. ## Description This function adds a feature to the `read_features` tibble returned by `init_read_features` or `add_read_feature`. ## Usage ```r add_read_feature( read_features, name, start, width, pattern, pattern_type = "constant", max_mismatch = 0, readid_prepend = FALSE ) ``` ## Arguments * `read_features`: `read_features` tibble * `name`: Name of the feature. * `start`: Relative start of the feature. * `width`: Width of the feature. * `pattern`: Characters allowed in the feature sequence. * `pattern_type`: Type of the pattern given: "nucleotides", "constant" or "regexp". * `max_mismatch`: Maximum number of mismatches allowed in the feature sequence. * `readid_prepend`: If TRUE the sequence of this feature will be prepended to the read id (in front of) in the FASTQ file. ## Seealso [read_feature_pattern_types](read-feature-pattern-types) for pattern types, [parse_reads](parse-reads), [demultiplex](demultiplex) ## Value The given read_features object with the added feature # `demultiplex` Demultiplex a FASTQ file according to barcode sequence reads provided through a `read_features` object ## Description This function parses each read to retrieve features in the sequence and test their validity. Valid reads are written in separate files (fastq_file.barcode.fastq.gz) and if keep_invalid_reads = TRUE, invalid reads are write in fastq_file.invalid.fastq.gz It also writes a feature report which contains features given in the features parameter and a demultiplex report which contains number of valid features for each barcode. ## Usage ```r demultiplex( fastq_file, features, out_dir = NULL, keep_invalid_reads = FALSE, yieldsize = 1e+06, min_read_length = 12, write_read_parsing_tb = FALSE, force = FALSE ) ``` ## Arguments * `fastq_file`: Path of the FASTQ file (eventually gzipped) containing single-end reads * `features`: read_features object * `out_dir`: Path of the output directory where the demultiplexed FASTQ files will be stored * `keep_invalid_reads`: If TRUE invalid reads will be kept in a dedicated file * `yieldsize`: Number of read to process per iteration * `force`: If TRUE recursively remove the content of out_dir before running instead of stopping the function with an error message ## Value A tibble with demultiplexing statistics # `differential_proportion` Wrapper of differential_proportion_one_stranded function allow to compare proportion of both strand ## Description Wrapper of differential_proportion_one_stranded function allow to compare proportion of both strand ## Usage ```r differential_proportion( count_table, design_matrix, group1, group2, cores = 1, ... ) ``` ## Arguments * `count_table`: Count table * `design_matrix`: Description of the sample_ids (it must contains a sample_id,group,replicate,count_of(total/fraction) columns) * `group1`: Name of strain 1 * `group2`: Name of strain 2 * `cores`: number of cores used in the mclappy function * `...`: arguments that can be pass to the DMLtest function ## Value a tibble with for each position the pvalue # `differential_proportion2` Statistical comparison of 2 sets of proportion values taking into account replicates ## Description Statistical comparison of 2 sets of proportion values taking into account replicates ## Usage ```r differential_proportion2(count_table, design_matrix, group1, group2, ...) ``` ## Arguments * `count_table`: Count table * `design_matrix`: Description of the samples (it must contains a sample_id, group, replicate, count_of(fraction/total) columns) * `group1`: Name of strain 1 * `group2`: Name of strain 2 * `...`: arguments that can be pass to the DMLtest function ## Value a tibble with for each position the pvalue # `.parse_yield` Parse features in a yield from a FASTQ stream reader ## Description Parse features in a yield from a FASTQ stream reader ## Usage ```r .parse_yield(yield, features, min_read_length) ``` ## Arguments * `yield`: A yield of ShortReads from a FastQStreamer. * `features`: `read_features` to be parsed. * `min_read_length`: minimum length of readseq after parsing. ## Value a tibble with statistics for each read feature. # `.trim_yield` Trim ShortReads in a yield of a FastQStreamer to keep only a `readseq` feature of a `read_features` object. ## Description Trim ShortReads in a yield of a FastQStreamer to keep only a `readseq` feature of a `read_features` object. ## Usage ```r .trim_yield(yield, features, read_parsed) ``` ## Arguments * `yield`: A yield of ShortReads from a FastQStreamer. * `features`: `read_features` to be parsed. * `read_parsed`: minimum length of readseq after parsing. ## Value FastQStream yield with read sequences trimmed to the region corresponding to the `readseq` feature of an `read_features` object. # `FFT_profile` FFT analysis applied to counts or metacounts to detect periodicity ## Description This function is a wrapper for the `stats::fft` to detect periodicity in the signal composed of counts (of RNA ends) on a given region (or metacounts profile) ## Usage ```r FFT_profile(counts, max_period = 25) ``` ## Arguments * `counts`: table with positions (x) and values (counts) * `max_period`: results haivng a period over this threshold are discarded ## Value A tibble with `signal`, `freq` and `period` columns. # `FFT` FFT analysis over one or more regions ## Description This function computes the periodicity of the (meta)gene count profiles ## Usage ```r FFT( count_table = NULL, region_tb, by_sample = FALSE, by_region = FALSE, min_quantile = 0.25, max_period = 25, ... ) ``` ## Arguments * `count_table`: Count table (if not empty bam_files will be ignored) * `region_tb`: a tibble with `rname`, `strand` and `position` for all regions to extract * `by_sample`: if FALSE, the counts are aggregated (sum) over the different samples * `by_region`: if FALSE, the counts are aggregated (median) over all regions aligned on the starting location of each region. * `min_quantile`: if by_region is FALSE and the regions are aggregated, all regions must be the same length. In this case, min_quantile is used to compute this lenght wich is given by the quantile of the distribution of all the region lentghts. By default, it is 0.25 which means that 75% of the regions are at least this length. * `max_period`: results haivng a period over this threshold are discarded * `...`: parameters of the quantify_from_bam_files function if no count table is provided ## Value A tibble with at least `signal`, `freq` and `period` columns. If `by_sample` and/or `by_region` is TRUE, the corresponding columns are also present in the results, *i.e.* respectively `sample_id` and/or `(rname, strand, location)`. # `frame_preference` Global frame preference ## Description This function computes the global frame preference of the reads distribution by computing the number of counts falling into each translation frame. ## Usage ```r frame_preference( count_table = NULL, region_tb, by_sample = FALSE, by_region = FALSE, ... ) ``` ## Arguments * `count_table`: (tibble) Count table (if not empty `bam_files` will be ignored) * `region_tb`: (character) Path of a gff file * `by_sample`: if FALSE, the counts are aggregated (sum) over the different samples * `by_region`: if FALSE, the counts are aggregated (median) over all regions aligned on the starting location of each region. ## Value A tibble with `frame` and `pc` columns. # `init_read_features` Initialize a `read_features` object. ## Description Initializes the `read_features` object with the minimal feature *readseq* corresponding to the region in the read sequence that is expected to align to a reference sequence. ## Usage ```r init_read_features( start, width, pattern = "ACTG", max_mismatch = 0, read_features_file = NA ) ``` ## Arguments * `start`: Relative start of the *readseq* feature. * `width`: Width of the *readseq* feature. * `pattern`: Characters allowed in the *readseq* sequence. * `max_mismatch`: Maximum number of mismatches allowed in the *readseq* sequence. * `read_features_file`: Path of the *read_feature* CSV file if already built and saved. ## Seealso [add_read_feature](add-read-feature) for adding other features, [parse_reads](parse-reads), [demultiplex](demultiplex) ## Value A *read_features* tibble with at least 1 row corresponding to the region of the read sequence that is to be aligned to reference sequences. ## Examples ```r # only a small region is to be aligned rf = init_read_features(start = 14, width = 19) # the whole 100nt long read should be aligned to the reference rf = init_read_features(start = 1, width = 100) rf ``` # `load_bam_as_tibble` Bam file parser ## Description This function parse a given bam file and return different mapping informations in a tibble ## Usage ```r load_bam_as_tibble(bam_file, features = NULL) ``` ## Arguments * `bam_file`: (character) Path of the bam file * `UMI_length`: (integer) Width of the UMI (if UMIs have not been put in the qname let this value default) ## Value (tibble) with mapping informations # `map_fastq_files` Wrapper of the map_fastq_file function which allows the mapping of multiples FASTQ files. ## Description The function takes a vector of FASTQ file paths and others map_fastq parameters. All parameters can be replaced by a "mapping table" which is a tibble where each column correspond to a parameters of the map_fastq function. ## Usage ```r map_fastq_files( list_fastq_files, fasta_name, index_name, max_map_mismatch = 0, bam_dir = "", aligner = "bowtie", threads = 3, map_unique = TRUE, force = FALSE, mapping_tb = NULL ) ``` ## Arguments * `list_fastq_files`: (character) or (vector of character) Path of the FASTQ file (eventually gzipped) containing single-end EMOTE reads * `fasta_name`: (character) Path of the FASTA genome reference file * `index_name`: (character) Name of the index (index will be created if not exist) * `max_map_mismatch`: a int corresponding to the number of allowed mismatch * `bam_dir`: (character) Path of the output directory * `aligner`: (character) Mapping aligner (either "bowtie" or "subread") * `threads`: (integer) Number of threads used * `force`: (logical) If TRUE overwrite already existing bam file * `mapping_tb`: (tibble) where colnames = parameters of the map_fastq_file function ## Value (character) bam file created # `map_fastq` Perform the mapping of a FASTQ file using either bowtie or subread method. ## Description The function takes a FASTQ file and a genome FASTA file. An index is created if not existing, reads are mapped on the given reference genome. ## Usage ```r map_fastq( fastq_file, fasta_name, index_name, max_map_mismatch = 0, bam_dir = "", aligner = "bowtie", threads = 3, map_unique = TRUE, force = FALSE ) ``` ## Arguments * `fastq_file`: Path of the FASTQ file (eventually gzipped) containing single-end reads * `fasta_name`: Path of the FASTA genome reference file * `index_name`: Name of the index (index will be created if not exist) * `max_map_mismatch`: a int corresponding to the number of allowed mismatch * `bam_dir`: Path of the output directory * `aligner`: Mapping aligner (either "bowtie" or "subread") * `threads`: Number of threads used * `force`: If TRUE overwrite already existing bam file ## Value Path of BAM file created # `map_stats_bam_files` Mapping stat ## Description This function parses the specified bam_files and returns simple stats such as the percentage of reads mapped. The result is saved in bam_file.map_stats.csv so that it is not done again on the same BAM file if it is older that the stats or force is set to FALSE. ## Usage ```r map_stats_bam_files(bam_files, force = FALSE) ``` ## Arguments * `bam_files`: vector of BAM filenames * `force`: (default FALSE) Force the computation of stats. ## Value tibble with mapping statistics # `map_stats` Mapping stat ## Description This function parses the specified BAM FILE in bam_file and fastq_files and returns mapping stats such reads mapped. The result is saved in bam_file.map_stats.csv so that it is not done again on the same BAM file if it is older that the stats or force is set to FALSE. ## Usage ```r map_stats(bam_file, force = FALSE) ``` ## Arguments * `bam_file`: Path of the BAM filename. * `force`: (default FALSE) Force the computation of stats. ## Value mapping statistics. # `metacounts_feature` retrieve normalized (meta)counts around given feature(s) specified by a tibble/data.frame with their `rname`, `strand` and `position.` ## Description This function takes as input the features to extract and the `width` to consider around each feature specified by its `rname`, `strand`and `position`. ## Usage ```r metacounts_feature( count_table = NULL, feature_tb, width = 100, by_sample = FALSE, by_feature = FALSE, ... ) ``` ## Arguments * `count_table`: count table (if not empty `bam_files` will be ignored) * `feature_tb`: a tibble with `rname`, `strand` and `position` for all regions to extract * `width`: positions to extract upstream and downstream of the features * `by_sample`: if FALSE, the counts are aggregated over the different samples * `by_feature`: if FALSE, the counts are aggregated over all regions aligned on the starting location of the feature. ## Value A tibble with at least `x` and `count` columns. `x` is the relative distance to the location on the reference sequence `rname` and `strand`, and `count` is the normalized count on the region (counts sum is 1). If `by_sample` and/or `by_feature` is TRUE, the corresponding columns are also present in the results, i.e. respectively `sample_id` and/or `(rname, strand, location)`. # `metacounts_normalized` normalize (meta)counts on region(s) specified by strand, location and width ## Description This function takes a input one or more regions and normalizes the counts by region and sample_id so that they sum up to 1. It is used to balance the effect of highly transcribed over less transcribed regions, and also varying sequencing depths. ## Usage ```r metacounts_normalized( count_table, rname, strand, location, width, around = FALSE, by_sample = FALSE, by_region = FALSE, min_quantile = 0.25 ) ``` ## Arguments * `count_table`: count table as returned from quantification, i.e. a tibble with the following columns: sample_id, rname, position, strand, count. * `rname`: rname(s) one reference/chromosome name or a vector of names. * `strand`: strand(s) a vector of the same length as rname to specify each strand. * `location`: location(s) a vector of the same length as rname to specify each position on the reference. * `width`: width(s) a vector of the same length as rname to specify the width upstream and downstream the location (if param around is TRUE) or the length of the region (if param around is FALSE)0 * `around`: specify if the location is the begninning of the region or its the center. * `by_sample`: if FALSE, the counts are aggregated (sum) over the different samples * `by_region`: if FALSE, the counts are aggregated (median) over all regions aligned on the starting location of each region. * `min_quantile`: if by_region is FALSE and the regions are aggregated, all regions must be the same length. In this case, min_quantile is used to compute this lenght wich is given by the quantile of the distribution of all the region lentghts. By default, it is 0.25 which means that 75% of the regions are at least this length. ## Value A tibble with at least `x` and `count` columns. `x` is the relative distance to the location on the reference sequence `rname` and `strand`, and `count` is the normalized count on the region (counts sum is 1). If `by_sample` and/or `by_region` is TRUE, the corresponding columns are also present in the results, i.e. respectively `sample_id` and/or `(rname, strand, location)`. # `metacounts_region` extract normalized counts for a given feature and optionnaly aggregates them into metacounts ## Description This function takes a input one or more regions and normalizes the counts by region and sample_id so that they sum up to 1. It is used to balance the effect of highly transcribed over less transcribed regions, and also varying sequencing depths. ## Usage ```r metacounts_region( count_table = NULL, region_tb, by_sample = FALSE, by_region = FALSE, min_quantile = 0.25, ... ) ``` ## Arguments * `count_table`: count table (if not empty bam_files will be ignored) * `region_tb`: a tibble with `rname`, `strand` and `position` for all regions to extract * `by_sample`: if FALSE, the counts are aggregated (sum) over the different samples * `by_region`: if FALSE, the counts are aggregated (median) over all regions aligned on the starting location of each region. * `width`: positions to extract downstream of the specified start position ## Value A tibble with at least x and count columns. x is the relative distance to the feature and count is normalized count on the region. If `by_sample` and/or `by_region` is TRUE, the corresponding columns are also present in the results, i.e. respectively `sample_id` and/or `(rname, strand, location)`. # `parse_reads` Parse each read from a FASTQ file to retrieve features in the read sequence and test their validity. ## Description Valid reads are written to a dedicated file and if `keep_invalid_reads` is TRUE, invalid reads are written to `fastq_file.invalid.fastq.gz`. It also writes a feature report which contains features given in the features parameter and a parsing report which contains number of read with valid features. ## Usage ```r parse_reads( fastq_file, features, out_dir = NULL, keep_invalid_reads = FALSE, min_read_length = 18, yieldsize = 1e+06, write_read_parsing_tb = FALSE, force = FALSE ) ``` ## Arguments * `fastq_file`: Path of the FASTQ file (eventually gzipped) containing single-end reads * `features`: read_features tibble * `out_dir`: Path of the output directory where the demultiplexed FASTQ files will be stored * `keep_invalid_reads`: If TRUE invalid reads will be kept in a dedicated file * `min_read_length`: Minimum width of reads after parsing * `yieldsize`: Number of read to process per iteration * `write_read_parsing_tb`: writes statistics table to file * `force`: If TRUE recursively remove the content of out_dir before running instead of stopping the function with an error message ## Value A tibble with parsing statistics. # `print.read_features` Utility function to display read_features. ## Description Utility function to display read_features. ## Usage ```r # S3 method for read_features print(rf) ``` ## Arguments * `rf`: `read_features` tibble # `quantify_3prime` Function used to quantify the exact 3'-ends of mapped reads ## Description Caller of load_bam_as_tibble and summarise_alignements ## Usage ```r quantify_3prime( bam_file, features = NULL, keep_UMI = FALSE, min_mapped_length = NA, rev_comp_reads = FALSE, force = FALSE ) ``` ## Arguments * `bam_file`: (character) Path of the bam file containing single-end EMOTE reads mapped * `keep_UMI`: (logical) If FALSE the count value corresponds to the number fo different UMI at an exact position ## Value a tibble with quantification of the reads in the given bam files, by exact position (and by UMI if UMI_length>0) # `quantify_5prime` Function used to quantify the exact 5'-ends of mapped reads ## Description Caller of load_bam_as_tibble and summarise_alignements ## Usage ```r quantify_5prime(bam_file, features = NULL, keep_UMI = FALSE, force = FALSE) ``` ## Arguments * `keep_UMI`: (logical) If FALSE the count value corresponds to the number fo different UMI at an exact position * `bam_files`: (character) Path of the bam file containing single-end EMOTE reads mapped * `UMI_length`: (integer) Width of UMI, default is 0 (if 0 the read id qname will be return) ## Value a tibble with quantification of the reads in the given bam files, by exact position (and by UMI if UMI_length>0) # `quantify_downstream` Function used to quantify the exact 3'-ends of mapped reads in XU et al. 2023 ## Description Function used to quantify the exact 3'-ends of mapped reads in XU et al. 2023 ## Usage ```r quantify_downstream(bam_file, umi.length, min.map.length, force = FALSE) ``` ## Arguments * `bam_file`: (character) Path of the bam file containing single-end EMOTE reads mapped ## Value a tibble with quantification of the reads in the given bam files, by exact position (and by UMI if UMI_length>0) # `quantify_from_bam_files` Wrapper of quantify_5prime and quantify_3prime quantification functions. ## Description This function takes a vector of bam files and call the "fun" quantification function. It allow the parralelization through the "cores" parameters ## Usage ```r quantify_from_bam_files(bam_files, sample_ids, fun, ...) ``` ## Arguments * `bam_files`: (vector of characters) Path of the bam files containing single-end EMOTE reads mapped * `sample_ids`: (vector of characters) corresponding to a string for identifying the source of reads * `fun`: (character) The quantification method used (quantify_3prime or quantify_5prime) * `...`: parameters to give to the "fun" function * `cores`: (integer) Number of cores ## Value a tibble with quantification of the reads in the given bam files, by exact position (and by UMI if UMI_length>0) # `read_feature_pattern_types` Read feature pattern types ## Description Enumeration of the allowed pattern types. ## Usage ```r read_feature_pattern_types ``` ## Keyword datasets ## Format An object of class `character` of length 3. ## Details A vector of characters with 3 elements: *nucleotides*, *constant* and *regexp*: * `nucleotides`: for random sequences composed of the specifies nucleotides (defaults are A, T, C , G) *i.e* the mappable *readseq* or the *UMIs*. * `constant`: for a constant string or a vector of constant strings, *i.e.* allowed *barcodes* for multiplexed samples or some *control sequence* that should be present in the reads. A maximum number of mismatches allowed can be specified. * `regepx`: for more complicated patterns that can occur anywhere such as polyA tails. # `reads_parse_report_ggupset` Utility function to plot an upset plot from `parse_reads` and `demultiplex` reports using ggupset package. ## Description Utility function to plot an upset plot from `parse_reads` and `demultiplex`reports using ggupset package. ## Usage ```r reads_parse_report_ggupset(report, angle = 45, margin = 0.1) ``` ## Arguments * `report`: report returned by `parse_reads` or `demultiplex` # `reads_parse_report_to_upsetr_expression` Utility function to convert `parse_reads` and `demultiplex` reports to expression for UpSetR plot. ## Description Utility function to convert `parse_reads` and `demultiplex` reports to expression for UpSetR plot. ## Usage ```r reads_parse_report_to_upsetr_expression(report) ``` ## Arguments * `report`: report returned by `parse_reads` or `demultiplex` # `summarise_alignements` Parse mapping information retrieve by load_bam_as_tibble ## Description This function parse the output of load_bam_as_tibble and count the number of reads mapped at each exact genomic position ## Usage ```r summarise_alignements(alns, UMI_length, keep_UMI) ``` ## Arguments * `alns`: (tibble) Mapping table returned by load_bam_as_tibble * `UMI_length`: (integer) Width of the UMI (if UMIs have not been put in the qname let this value default) * `keep_UMI`: (logical) If TRUE the result tibble will display the count value for each UMI at each exact position ## Value (tibble) count table # `TSS_betabinomial` Computes a p-value for a single position using the betabinomial model as in TSS-EMOTE paper. ## Description Computes a p-value for a single position using the betabinomial model as in TSS-EMOTE paper. ## Usage ```r TSS_betabinomial(k_total, k_mono, n_total, n_mono) ``` ## Arguments * `k_total`: Counts on the +RppH (total) at a specific position * `k_mono`: Counts on the -RppH (mono) at a specific position * `n_total`: Total counts on the +RppH (total) in the RNA-Seq library * `n_mono`: Total counts on the -RppH (mono) in the RNA-Seq library ## Value The p-value computed for the given position. # `TSS` Identification of Transcription Start Sites (TSS) as in TSS-EMOTE paper ## Description This function computes a p-value for each position present in a count table. P-values of replicate samples are combined and adjusted for multiple testing. ## Usage ```r TSS( count_table, design_table, combine_method = "fisher", fdr_method = "BH", threshold = 5 ) ``` ## Arguments * `count_table`: Count table * `design_table`: Description of the samples. It must contains the following columns: (`sample_id`, `group`, `replicate`, `count_of`(total/fraction). * `combine_method`: P-values combination method to use, default is "fisher" ("fisher", "z.transform", "logit") * `fdr_method`: P-values adjustment method, default is "BH" ("holm", "hochberg", "hommel", "bonferroni", "BH", "BY","fdr", "none")) * `threshold`: Minimum counts in the total sample for the position to be considered ## Value ### A tibble containing the results for each position, *i.e.* with columns `position`, `strand`, `pvalue`, `pvalue_adj`.