DESCRIPTION
Package: rnaends
Title: RNA-ends sequencing specialized package featuring pre-processing, mapping, quantification and analysis of 5'-end or 3'-end RNAs sequencing data
Version: 1.0.0
Authors@R: c(
person("Roland", "Barriot", , "roland.barriot@gmail.com", role = c("aut", "cre"),
comment = c(ORCID = "0000-0003-0298-2696")),
person("Tomas", "Caetano", role = "aut",
comment = c(ORCID = "0009-0008-4678-2796")),
person("Gwennaele", "Fichant", role = "ctb",
comment = c(ORCID = "0000-0001-5480-0239")),
person("Peter", "Redder", role = "ctb",
comment = c(ORCID = "0000-0002-3619-632X"))
)
Description: The rnaends R package focuses on exact 5'-end and/or 3'-end quantification of RNAs via targeted RNA-Seq.
License: GPL (>= 3)
Imports:
tidyverse (>= 2.0.0),
ggupset (>= 0.4),
DSS (>= 2.48.0),
Rbowtie (>= 1.42.0),
Rsamtools (>= 2.16.0),
Rsubread (>= 2.16.0),
ShortRead (>= 1.58.0),
survcomp (>= 1.52.0),
VGAM (>= 1.1-9)
Remotes:
bioc/Biostrings,
bioc/DSS,
bioc/Rbowtie,
bioc/Rsamtools,
bioc/Rsubread,
bioc/ShortRead,
bioc/survcomp
Suggests:
knitr (>= 1.49),
rmarkdown (>= 2.29),
testthat (>= 3.0.0)
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.3.2
LazyData: true
VignetteBuilder: knitr
Depends:
R (>= 4.4.2)
add_read_feature
Add a feature to the read_features
object.
Description
This function adds a feature to the read_features
tibble returned by init_read_features
or add_read_feature
.
Usage
add_read_feature(
read_features,
name,
start,
width,
pattern,
pattern_type = "constant",
max_mismatch = 0,
readid_prepend = FALSE
)
Arguments
read_features
:read_features
tibblename
: Name of the feature.start
: Relative start of the feature.width
: Width of the feature.pattern
: Characters allowed in the feature sequence.pattern_type
: Type of the pattern given: “nucleotides”, “constant” or “regexp”.max_mismatch
: Maximum number of mismatches allowed in the feature sequence.readid_prepend
: If TRUE the sequence of this feature will be prepended to the read id (in front of) in the FASTQ file.
Seealso
read_feature_pattern_types for pattern types, parse_reads, demultiplex
Value
The given read_features object with the added feature
demultiplex
Demultiplex a FASTQ file according to barcode sequence reads provided through a read_features
object
Description
This function parses each read to retrieve features in the sequence and test their validity. Valid reads are written in separate files (fastq_file.barcode.fastq.gz) and if keep_invalid_reads = TRUE, invalid reads are write in fastq_file.invalid.fastq.gz It also writes a feature report which contains features given in the features parameter and a demultiplex report which contains number of valid features for each barcode.
Usage
demultiplex(
fastq_file,
features,
out_dir = NULL,
keep_invalid_reads = FALSE,
yieldsize = 1e+06,
min_read_length = 12,
write_read_parsing_tb = FALSE,
force = FALSE
)
Arguments
fastq_file
: Path of the FASTQ file (eventually gzipped) containing single-end readsfeatures
: read_features objectout_dir
: Path of the output directory where the demultiplexed FASTQ files will be storedkeep_invalid_reads
: If TRUE invalid reads will be kept in a dedicated fileyieldsize
: Number of read to process per iterationforce
: If TRUE recursively remove the content of out_dir before running instead of stopping the function with an error message
Value
A tibble with demultiplexing statistics
differential_proportion
Wrapper of differential_proportion_one_stranded function allow to compare proportion of both strand
Description
Wrapper of differential_proportion_one_stranded function allow to compare proportion of both strand
Usage
differential_proportion(
count_table,
design_matrix,
group1,
group2,
cores = 1,
...
)
Arguments
count_table
: Count tabledesign_matrix
: Description of the sample_ids (it must contains a sample_id,group,replicate,count_of(total/fraction) columns)group1
: Name of strain 1group2
: Name of strain 2cores
: number of cores used in the mclappy function...
: arguments that can be pass to the DMLtest function
Value
a tibble with for each position the pvalue
differential_proportion2
Statistical comparison of 2 sets of proportion values taking into account replicates
Description
Statistical comparison of 2 sets of proportion values taking into account replicates
Usage
differential_proportion2(count_table, design_matrix, group1, group2, ...)
Arguments
count_table
: Count tabledesign_matrix
: Description of the samples (it must contains a sample_id, group, replicate, count_of(fraction/total) columns)group1
: Name of strain 1group2
: Name of strain 2...
: arguments that can be pass to the DMLtest function
Value
a tibble with for each position the pvalue
.parse_yield
Parse features in a yield from a FASTQ stream reader
Description
Parse features in a yield from a FASTQ stream reader
Usage
.parse_yield(yield, features, min_read_length)
Arguments
yield
: A yield of ShortReads from a FastQStreamer.features
:read_features
to be parsed.min_read_length
: minimum length of readseq after parsing.
Value
a tibble with statistics for each read feature.
.trim_yield
Trim ShortReads in a yield of a FastQStreamer to keep only a readseq
feature of a read_features
object.
Description
Trim ShortReads in a yield of a FastQStreamer to keep only a readseq
feature of a read_features
object.
Usage
.trim_yield(yield, features, read_parsed)
Arguments
yield
: A yield of ShortReads from a FastQStreamer.features
:read_features
to be parsed.read_parsed
: minimum length of readseq after parsing.
Value
FastQStream yield with read sequences trimmed to the region corresponding to the readseq
feature of an read_features
object.
FFT_profile
FFT analysis applied to counts or metacounts to detect periodicity
Description
This function is a wrapper for the stats::fft
to detect
periodicity in the signal composed of counts (of RNA ends) on a given
region (or metacounts profile)
Usage
FFT_profile(counts, max_period = 25)
Arguments
counts
: table with positions (x) and values (counts)max_period
: results haivng a period over this threshold are discarded
Value
A tibble with signal
, freq
and period
columns.
FFT
FFT analysis over one or more regions
Description
This function computes the periodicity of the (meta)gene count profiles
Usage
FFT(
count_table = NULL,
region_tb,
by_sample = FALSE,
by_region = FALSE,
min_quantile = 0.25,
max_period = 25,
...
)
Arguments
count_table
: Count table (if not empty bam_files will be ignored)region_tb
: a tibble withrname
,strand
andposition
for all regions to extractby_sample
: if FALSE, the counts are aggregated (sum) over the different samplesby_region
: if FALSE, the counts are aggregated (median) over all regions aligned on the starting location of each region.min_quantile
: if by_region is FALSE and the regions are aggregated, all regions must be the same length. In this case, min_quantile is used to compute this lenght wich is given by the quantile of the distribution of all the region lentghts. By default, it is 0.25 which means that 75% of the regions are at least this length.max_period
: results haivng a period over this threshold are discarded...
: parameters of the quantify_from_bam_files function if no count table is provided
Value
A tibble with at least signal
, freq
and period
columns.
If by_sample
and/or by_region
is TRUE, the corresponding columns are also
present in the results, i.e. respectively sample_id
and/or
(rname, strand, location)
.
frame_preference
Global frame preference
Description
This function computes the global frame preference of the reads distribution by computing the number of counts falling into each translation frame.
Usage
frame_preference(
count_table = NULL,
region_tb,
by_sample = FALSE,
by_region = FALSE,
...
)
Arguments
count_table
: (tibble) Count table (if not emptybam_files
will be ignored)region_tb
: (character) Path of a gff fileby_sample
: if FALSE, the counts are aggregated (sum) over the different samplesby_region
: if FALSE, the counts are aggregated (median) over all regions aligned on the starting location of each region.
Value
A tibble with frame
and pc
columns.
init_read_features
Initialize a read_features
object.
Description
Initializes the read_features
object with the minimal feature readseq corresponding to the region in the read sequence that is expected to align to a reference sequence.
Usage
init_read_features(
start,
width,
pattern = "ACTG",
max_mismatch = 0,
read_features_file = NA
)
Arguments
start
: Relative start of the readseq feature.width
: Width of the readseq feature.pattern
: Characters allowed in the readseq sequence.max_mismatch
: Maximum number of mismatches allowed in the readseq sequence.read_features_file
: Path of the read_feature CSV file if already built and saved.
Seealso
add_read_feature for adding other features, parse_reads, demultiplex
Value
A read_features tibble with at least 1 row corresponding to the region of the read sequence that is to be aligned to reference sequences.
Examples
# only a small region is to be aligned
rf = init_read_features(start = 14, width = 19)
# the whole 100nt long read should be aligned to the reference
rf = init_read_features(start = 1, width = 100)
rf
load_bam_as_tibble
Bam file parser
Description
This function parse a given bam file and return different mapping informations in a tibble
Usage
load_bam_as_tibble(bam_file, features = NULL)
Arguments
bam_file
: (character) Path of the bam fileUMI_length
: (integer) Width of the UMI (if UMIs have not been put in the qname let this value default)
Value
(tibble) with mapping informations
map_fastq_files
Wrapper of the map_fastq_file function which allows the mapping of multiples FASTQ files.
Description
The function takes a vector of FASTQ file paths and others map_fastq parameters. All parameters can be replaced by a “mapping table” which is a tibble where each column correspond to a parameters of the map_fastq function.
Usage
map_fastq_files(
list_fastq_files,
fasta_name,
index_name,
max_map_mismatch = 0,
bam_dir = "",
aligner = "bowtie",
threads = 3,
map_unique = TRUE,
force = FALSE,
mapping_tb = NULL
)
Arguments
list_fastq_files
: (character) or (vector of character) Path of the FASTQ file (eventually gzipped) containing single-end EMOTE readsfasta_name
: (character) Path of the FASTA genome reference fileindex_name
: (character) Name of the index (index will be created if not exist)max_map_mismatch
: a int corresponding to the number of allowed mismatchbam_dir
: (character) Path of the output directoryaligner
: (character) Mapping aligner (either “bowtie” or “subread”)threads
: (integer) Number of threads usedforce
: (logical) If TRUE overwrite already existing bam filemapping_tb
: (tibble) where colnames = parameters of the map_fastq_file function
Value
(character) bam file created
map_fastq
Perform the mapping of a FASTQ file using either bowtie or subread method.
Description
The function takes a FASTQ file and a genome FASTA file. An index is created if not existing, reads are mapped on the given reference genome.
Usage
map_fastq(
fastq_file,
fasta_name,
index_name,
max_map_mismatch = 0,
bam_dir = "",
aligner = "bowtie",
threads = 3,
map_unique = TRUE,
force = FALSE
)
Arguments
fastq_file
: Path of the FASTQ file (eventually gzipped) containing single-end readsfasta_name
: Path of the FASTA genome reference fileindex_name
: Name of the index (index will be created if not exist)max_map_mismatch
: a int corresponding to the number of allowed mismatchbam_dir
: Path of the output directoryaligner
: Mapping aligner (either “bowtie” or “subread”)threads
: Number of threads usedforce
: If TRUE overwrite already existing bam file
Value
Path of BAM file created
map_stats_bam_files
Mapping stat
Description
This function parses the specified bam_files and returns simple stats such as the percentage of reads mapped. The result is saved in bam_file.map_stats.csv so that it is not done again on the same BAM file if it is older that the stats or force is set to FALSE.
Usage
map_stats_bam_files(bam_files, force = FALSE)
Arguments
bam_files
: vector of BAM filenamesforce
: (default FALSE) Force the computation of stats.
Value
tibble with mapping statistics
map_stats
Mapping stat
Description
This function parses the specified BAM FILE in bam_file and fastq_files and returns mapping stats such reads mapped. The result is saved in bam_file.map_stats.csv so that it is not done again on the same BAM file if it is older that the stats or force is set to FALSE.
Usage
map_stats(bam_file, force = FALSE)
Arguments
bam_file
: Path of the BAM filename.force
: (default FALSE) Force the computation of stats.
Value
mapping statistics.
metacounts_feature
retrieve normalized (meta)counts around given feature(s) specified by a
tibble/data.frame with their rname
, strand
and position.
Description
This function takes as input the features to extract and the
width
to consider around each feature specified by its rname
, strand
and position
.
Usage
metacounts_feature(
count_table = NULL,
feature_tb,
width = 100,
by_sample = FALSE,
by_feature = FALSE,
...
)
Arguments
count_table
: count table (if not emptybam_files
will be ignored)feature_tb
: a tibble withrname
,strand
andposition
for all regions to extractwidth
: positions to extract upstream and downstream of the featuresby_sample
: if FALSE, the counts are aggregated over the different samplesby_feature
: if FALSE, the counts are aggregated over all regions aligned on the starting location of the feature.
Value
A tibble with at least x
and count
columns. x
is the relative
distance to the location on the reference sequence rname
and strand
, and
count
is the normalized count on the region (counts sum is 1). If
by_sample
and/or by_feature
is TRUE, the corresponding columns are also
present in the results, i.e. respectively sample_id
and/or
(rname, strand, location)
.
metacounts_normalized
normalize (meta)counts on region(s) specified by strand, location and width
Description
This function takes a input one or more regions and normalizes the counts by region and sample_id so that they sum up to 1. It is used to balance the effect of highly transcribed over less transcribed regions, and also varying sequencing depths.
Usage
metacounts_normalized(
count_table,
rname,
strand,
location,
width,
around = FALSE,
by_sample = FALSE,
by_region = FALSE,
min_quantile = 0.25
)
Arguments
count_table
: count table as returned from quantification, i.e. a tibble with the following columns: sample_id, rname, position, strand, count.rname
: rname(s) one reference/chromosome name or a vector of names.strand
: strand(s) a vector of the same length as rname to specify each strand.location
: location(s) a vector of the same length as rname to specify each position on the reference.width
: width(s) a vector of the same length as rname to specify the width upstream and downstream the location (if param around is TRUE) or the length of the region (if param around is FALSE)0around
: specify if the location is the begninning of the region or its the center.by_sample
: if FALSE, the counts are aggregated (sum) over the different samplesby_region
: if FALSE, the counts are aggregated (median) over all regions aligned on the starting location of each region.min_quantile
: if by_region is FALSE and the regions are aggregated, all regions must be the same length. In this case, min_quantile is used to compute this lenght wich is given by the quantile of the distribution of all the region lentghts. By default, it is 0.25 which means that 75% of the regions are at least this length.
Value
A tibble with at least x
and count
columns. x
is the relative
distance to the location on the reference sequence rname
and strand
, and
count
is the normalized count on the region (counts sum is 1). If
by_sample
and/or by_region
is TRUE, the corresponding columns are also
present in the results, i.e. respectively sample_id
and/or
(rname, strand, location)
.
metacounts_region
extract normalized counts for a given feature and optionnaly aggregates them into metacounts
Description
This function takes a input one or more regions and normalizes the counts by region and sample_id so that they sum up to 1. It is used to balance the effect of highly transcribed over less transcribed regions, and also varying sequencing depths.
Usage
metacounts_region(
count_table = NULL,
region_tb,
by_sample = FALSE,
by_region = FALSE,
min_quantile = 0.25,
...
)
Arguments
count_table
: count table (if not empty bam_files will be ignored)region_tb
: a tibble withrname
,strand
andposition
for all regions to extractby_sample
: if FALSE, the counts are aggregated (sum) over the different samplesby_region
: if FALSE, the counts are aggregated (median) over all regions aligned on the starting location of each region.width
: positions to extract downstream of the specified start position
Value
A tibble with at least x and count columns. x is the relative
distance to the feature and count is normalized count on the region. If
by_sample
and/or by_region
is TRUE, the corresponding columns are also
present in the results, i.e. respectively sample_id
and/or
(rname, strand, location)
.
parse_reads
Parse each read from a FASTQ file to retrieve features in the read sequence and test their validity.
Description
Valid reads are written to a dedicated file and if keep_invalid_reads
is TRUE, invalid reads are written to fastq_file.invalid.fastq.gz
.
It also writes a feature report which contains features given in the features parameter and a parsing report which contains number of read with valid features.
Usage
parse_reads(
fastq_file,
features,
out_dir = NULL,
keep_invalid_reads = FALSE,
min_read_length = 18,
yieldsize = 1e+06,
write_read_parsing_tb = FALSE,
force = FALSE
)
Arguments
fastq_file
: Path of the FASTQ file (eventually gzipped) containing single-end readsfeatures
: read_features tibbleout_dir
: Path of the output directory where the demultiplexed FASTQ files will be storedkeep_invalid_reads
: If TRUE invalid reads will be kept in a dedicated filemin_read_length
: Minimum width of reads after parsingyieldsize
: Number of read to process per iterationwrite_read_parsing_tb
: writes statistics table to fileforce
: If TRUE recursively remove the content of out_dir before running instead of stopping the function with an error message
Value
A tibble with parsing statistics.
print.read_features
Utility function to display read_features.
Description
Utility function to display read_features.
Usage
# S3 method for read_features
print(rf)
Arguments
rf
:read_features
tibble
quantify_3prime
Function used to quantify the exact 3’-ends of mapped reads
Description
Caller of load_bam_as_tibble and summarise_alignements
Usage
quantify_3prime(
bam_file,
features = NULL,
keep_UMI = FALSE,
min_mapped_length = NA,
rev_comp_reads = FALSE,
force = FALSE
)
Arguments
bam_file
: (character) Path of the bam file containing single-end EMOTE reads mappedkeep_UMI
: (logical) If FALSE the count value corresponds to the number fo different UMI at an exact position
Value
a tibble with quantification of the reads in the given bam files, by exact position (and by UMI if UMI_length>0)
quantify_5prime
Function used to quantify the exact 5’-ends of mapped reads
Description
Caller of load_bam_as_tibble and summarise_alignements
Usage
quantify_5prime(bam_file, features = NULL, keep_UMI = FALSE, force = FALSE)
Arguments
keep_UMI
: (logical) If FALSE the count value corresponds to the number fo different UMI at an exact positionbam_files
: (character) Path of the bam file containing single-end EMOTE reads mappedUMI_length
: (integer) Width of UMI, default is 0 (if 0 the read id qname will be return)
Value
a tibble with quantification of the reads in the given bam files, by exact position (and by UMI if UMI_length>0)
quantify_downstream
Function used to quantify the exact 3’-ends of mapped reads in XU et al. 2023
Description
Function used to quantify the exact 3’-ends of mapped reads in XU et al. 2023
Usage
quantify_downstream(bam_file, umi.length, min.map.length, force = FALSE)
Arguments
bam_file
: (character) Path of the bam file containing single-end EMOTE reads mapped
Value
a tibble with quantification of the reads in the given bam files, by exact position (and by UMI if UMI_length>0)
quantify_from_bam_files
Wrapper of quantify_5prime and quantify_3prime quantification functions.
Description
This function takes a vector of bam files and call the “fun” quantification function. It allow the parralelization through the “cores” parameters
Usage
quantify_from_bam_files(bam_files, sample_ids, fun, ...)
Arguments
bam_files
: (vector of characters) Path of the bam files containing single-end EMOTE reads mappedsample_ids
: (vector of characters) corresponding to a string for identifying the source of readsfun
: (character) The quantification method used (quantify_3prime or quantify_5prime)...
: parameters to give to the “fun” functioncores
: (integer) Number of cores
Value
a tibble with quantification of the reads in the given bam files, by exact position (and by UMI if UMI_length>0)
read_feature_pattern_types
Read feature pattern types
Description
Enumeration of the allowed pattern types.
Usage
read_feature_pattern_types
Keyword
datasets
Format
An object of class character
of length 3.
Details
A vector of characters with 3 elements: nucleotides, constant and regexp:
nucleotides
: for random sequences composed of the specifies nucleotides (defaults are A, T, C , G) i.e the mappable readseq or the UMIs.constant
: for a constant string or a vector of constant strings, i.e. allowed barcodes for multiplexed samples or some control sequence that should be present in the reads. A maximum number of mismatches allowed can be specified.regepx
: for more complicated patterns that can occur anywhere such as polyA tails.
reads_parse_report_ggupset
Utility function to plot an upset plot from parse_reads
and demultiplex
reports using ggupset package.
Description
Utility function to plot an upset plot from parse_reads
and demultiplex
reports using ggupset package.
Usage
reads_parse_report_ggupset(report, angle = 45, margin = 0.1)
Arguments
report
: report returned byparse_reads
ordemultiplex
reads_parse_report_to_upsetr_expression
Utility function to convert parse_reads
and demultiplex
reports to
expression for UpSetR plot.
Description
Utility function to convert parse_reads
and demultiplex
reports to
expression for UpSetR plot.
Usage
reads_parse_report_to_upsetr_expression(report)
Arguments
report
: report returned byparse_reads
ordemultiplex
summarise_alignements
Parse mapping information retrieve by load_bam_as_tibble
Description
This function parse the output of load_bam_as_tibble and count the number of reads mapped at each exact genomic position
Usage
summarise_alignements(alns, UMI_length, keep_UMI)
Arguments
alns
: (tibble) Mapping table returned by load_bam_as_tibbleUMI_length
: (integer) Width of the UMI (if UMIs have not been put in the qname let this value default)keep_UMI
: (logical) If TRUE the result tibble will display the count value for each UMI at each exact position
Value
(tibble) count table
TSS_betabinomial
Computes a p-value for a single position using the betabinomial model as in TSS-EMOTE paper.
Description
Computes a p-value for a single position using the betabinomial model as in TSS-EMOTE paper.
Usage
TSS_betabinomial(k_total, k_mono, n_total, n_mono)
Arguments
k_total
: Counts on the +RppH (total) at a specific positionk_mono
: Counts on the -RppH (mono) at a specific positionn_total
: Total counts on the +RppH (total) in the RNA-Seq libraryn_mono
: Total counts on the -RppH (mono) in the RNA-Seq library
Value
The p-value computed for the given position.
TSS
Identification of Transcription Start Sites (TSS) as in TSS-EMOTE paper
Description
This function computes a p-value for each position present in a count table. P-values of replicate samples are combined and adjusted for multiple testing.
Usage
TSS(
count_table,
design_table,
combine_method = "fisher",
fdr_method = "BH",
threshold = 5
)
Arguments
count_table
: Count tabledesign_table
: Description of the samples. It must contains the following columns: (sample_id
,group
,replicate
,count_of
(total/fraction).combine_method
: P-values combination method to use, default is “fisher” (“fisher”, “z.transform”, “logit”)fdr_method
: P-values adjustment method, default is “BH” (“holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”,”fdr”, “none”))threshold
: Minimum counts in the total sample for the position to be considered
Value
A tibble containing the results for each position, i.e. with
columns position
, strand
, pvalue
, pvalue_adj
.