DESCRIPTION

Package: rnaends
Title: RNA-ends sequencing specialized package featuring pre-processing, mapping, quantification and analysis of 5'-end or 3'-end RNAs sequencing data
Version: 1.0.0
Authors@R: c(
    person("Roland", "Barriot", , "roland.barriot@gmail.com", role = c("aut", "cre"),
           comment = c(ORCID = "0000-0003-0298-2696")),
    person("Tomas", "Caetano", role = "aut",
           comment = c(ORCID = "0009-0008-4678-2796")),
    person("Gwennaele", "Fichant", role = "ctb",
           comment = c(ORCID = "0000-0001-5480-0239")),
    person("Peter", "Redder", role = "ctb",
           comment = c(ORCID = "0000-0002-3619-632X"))
  )
Description: The rnaends R package focuses on exact 5'-end and/or 3'-end quantification of RNAs via targeted RNA-Seq.
License: GPL (>= 3)
Imports: 
    tidyverse (>= 2.0.0),
    ggupset (>= 0.4),
    DSS (>= 2.48.0),
    Rbowtie (>= 1.42.0),
    Rsamtools (>= 2.16.0),
    Rsubread (>= 2.16.0),
    ShortRead (>= 1.58.0),
    survcomp (>= 1.52.0),
    VGAM (>= 1.1-9)
Remotes:
    bioc/Biostrings,
    bioc/DSS,
    bioc/Rbowtie,
    bioc/Rsamtools,
    bioc/Rsubread,
    bioc/ShortRead,
    bioc/survcomp
Suggests: 
    knitr (>= 1.49),
    rmarkdown (>= 2.29),
    testthat (>= 3.0.0)
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.3.2
LazyData: true
VignetteBuilder: knitr
Depends: 
    R (>= 4.4.2)

add_read_feature

Add a feature to the read_features object.

Description

This function adds a feature to the read_features tibble returned by init_read_features or add_read_feature.

Usage

add_read_feature(
  read_features,
  name,
  start,
  width,
  pattern,
  pattern_type = "constant",
  max_mismatch = 0,
  readid_prepend = FALSE
)

Arguments

  • read_features: read_features tibble

  • name: Name of the feature.

  • start: Relative start of the feature.

  • width: Width of the feature.

  • pattern: Characters allowed in the feature sequence.

  • pattern_type: Type of the pattern given: “nucleotides”, “constant” or “regexp”.

  • max_mismatch: Maximum number of mismatches allowed in the feature sequence.

  • readid_prepend: If TRUE the sequence of this feature will be prepended to the read id (in front of) in the FASTQ file.

Seealso

read_feature_pattern_types for pattern types, parse_reads, demultiplex

Value

The given read_features object with the added feature

demultiplex

Demultiplex a FASTQ file according to barcode sequence reads provided through a read_features object

Description

This function parses each read to retrieve features in the sequence and test their validity. Valid reads are written in separate files (fastq_file.barcode.fastq.gz) and if keep_invalid_reads = TRUE, invalid reads are write in fastq_file.invalid.fastq.gz It also writes a feature report which contains features given in the features parameter and a demultiplex report which contains number of valid features for each barcode.

Usage

demultiplex(
  fastq_file,
  features,
  out_dir = NULL,
  keep_invalid_reads = FALSE,
  yieldsize = 1e+06,
  min_read_length = 12,
  write_read_parsing_tb = FALSE,
  force = FALSE
)

Arguments

  • fastq_file: Path of the FASTQ file (eventually gzipped) containing single-end reads

  • features: read_features object

  • out_dir: Path of the output directory where the demultiplexed FASTQ files will be stored

  • keep_invalid_reads: If TRUE invalid reads will be kept in a dedicated file

  • yieldsize: Number of read to process per iteration

  • force: If TRUE recursively remove the content of out_dir before running instead of stopping the function with an error message

Value

A tibble with demultiplexing statistics

differential_proportion

Wrapper of differential_proportion_one_stranded function allow to compare proportion of both strand

Description

Wrapper of differential_proportion_one_stranded function allow to compare proportion of both strand

Usage

differential_proportion(
  count_table,
  design_matrix,
  group1,
  group2,
  cores = 1,
  ...
)

Arguments

  • count_table: Count table

  • design_matrix: Description of the sample_ids (it must contains a sample_id,group,replicate,count_of(total/fraction) columns)

  • group1: Name of strain 1

  • group2: Name of strain 2

  • cores: number of cores used in the mclappy function

  • ...: arguments that can be pass to the DMLtest function

Value

a tibble with for each position the pvalue

differential_proportion2

Statistical comparison of 2 sets of proportion values taking into account replicates

Description

Statistical comparison of 2 sets of proportion values taking into account replicates

Usage

differential_proportion2(count_table, design_matrix, group1, group2, ...)

Arguments

  • count_table: Count table

  • design_matrix: Description of the samples (it must contains a sample_id, group, replicate, count_of(fraction/total) columns)

  • group1: Name of strain 1

  • group2: Name of strain 2

  • ...: arguments that can be pass to the DMLtest function

Value

a tibble with for each position the pvalue

.parse_yield

Parse features in a yield from a FASTQ stream reader

Description

Parse features in a yield from a FASTQ stream reader

Usage

.parse_yield(yield, features, min_read_length)

Arguments

  • yield: A yield of ShortReads from a FastQStreamer.

  • features: read_features to be parsed.

  • min_read_length: minimum length of readseq after parsing.

Value

a tibble with statistics for each read feature.

.trim_yield

Trim ShortReads in a yield of a FastQStreamer to keep only a readseq feature of a read_features object.

Description

Trim ShortReads in a yield of a FastQStreamer to keep only a readseq feature of a read_features object.

Usage

.trim_yield(yield, features, read_parsed)

Arguments

  • yield: A yield of ShortReads from a FastQStreamer.

  • features: read_features to be parsed.

  • read_parsed: minimum length of readseq after parsing.

Value

FastQStream yield with read sequences trimmed to the region corresponding to the readseq feature of an read_features object.

FFT_profile

FFT analysis applied to counts or metacounts to detect periodicity

Description

This function is a wrapper for the stats::fft to detect periodicity in the signal composed of counts (of RNA ends) on a given region (or metacounts profile)

Usage

FFT_profile(counts, max_period = 25)

Arguments

  • counts: table with positions (x) and values (counts)

  • max_period: results haivng a period over this threshold are discarded

Value

A tibble with signal, freq and period columns.

FFT

FFT analysis over one or more regions

Description

This function computes the periodicity of the (meta)gene count profiles

Usage

FFT(
  count_table = NULL,
  region_tb,
  by_sample = FALSE,
  by_region = FALSE,
  min_quantile = 0.25,
  max_period = 25,
  ...
)

Arguments

  • count_table: Count table (if not empty bam_files will be ignored)

  • region_tb: a tibble with rname, strand and position for all regions to extract

  • by_sample: if FALSE, the counts are aggregated (sum) over the different samples

  • by_region: if FALSE, the counts are aggregated (median) over all regions aligned on the starting location of each region.

  • min_quantile: if by_region is FALSE and the regions are aggregated, all regions must be the same length. In this case, min_quantile is used to compute this lenght wich is given by the quantile of the distribution of all the region lentghts. By default, it is 0.25 which means that 75% of the regions are at least this length.

  • max_period: results haivng a period over this threshold are discarded

  • ...: parameters of the quantify_from_bam_files function if no count table is provided

Value

A tibble with at least signal, freq and period columns. If by_sample and/or by_region is TRUE, the corresponding columns are also present in the results, i.e. respectively sample_id and/or (rname, strand, location).

frame_preference

Global frame preference

Description

This function computes the global frame preference of the reads distribution by computing the number of counts falling into each translation frame.

Usage

frame_preference(
  count_table = NULL,
  region_tb,
  by_sample = FALSE,
  by_region = FALSE,
  ...
)

Arguments

  • count_table: (tibble) Count table (if not empty bam_files will be ignored)

  • region_tb: (character) Path of a gff file

  • by_sample: if FALSE, the counts are aggregated (sum) over the different samples

  • by_region: if FALSE, the counts are aggregated (median) over all regions aligned on the starting location of each region.

Value

A tibble with frame and pc columns.

init_read_features

Initialize a read_features object.

Description

Initializes the read_features object with the minimal feature readseq corresponding to the region in the read sequence that is expected to align to a reference sequence.

Usage

init_read_features(
  start,
  width,
  pattern = "ACTG",
  max_mismatch = 0,
  read_features_file = NA
)

Arguments

  • start: Relative start of the readseq feature.

  • width: Width of the readseq feature.

  • pattern: Characters allowed in the readseq sequence.

  • max_mismatch: Maximum number of mismatches allowed in the readseq sequence.

  • read_features_file: Path of the read_feature CSV file if already built and saved.

Seealso

add_read_feature for adding other features, parse_reads, demultiplex

Value

A read_features tibble with at least 1 row corresponding to the region of the read sequence that is to be aligned to reference sequences.

Examples

# only a small region is to be aligned
rf = init_read_features(start = 14, width = 19)
# the whole 100nt long read should be aligned to the reference
rf = init_read_features(start = 1, width = 100)
rf

load_bam_as_tibble

Bam file parser

Description

This function parse a given bam file and return different mapping informations in a tibble

Usage

load_bam_as_tibble(bam_file, features = NULL)

Arguments

  • bam_file: (character) Path of the bam file

  • UMI_length: (integer) Width of the UMI (if UMIs have not been put in the qname let this value default)

Value

(tibble) with mapping informations

map_fastq_files

Wrapper of the map_fastq_file function which allows the mapping of multiples FASTQ files.

Description

The function takes a vector of FASTQ file paths and others map_fastq parameters. All parameters can be replaced by a “mapping table” which is a tibble where each column correspond to a parameters of the map_fastq function.

Usage

map_fastq_files(
  list_fastq_files,
  fasta_name,
  index_name,
  max_map_mismatch = 0,
  bam_dir = "",
  aligner = "bowtie",
  threads = 3,
  map_unique = TRUE,
  force = FALSE,
  mapping_tb = NULL
)

Arguments

  • list_fastq_files: (character) or (vector of character) Path of the FASTQ file (eventually gzipped) containing single-end EMOTE reads

  • fasta_name: (character) Path of the FASTA genome reference file

  • index_name: (character) Name of the index (index will be created if not exist)

  • max_map_mismatch: a int corresponding to the number of allowed mismatch

  • bam_dir: (character) Path of the output directory

  • aligner: (character) Mapping aligner (either “bowtie” or “subread”)

  • threads: (integer) Number of threads used

  • force: (logical) If TRUE overwrite already existing bam file

  • mapping_tb: (tibble) where colnames = parameters of the map_fastq_file function

Value

(character) bam file created

map_fastq

Perform the mapping of a FASTQ file using either bowtie or subread method.

Description

The function takes a FASTQ file and a genome FASTA file. An index is created if not existing, reads are mapped on the given reference genome.

Usage

map_fastq(
  fastq_file,
  fasta_name,
  index_name,
  max_map_mismatch = 0,
  bam_dir = "",
  aligner = "bowtie",
  threads = 3,
  map_unique = TRUE,
  force = FALSE
)

Arguments

  • fastq_file: Path of the FASTQ file (eventually gzipped) containing single-end reads

  • fasta_name: Path of the FASTA genome reference file

  • index_name: Name of the index (index will be created if not exist)

  • max_map_mismatch: a int corresponding to the number of allowed mismatch

  • bam_dir: Path of the output directory

  • aligner: Mapping aligner (either “bowtie” or “subread”)

  • threads: Number of threads used

  • force: If TRUE overwrite already existing bam file

Value

Path of BAM file created

map_stats_bam_files

Mapping stat

Description

This function parses the specified bam_files and returns simple stats such as the percentage of reads mapped. The result is saved in bam_file.map_stats.csv so that it is not done again on the same BAM file if it is older that the stats or force is set to FALSE.

Usage

map_stats_bam_files(bam_files, force = FALSE)

Arguments

  • bam_files: vector of BAM filenames

  • force: (default FALSE) Force the computation of stats.

Value

tibble with mapping statistics

map_stats

Mapping stat

Description

This function parses the specified BAM FILE in bam_file and fastq_files and returns mapping stats such reads mapped. The result is saved in bam_file.map_stats.csv so that it is not done again on the same BAM file if it is older that the stats or force is set to FALSE.

Usage

map_stats(bam_file, force = FALSE)

Arguments

  • bam_file: Path of the BAM filename.

  • force: (default FALSE) Force the computation of stats.

Value

mapping statistics.

metacounts_feature

retrieve normalized (meta)counts around given feature(s) specified by a tibble/data.frame with their rname, strand and position.

Description

This function takes as input the features to extract and the width to consider around each feature specified by its rname, strandand position.

Usage

metacounts_feature(
  count_table = NULL,
  feature_tb,
  width = 100,
  by_sample = FALSE,
  by_feature = FALSE,
  ...
)

Arguments

  • count_table: count table (if not empty bam_files will be ignored)

  • feature_tb: a tibble with rname, strand and position for all regions to extract

  • width: positions to extract upstream and downstream of the features

  • by_sample: if FALSE, the counts are aggregated over the different samples

  • by_feature: if FALSE, the counts are aggregated over all regions aligned on the starting location of the feature.

Value

A tibble with at least x and count columns. x is the relative distance to the location on the reference sequence rname and strand, and count is the normalized count on the region (counts sum is 1). If by_sample and/or by_feature is TRUE, the corresponding columns are also present in the results, i.e. respectively sample_id and/or (rname, strand, location).

metacounts_normalized

normalize (meta)counts on region(s) specified by strand, location and width

Description

This function takes a input one or more regions and normalizes the counts by region and sample_id so that they sum up to 1. It is used to balance the effect of highly transcribed over less transcribed regions, and also varying sequencing depths.

Usage

metacounts_normalized(
  count_table,
  rname,
  strand,
  location,
  width,
  around = FALSE,
  by_sample = FALSE,
  by_region = FALSE,
  min_quantile = 0.25
)

Arguments

  • count_table: count table as returned from quantification, i.e. a tibble with the following columns: sample_id, rname, position, strand, count.

  • rname: rname(s) one reference/chromosome name or a vector of names.

  • strand: strand(s) a vector of the same length as rname to specify each strand.

  • location: location(s) a vector of the same length as rname to specify each position on the reference.

  • width: width(s) a vector of the same length as rname to specify the width upstream and downstream the location (if param around is TRUE) or the length of the region (if param around is FALSE)0

  • around: specify if the location is the begninning of the region or its the center.

  • by_sample: if FALSE, the counts are aggregated (sum) over the different samples

  • by_region: if FALSE, the counts are aggregated (median) over all regions aligned on the starting location of each region.

  • min_quantile: if by_region is FALSE and the regions are aggregated, all regions must be the same length. In this case, min_quantile is used to compute this lenght wich is given by the quantile of the distribution of all the region lentghts. By default, it is 0.25 which means that 75% of the regions are at least this length.

Value

A tibble with at least x and count columns. x is the relative distance to the location on the reference sequence rname and strand, and count is the normalized count on the region (counts sum is 1). If by_sample and/or by_region is TRUE, the corresponding columns are also present in the results, i.e. respectively sample_id and/or (rname, strand, location).

metacounts_region

extract normalized counts for a given feature and optionnaly aggregates them into metacounts

Description

This function takes a input one or more regions and normalizes the counts by region and sample_id so that they sum up to 1. It is used to balance the effect of highly transcribed over less transcribed regions, and also varying sequencing depths.

Usage

metacounts_region(
  count_table = NULL,
  region_tb,
  by_sample = FALSE,
  by_region = FALSE,
  min_quantile = 0.25,
  ...
)

Arguments

  • count_table: count table (if not empty bam_files will be ignored)

  • region_tb: a tibble with rname, strand and position for all regions to extract

  • by_sample: if FALSE, the counts are aggregated (sum) over the different samples

  • by_region: if FALSE, the counts are aggregated (median) over all regions aligned on the starting location of each region.

  • width: positions to extract downstream of the specified start position

Value

A tibble with at least x and count columns. x is the relative distance to the feature and count is normalized count on the region. If by_sample and/or by_region is TRUE, the corresponding columns are also present in the results, i.e. respectively sample_id and/or (rname, strand, location).

parse_reads

Parse each read from a FASTQ file to retrieve features in the read sequence and test their validity.

Description

Valid reads are written to a dedicated file and if keep_invalid_reads is TRUE, invalid reads are written to fastq_file.invalid.fastq.gz. It also writes a feature report which contains features given in the features parameter and a parsing report which contains number of read with valid features.

Usage

parse_reads(
  fastq_file,
  features,
  out_dir = NULL,
  keep_invalid_reads = FALSE,
  min_read_length = 18,
  yieldsize = 1e+06,
  write_read_parsing_tb = FALSE,
  force = FALSE
)

Arguments

  • fastq_file: Path of the FASTQ file (eventually gzipped) containing single-end reads

  • features: read_features tibble

  • out_dir: Path of the output directory where the demultiplexed FASTQ files will be stored

  • keep_invalid_reads: If TRUE invalid reads will be kept in a dedicated file

  • min_read_length: Minimum width of reads after parsing

  • yieldsize: Number of read to process per iteration

  • write_read_parsing_tb: writes statistics table to file

  • force: If TRUE recursively remove the content of out_dir before running instead of stopping the function with an error message

Value

A tibble with parsing statistics.

quantify_3prime

Function used to quantify the exact 3’-ends of mapped reads

Description

Caller of load_bam_as_tibble and summarise_alignements

Usage

quantify_3prime(
  bam_file,
  features = NULL,
  keep_UMI = FALSE,
  min_mapped_length = NA,
  rev_comp_reads = FALSE,
  force = FALSE
)

Arguments

  • bam_file: (character) Path of the bam file containing single-end EMOTE reads mapped

  • keep_UMI: (logical) If FALSE the count value corresponds to the number fo different UMI at an exact position

Value

a tibble with quantification of the reads in the given bam files, by exact position (and by UMI if UMI_length>0)

quantify_5prime

Function used to quantify the exact 5’-ends of mapped reads

Description

Caller of load_bam_as_tibble and summarise_alignements

Usage

quantify_5prime(bam_file, features = NULL, keep_UMI = FALSE, force = FALSE)

Arguments

  • keep_UMI: (logical) If FALSE the count value corresponds to the number fo different UMI at an exact position

  • bam_files: (character) Path of the bam file containing single-end EMOTE reads mapped

  • UMI_length: (integer) Width of UMI, default is 0 (if 0 the read id qname will be return)

Value

a tibble with quantification of the reads in the given bam files, by exact position (and by UMI if UMI_length>0)

quantify_downstream

Function used to quantify the exact 3’-ends of mapped reads in XU et al. 2023

Description

Function used to quantify the exact 3’-ends of mapped reads in XU et al. 2023

Usage

quantify_downstream(bam_file, umi.length, min.map.length, force = FALSE)

Arguments

  • bam_file: (character) Path of the bam file containing single-end EMOTE reads mapped

Value

a tibble with quantification of the reads in the given bam files, by exact position (and by UMI if UMI_length>0)

quantify_from_bam_files

Wrapper of quantify_5prime and quantify_3prime quantification functions.

Description

This function takes a vector of bam files and call the “fun” quantification function. It allow the parralelization through the “cores” parameters

Usage

quantify_from_bam_files(bam_files, sample_ids, fun, ...)

Arguments

  • bam_files: (vector of characters) Path of the bam files containing single-end EMOTE reads mapped

  • sample_ids: (vector of characters) corresponding to a string for identifying the source of reads

  • fun: (character) The quantification method used (quantify_3prime or quantify_5prime)

  • ...: parameters to give to the “fun” function

  • cores: (integer) Number of cores

Value

a tibble with quantification of the reads in the given bam files, by exact position (and by UMI if UMI_length>0)

read_feature_pattern_types

Read feature pattern types

Description

Enumeration of the allowed pattern types.

Usage

read_feature_pattern_types

Keyword

datasets

Format

An object of class character of length 3.

Details

A vector of characters with 3 elements: nucleotides, constant and regexp:

  • nucleotides: for random sequences composed of the specifies nucleotides (defaults are A, T, C , G) i.e the mappable readseq or the UMIs.

  • constant: for a constant string or a vector of constant strings, i.e. allowed barcodes for multiplexed samples or some control sequence that should be present in the reads. A maximum number of mismatches allowed can be specified.

  • regepx: for more complicated patterns that can occur anywhere such as polyA tails.

reads_parse_report_ggupset

Utility function to plot an upset plot from parse_reads and demultiplex reports using ggupset package.

Description

Utility function to plot an upset plot from parse_reads and demultiplexreports using ggupset package.

Usage

reads_parse_report_ggupset(report, angle = 45, margin = 0.1)

Arguments

  • report: report returned by parse_reads or demultiplex

reads_parse_report_to_upsetr_expression

Utility function to convert parse_reads and demultiplex reports to expression for UpSetR plot.

Description

Utility function to convert parse_reads and demultiplex reports to expression for UpSetR plot.

Usage

reads_parse_report_to_upsetr_expression(report)

Arguments

  • report: report returned by parse_reads or demultiplex

summarise_alignements

Parse mapping information retrieve by load_bam_as_tibble

Description

This function parse the output of load_bam_as_tibble and count the number of reads mapped at each exact genomic position

Usage

summarise_alignements(alns, UMI_length, keep_UMI)

Arguments

  • alns: (tibble) Mapping table returned by load_bam_as_tibble

  • UMI_length: (integer) Width of the UMI (if UMIs have not been put in the qname let this value default)

  • keep_UMI: (logical) If TRUE the result tibble will display the count value for each UMI at each exact position

Value

(tibble) count table

TSS_betabinomial

Computes a p-value for a single position using the betabinomial model as in TSS-EMOTE paper.

Description

Computes a p-value for a single position using the betabinomial model as in TSS-EMOTE paper.

Usage

TSS_betabinomial(k_total, k_mono, n_total, n_mono)

Arguments

  • k_total: Counts on the +RppH (total) at a specific position

  • k_mono: Counts on the -RppH (mono) at a specific position

  • n_total: Total counts on the +RppH (total) in the RNA-Seq library

  • n_mono: Total counts on the -RppH (mono) in the RNA-Seq library

Value

The p-value computed for the given position.

TSS

Identification of Transcription Start Sites (TSS) as in TSS-EMOTE paper

Description

This function computes a p-value for each position present in a count table. P-values of replicate samples are combined and adjusted for multiple testing.

Usage

TSS(
  count_table,
  design_table,
  combine_method = "fisher",
  fdr_method = "BH",
  threshold = 5
)

Arguments

  • count_table: Count table

  • design_table: Description of the samples. It must contains the following columns: (sample_id, group, replicate, count_of(total/fraction).

  • combine_method: P-values combination method to use, default is “fisher” (“fisher”, “z.transform”, “logit”)

  • fdr_method: P-values adjustment method, default is “BH” (“holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”,”fdr”, “none”))

  • threshold: Minimum counts in the total sample for the position to be considered

Value

A tibble containing the results for each position, i.e. with

columns position, strand, pvalue, pvalue_adj.