--- title: "TSS identification using 5'-end RNA-Seq data" output: html_document: keep_md: true vignette: > %\VignetteIndexEntry{TSS_identification} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} editor_options: chunk_output_type: inline --- # TSS identification using 5'-end RNA-Seq data Complete source of this page is available in the notebook [TSS_identification.Rmd](https://gitlab.com/rnaends/rnaends/-/blob/main/vignettes/TSS_identification.Rmd). ``` r library(tidyverse) ``` ``` r library(rnaends) ``` Based on the publication from Prados J, Linder P, Redder P (2016) TSS-EMOTE, a refined protocol for a more complete and less biased global mapping of transcription start sites in bacterial pathogens. BMC Genomics, 17, 849. [https://doi.org/10.1186/s12864-016-3211-3](https://doi.org/10.1186/s12864-016-3211-3). The library preparation protocol is as follows. Briefly, in this protocol, total RNA is purified and digested with a 5’-exonuclease which removes the vast majority of 5’ mono-phosphorylated RNAs. The RNA is mixed with a synthetic RNA oligo (specific adapter) and split into two pools. In the first pool (noted +RppH), both a ligase and an RNA 5’ pyrophosphohydrolase RppH enzymes are added, while in the second pool (labeled -RppH) only the ligase is added. RppH converts the 5’PPP to 5’P that can be used by the ligase to add the adapter oligo to the RNA sequences (for detailed protocol see ([Prados et al., 2016](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-3211-3))). The “-RppH” pool serves as control to measure the level of background ligation and avoid false positive TSS detection. Thus, a given genome position will be predicted as a TSS if the number of reads from the “+RppH” pool mapping at that position is significantly higher than the number of reads from the “-RppH” pool by applying a beta-binomial test as described in (Prados et al., 2016) ## Download datasets Data from this article are available on Gene Expression Omnibus (GEO) database: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85110 For this notebook, only 4 samples will be used that correspond to 5'-end RNA-Seq data from *Staphylococcus aureus* (MW2 strain) grown on RPMI liquid at 37°C (GSM2257851, GSM2257852, GSM2257859, GSM2257860). We download data directly in a the *data* directory of the GitLab project in the sub-directory called *TSS-EMOTE_Prados* where the reference genome is already provided when the project is cloned. ``` r options(timeout = 3600) # for download time, set more if timeout occurs list_SRR <- c("SRR3994381", "SRR3994382", "SRR3994390", "SRR3994389") for (i in list_SRR){ raw_fastq <- paste0("../extdata/TSS-EMOTE_Prados/", i, ".fastq.gz") if (!file.exists(raw_fastq)) download.file(paste0("https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=",i), destfile = raw_fastq) } ``` ## Pre-processing of 5'-end RNA-Seq reads ### Parse reads (+ read features) Reads are not ready-to-map yet since they still have an adapter (the RP6 oligonucleotide) located at the 5'-end of each read. We use `parse_reads`, in order to check that each read contains the expected elements in the adapter, the remove this oligonucleotide from the reads (see RP6 oligo TSS-EMOTE). `parse_reads` requires a `read_features` table which defines the expected elements on the read. Following the Materials & Methods section of the article, the structure of the reads can be schematized as follows (X is the mappable part): ``` 1 2 5 6 20 21 27 28 30 31 50 N - (EMOTE barcode) - CGGCACCAACCGAGG - VVVVVVV - CGC - XXXXXXXXXXXXXXXXXXXXX 1 random nt - EMOTE barcode - recognition.seq - UMI - control.seq - RNA 5'-end ``` We use `init_read_features` and `add_read_feature` functions to create the `read_features` table following the above structure. The initialization of the `read_features` table is made by `init_read_features` since the only required feature is the *readseq* feature (the mappable part to keep in the FASTQ file). According to our scheme, we know that the *readseq* part begins at location 31 and are 20 nucleotides long (illumina 50bp). ``` r rf <- init_read_features(start = 31, width = 20) ``` We add other features such as the *barcode* which begins at position 2 and are 4 bases long with the following specifications: - *pattern* : enables to specify which barcode sequence(s) are allowed, here we allowed 4 different sequences - *pattern_type*: is used to know which type of pattern it is, "constant" means that we check the exact *pattern* sequences allowing *N* number of mismatches (specified by the *max_mismatch* parameter) - *readid_prepend*: is used to store the retrieved feature in the read identifier. It will be useful to keep a track of features since they are removed from the reads and therefore absent from the FASTQ output file. ``` r rf <- add_read_feature(rf, name = "barcode", start = 2, width = 4, pattern = c("TCGG","CAAG","CGTC","AAGT"), pattern_type = "constant", max_mismatch = 0, readid_prepend = FALSE) ``` The *recognition sequence* feature begins at position 6 and is 15 bases long, and we expect a single unique sequence with maximum 1 mismatch: ``` r rf <- add_read_feature(rf, name = "RS", start = 6, width = 15, pattern = "CGGCACCAACCGAGG", pattern_type = "constant", max_mismatch = 1, readid_prepend = FALSE) ``` The *UMI* feature begins at position 21 and is 7 bases long. Its pattern type is different from the previous ones because: as UMIs are random sequences, we do not expect a particular sequence (nor list of sequences) but a sequence of random characters (A, C or G) hence the *patern_type = "nucleotides"*. We set *readid_prepend = TRUE* because we will need the UMIs to remove PCR duplicates introduced by the library preparation protocol. **Note:** the *readseq* feature has the same *pattern_type* than UMI and is set by default to A,C,T or G. ``` r rf <- add_read_feature(rf, name = "UMI", start = 21, width = 7, pattern = "ACG", pattern_type = "nucleotides", max_mismatch = 0, readid_prepend = TRUE) ``` The *control sequence* feature begins at position 28 and is 3 bases long, and we expect a specific and unique sequence: ``` r rf <- add_read_feature(rf, name = "CS", start = 28, width = 3, pattern = "CGC", pattern_type = "constant", max_mismatch = 0, readid_prepend = FALSE) ``` The resulting *read_features* table is: ``` r rf %>% knitr::kable() ``` |name | start| width|pattern_type |pattern | max_mismatch|readid_prepend | |:-------|-----:|-----:|:------------|:------------|------------:|:--------------| |readseq | 31| 20|nucleotides |ACTG | 0|FALSE | |barcode | 2| 4|constant |TCGG, CA.... | 0|FALSE | |RS | 6| 15|constant |CGGCACCA.... | 1|FALSE | |UMI | 21| 7|nucleotides |ACG | 0|TRUE | |CS | 28| 3|constant |CGC | 0|FALSE | Now that we have built the *read_features* table, we can parse the 4 FASTQ files with the `parse_reads` function. **Note:** `mclapply` function from the `parallel` package is used to perform all the 4 processing in parallel but `lapply` can be used instead. ``` r parsing_report <- lapply( list.files("../extdata/TSS-EMOTE_Prados/", pattern = ".fastq.gz$", full.names = TRUE), FUN = parse_reads, features = rf, # The read_features table keep_invalid_reads = FALSE, # do not store invalid reads out_dir = "../extdata/TSS-EMOTE_Prados/parse_results", # Path of the out directory (optional) force = FALSE ) #> Demux report already exists set force=T to overwrite: ../extdata/TSS-EMOTE_Prados/parse_results/SRR3994381_parse_report.csv #> Demux report already exists set force=T to overwrite: ../extdata/TSS-EMOTE_Prados/parse_results/SRR3994382_parse_report.csv #> Demux report already exists set force=T to overwrite: ../extdata/TSS-EMOTE_Prados/parse_results/SRR3994389_parse_report.csv #> Demux report already exists set force=T to overwrite: ../extdata/TSS-EMOTE_Prados/parse_results/SRR3994390_parse_report.csv ``` ``` r parsing_report %>% bind_rows %>% knitr::kable() ``` |is_valid_readseq |is_valid_barcode |is_valid_RS |is_valid_UMI |is_valid_CS | count| |:----------------|:----------------|:-----------|:------------|:-----------|-------:| |FALSE |FALSE |TRUE |TRUE |FALSE | 1| |FALSE |TRUE |TRUE |TRUE |TRUE | 1444| |TRUE |FALSE |FALSE |TRUE |FALSE | 272| |TRUE |FALSE |TRUE |TRUE |FALSE | 2668| |TRUE |TRUE |TRUE |TRUE |TRUE | 5189855| |FALSE |FALSE |FALSE |TRUE |FALSE | 3| |FALSE |FALSE |TRUE |TRUE |FALSE | 3| |FALSE |TRUE |TRUE |TRUE |TRUE | 2565| |TRUE |FALSE |FALSE |TRUE |FALSE | 3166| |TRUE |FALSE |TRUE |TRUE |FALSE | 21274| |TRUE |TRUE |TRUE |TRUE |FALSE | 1320| |TRUE |TRUE |TRUE |TRUE |TRUE | 8561052| |FALSE |FALSE |FALSE |TRUE |FALSE | 4| |FALSE |TRUE |TRUE |TRUE |TRUE | 2378| |TRUE |FALSE |FALSE |TRUE |FALSE | 2789| |TRUE |FALSE |TRUE |TRUE |FALSE | 1308| |TRUE |TRUE |TRUE |TRUE |TRUE | 3177298| |FALSE |FALSE |FALSE |TRUE |FALSE | 1| |FALSE |FALSE |TRUE |TRUE |FALSE | 1| |FALSE |TRUE |TRUE |TRUE |TRUE | 2876| |TRUE |FALSE |FALSE |TRUE |FALSE | 2924| |TRUE |FALSE |TRUE |TRUE |FALSE | 1237| |TRUE |TRUE |TRUE |TRUE |TRUE | 3774778| The parsing_report provides information on the validity of reads in each FASTQ file. We can aggregate and visualize the global statititics for all reports with an upset plot: ``` r parsing_report %>% bind_rows %>% group_by(across(starts_with("is_valid_"))) %>% summarise(count = sum(count), .groups = "drop") %>% reads_parse_report_ggupset(angle = 45, margin=0.3) ``` ![](../../vignettes/TSS_identification_files/figure-html/unnamed-chunk-12-1.png) ## Quantification ### Mapping Reads are now ready-to-map on the reference genome using the `map_fastq_files` function. **Note:** here, we use the function that allow the mapping of multiple FASTQ files in parallel which is a wrapper that calls the `map_fastq` function for each of the FASTQ files provided. ``` r bf <- map_fastq_files( list_fastq_files = list.files("../extdata/TSS-EMOTE_Prados/parse_results", pattern = "_valid.fastq.gz$", full.names = TRUE), index_name = "../extdata/TSS-EMOTE_Prados/MW2.sr", # Path of the genome index (created if not exists) fasta_name = "../extdata/TSS-EMOTE_Prados/MW2.fasta", # Path of the reference genome file bam_dir = "../extdata/TSS-EMOTE_Prados/mapping_results_subread", # Output directory max_map_mismatch = 1, aligner = "subread", # either bowtie or subread map_unique = FALSE, force = FALSE, threads = 3) #> bam_file '../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994381_valid.fastq.gz.bam' already exists, set force=TRUE to recompute #> bam_file '../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994382_valid.fastq.gz.bam' already exists, set force=TRUE to recompute #> bam_file '../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994389_valid.fastq.gz.bam' already exists, set force=TRUE to recompute #> bam_file '../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994390_valid.fastq.gz.bam' already exists, set force=TRUE to recompute bf #> [[1]] #> [1] "../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994381_valid.fastq.gz.bam" #> #> [[2]] #> [1] "../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994382_valid.fastq.gz.bam" #> #> [[3]] #> [1] "../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994389_valid.fastq.gz.bam" #> #> [[4]] #> [1] "../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994390_valid.fastq.gz.bam" ``` #### Mapping statistics Use the `map_stats_bam_files` function to get some basic statistics on the mapping we've just performed: ``` r map_stats_bam_files(bam_files = bf) %>% knitr::kable() #> BAM file mapping stats exists '../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994381_valid.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute #> BAM file mapping stats exists '../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994382_valid.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute #> BAM file mapping stats exists '../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994389_valid.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute #> BAM file mapping stats exists '../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994390_valid.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute ``` |bam_file | mapped| unmapped| pc_mapped| |:-----------------------------|-------:|--------:|---------:| |SRR3994381_valid.fastq.gz.bam | 2001012| 3188843| 38.56| |SRR3994382_valid.fastq.gz.bam | 3436440| 5124612| 40.14| |SRR3994389_valid.fastq.gz.bam | 781940| 2395358| 24.61| |SRR3994390_valid.fastq.gz.bam | 1433740| 2341038| 37.98| ### Quantification (counts computation) In order to obtain counts per position, we use the `quantify_from_bam_files` function to parse previously obtained `bam_files`. This function allows the quantification of either the 5'-ends (`quantify_5prime`) or the 3'-ends (`quantify_3prime`), in both cases, it is an exact position quantification. For UMIs to be taken into account for the quantification, we provide the `read_features` table with the *"features"* parameter. The *keep_UMI* parameter is used to record the number of copy of each UMI by exact position. This level of information is not needed for this analysis and we use it as default (FALSE). ``` r count_table <- quantify_from_bam_files(bam_files = bf, sample_ids = c("rep1-", "rep1+", "rep2-", "rep2+"), fun = quantify_5prime, # Either quantify_5prime or quantify_3prime features = rf, keep_UMI = FALSE) #> Quantification 5' output file exists, set force=T to overwrite: ../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994381_valid.fastq.gz.bam.csv #> Quantification 5' output file exists, set force=T to overwrite: ../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994382_valid.fastq.gz.bam.csv #> Quantification 5' output file exists, set force=T to overwrite: ../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994389_valid.fastq.gz.bam.csv #> Quantification 5' output file exists, set force=T to overwrite: ../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994390_valid.fastq.gz.bam.csv count_table %>% head(15) %>% knitr::kable() ``` |sample_id |rname | position|strand | reads| count| |:---------|:-----------|--------:|:------|-----:|-----:| |rep1- |NC_003923.1 | 313|+ | 2| 2| |rep1- |NC_003923.1 | 425|+ | 34| 9| |rep1- |NC_003923.1 | 535|+ | 11| 1| |rep1- |NC_003923.1 | 637|+ | 1| 1| |rep1- |NC_003923.1 | 689|+ | 1| 1| |rep1- |NC_003923.1 | 791|+ | 2| 1| |rep1- |NC_003923.1 | 868|+ | 2| 2| |rep1- |NC_003923.1 | 911|+ | 2| 1| |rep1- |NC_003923.1 | 918|+ | 6| 1| |rep1- |NC_003923.1 | 1009|+ | 3| 1| |rep1- |NC_003923.1 | 1166|+ | 2| 1| |rep1- |NC_003923.1 | 1309|+ | 10| 1| |rep1- |NC_003923.1 | 1310|+ | 3| 1| |rep1- |NC_003923.1 | 1357|- | 1| 1| |rep1- |NC_003923.1 | 1890|+ | 1| 1| The above *count table* will allow us to perform some downstream analysis, *i.e.* TSS identification. **Note:** here, the *count* column corresponds to the number of different UMIs mapped at this exact position. This is why it is lower or equal to the *reads* values (due to PCR duplicates). ## Downstream analysis: TSS identification ### Design table The identification of TSS requires the comparison of 5'P and 5'P+5'PPP samples with multiple replicates. We thus build a `design table` to identify which samples is what. The content of the design table is free as it includes the colnames: - *sample_id* (same as in the count table) - *pool* (library preparation for sequencing either "*mono*" for 5'P or "*total*" for 5'P+5'PPP) - *replicate* (p-values are combined by replicate)
``` r design_tb <- tibble( sample_id = c("rep1-", "rep1+", "rep2-", "rep2+"), count_of = c("fraction", "total", "fraction", "total" ), replicate = c(1, 1, 2, 2) ) design_tb %>% knitr::kable() ``` |sample_id |count_of | replicate| |:---------|:--------|---------:| |rep1- |fraction | 1| |rep1+ |total | 1| |rep2- |fraction | 2| |rep2+ |total | 2| ## Identification of TSS We finally use the `betabinom_TSS` function to identify TSSs with the *count table* and the *design table* freshly built as inputs. **Note:** the filtering of data is made by the function (and not before) because it needs the total number of counts before filtering. ``` r bbin <- TSS(count_table = count_table, design_table = design_tb, threshold = 5, # Number minimum reads allowed in the total sample combine_method = "fisher", # Method for the p-values combination fdr_method = "fdr" # Method for adjusting p-value (see p.adjust) ) bbin %>% head(15) %>% knitr::kable() ``` |rname | position|strand | pvalue| pvalue_adj| |:-----------|--------:|:------|---------:|----------:| |NC_003923.1 | 312|+ | 0.0017109| 0.0071046| |NC_003923.1 | 313|+ | 0.0000000| 0.0000000| |NC_003923.1 | 314|+ | 0.0000001| 0.0000008| |NC_003923.1 | 1642|+ | 0.0725677| 0.1682600| |NC_003923.1 | 2131|+ | 0.0001241| 0.0006330| |NC_003923.1 | 3649|+ | 0.0000009| 0.0000058| |NC_003923.1 | 3650|+ | 0.0000000| 0.0000000| |NC_003923.1 | 3652|+ | 0.0000001| 0.0000006| |NC_003923.1 | 4128|+ | 0.0025859| 0.0103542| |NC_003923.1 | 4159|- | 0.0000194| 0.0001102| |NC_003923.1 | 4301|+ | 0.1746218| 0.3408392| |NC_003923.1 | 6734|+ | 0.0182064| 0.0576161| |NC_003923.1 | 10687|- | 0.0000000| 0.0000000| |NC_003923.1 | 10855|+ | 0.0000004| 0.0000030| |NC_003923.1 | 10856|+ | 0.1536700| 0.3042540| Number of TSSs identified with an adjusted p-value (FDR) < 0.01: 2677. ### Clusters of TSS As in the *Prados et al.* paper we can clusterise TSSs by regrouping those less than 5 nucleotides apart. ``` r bbin <- bbin %>% filter(pvalue_adj < 0.01) %>% arrange(position) %>% mutate(cluster_id = cumsum(c(1, diff(position) >= 5))) bbin %>% head(15) %>% knitr::kable() ``` |rname | position|strand | pvalue| pvalue_adj| cluster_id| |:-----------|--------:|:------|---------:|----------:|----------:| |NC_003923.1 | 312|+ | 0.0017109| 0.0071046| 1| |NC_003923.1 | 313|+ | 0.0000000| 0.0000000| 1| |NC_003923.1 | 314|+ | 0.0000001| 0.0000008| 1| |NC_003923.1 | 2131|+ | 0.0001241| 0.0006330| 2| |NC_003923.1 | 3649|+ | 0.0000009| 0.0000058| 3| |NC_003923.1 | 3650|+ | 0.0000000| 0.0000000| 3| |NC_003923.1 | 3652|+ | 0.0000001| 0.0000006| 3| |NC_003923.1 | 4159|- | 0.0000194| 0.0001102| 4| |NC_003923.1 | 10687|- | 0.0000000| 0.0000000| 5| |NC_003923.1 | 10855|+ | 0.0000004| 0.0000030| 6| |NC_003923.1 | 12473|+ | 0.0000000| 0.0000000| 7| |NC_003923.1 | 14672|+ | 0.0000634| 0.0003335| 8| |NC_003923.1 | 15797|+ | 0.0000000| 0.0000000| 9| |NC_003923.1 | 17356|+ | 0.0000000| 0.0000000| 10| |NC_003923.1 | 17357|+ | 0.0000000| 0.0000000| 10| Number of cluster of TSSs identified with an adjusted p-value (fdr) < 0.01: 1923. ``` r knitr::knit_exit() ``` ### Part of paper moved here Transcription Start Sites identification RNA-end sequencing was initially motivated to address the inability of RNA sequencing protocols to determine precisely the exact 5' end of RNA molecules in vivo and their poor performance in identifying and quantifying the preferred transcription start site (TSS) at a genomic location. The strategy in prokaryotes is based on the phosphorylation state of the RNA 5’ end. Indeed, only RNAs which 5’ ends are tri-phosphorylated correspond to native 5’ extremities and therefore to TSSs. To illustrate a first type of statistical analysis that can be performed on the basis of the results provided by the application of the rnaends package, we used the results obtained in (Prados et al., 2016) for the detection of TSS in Staphylococcus aureus, based on the TSS-EMOTE protocol Prados2016 (Redder, 2015). We treated the raw data from (Prados et al., 2016) with ranends from the pre-processing step to the count table in which mapped valid reads are counted by reference name, position and strand. This table is paired with a design matrix that associates each barcode to an experimental condition (strain, replicate, assay). We have implemented a function dedicated to statistical processing using beta-binomial testing as described in Prados2016, betabinom_TSS, which integrates UMIs into the quantification step, in order to eliminate duplicates from the PCR amplification step and consolidates statistical results. Compared with Prados’s results, we identify 126 additional TSS clusters while 16 previously identified TSS clusters are no longer detected. After investigation, it appears that the population size used in the binomial tests in their analyses was incorrectly implemented, i.e. they use the library size before PCR duplicates removal which leads to erroneous results. The detailed analysis and code is available in a vignette provided as supplementary material at https://rnaends.gitlab.io/rnaends/vignettes/TSS_identification.html. Given that this approach is simply based on the application of a beta binomial model between two samples, it can easily be transposed to analyze the results of other experimental design.