--- title: "Endonucleolytic cleavage sites prediction using 5'-end RNA-Seq data" output: html_document: keep_md: true vignette: > %\VignetteIndexEntry{Endoribonucleolytic_site_prediction_using_differential_expression} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Endonucleolytic cleavage site prediction using 5'-end RNA-Seq data Complete source of this page is available in the notebook [Endoribonucleolytic_sites_prediction.Rmd](https://gitlab.com/rnaends/rnaends/-/blob/main/vignettes/Endoribonucleolytic_sites_prediction.Rmd). Based on the publication from Broglia L, Lécrivain A-L, Renault TT, Hahnke K, Ahmed-Begrich R, Le Rhun A, Charpentier E (2020) An RNA-seq based comparative approach reveals the transcriptome-wide interplay between 3′-to-5′ exoRNases and RNase Y. Nature Communications, 11, 1587. [https://doi.org/10.1038/s41467-020-15387-6](https://doi.org/10.1038/s41467-020-15387-6) ## Download datasets Data from this article are publicly available on the SRA NCBI repository: https://www.ncbi.nlm.nih.gov/sra?term=SRP149896 For this notebook, we will use only 9 RNA-Seq samples corresponding to 9 datasets of *Streptococcus pyogenes* 5'-end RNA-Seq (SRX4173017, SRX4173016, SRX4173015, SRX4173014, SRX4173013, SRX4173012, SRX4173011, SRX4173010, SRX4173009). As the datasets are too bug for direct download, `fasterq-dump` (instead of basic "curl/wget" command) from [ncbi/sra-tools](https://github.com/ncbi/sra-tools) must be used to retrieve FASTQ files directly in the `extdata` directory of the GitLab project in the sub-directory called `Spyo_Broglia` where the reference genome is already provided when the project is clone: ``` for i in SRR7269351 SRR7269352 SRR7269353 SRR7269354 SRR7269355 SRR7269356 SRR7269357 SRR7269358 SRR7269359 ; do fasterq-dump $i --outdir ../extdata/Spyo_Broglia ; done ``` We remove all reverse files (rm SRR7236935._2.fastq.gz) and we gzip all forward files. ``` r library(tidyverse) library(rnaends) ``` ## Pre-processing Thse datasets were already pre-processed when they have been submitted, so we can directly map the FASTQ files on the reference genome. ## Quantification We use the **map_fastq_files** to map the 9 FASTQ files in a single command. ``` r bf <- map_fastq_files(list_fastq_files = list.files("../extdata/Spyo_Broglia/", pattern = ".fastq.gz", full.names = TRUE), index_name = "../extdata/Spyo_Broglia/data/spyo_SF370.bt", # Path of the genome index (created if not exists) fasta_name = "../extdata/Spyo_Broglia/spyo_SF370.fasta", # Path of the reference genome file bam_dir = "../extdata/Spyo_Broglia/mapping_results", # Output directory max_map_mismatch = 1, aligner = "bowtie", # Either bowtie or subread map_unique = FALSE, force = F, threads = 3) #> bam_file '../extdata/Spyo_Broglia/mapping_results/SRR7269351_1.fastq.gz.bam' already exists, set force=TRUE to recompute #> bam_file '../extdata/Spyo_Broglia/mapping_results/SRR7269352_1.fastq.gz.bam' already exists, set force=TRUE to recompute #> bam_file '../extdata/Spyo_Broglia/mapping_results/SRR7269353_1.fastq.gz.bam' already exists, set force=TRUE to recompute #> bam_file '../extdata/Spyo_Broglia/mapping_results/SRR7269354_1.fastq.gz.bam' already exists, set force=TRUE to recompute #> bam_file '../extdata/Spyo_Broglia/mapping_results/SRR7269355_1.fastq.gz.bam' already exists, set force=TRUE to recompute #> bam_file '../extdata/Spyo_Broglia/mapping_results/SRR7269356_1.fastq.gz.bam' already exists, set force=TRUE to recompute #> bam_file '../extdata/Spyo_Broglia/mapping_results/SRR7269357_1.fastq.gz.bam' already exists, set force=TRUE to recompute #> bam_file '../extdata/Spyo_Broglia/mapping_results/SRR7269358_1.fastq.gz.bam' already exists, set force=TRUE to recompute #> bam_file '../extdata/Spyo_Broglia/mapping_results/SRR7269359_1.fastq.gz.bam' already exists, set force=TRUE to recompute bf #> [[1]] #> [1] "../extdata/Spyo_Broglia/mapping_results/SRR7269351_1.fastq.gz.bam" #> #> [[2]] #> [1] "../extdata/Spyo_Broglia/mapping_results/SRR7269352_1.fastq.gz.bam" #> #> [[3]] #> [1] "../extdata/Spyo_Broglia/mapping_results/SRR7269353_1.fastq.gz.bam" #> #> [[4]] #> [1] "../extdata/Spyo_Broglia/mapping_results/SRR7269354_1.fastq.gz.bam" #> #> [[5]] #> [1] "../extdata/Spyo_Broglia/mapping_results/SRR7269355_1.fastq.gz.bam" #> #> [[6]] #> [1] "../extdata/Spyo_Broglia/mapping_results/SRR7269356_1.fastq.gz.bam" #> #> [[7]] #> [1] "../extdata/Spyo_Broglia/mapping_results/SRR7269357_1.fastq.gz.bam" #> #> [[8]] #> [1] "../extdata/Spyo_Broglia/mapping_results/SRR7269358_1.fastq.gz.bam" #> #> [[9]] #> [1] "../extdata/Spyo_Broglia/mapping_results/SRR7269359_1.fastq.gz.bam" ``` ### Mapping statistics The **map_stats_bam_files** function provides some basic statistics on the mapping: ``` r map_stats_bam_files(bam_files = bf) %>% knitr::kable() #> BAM file mapping stats exists '../extdata/Spyo_Broglia/mapping_results/SRR7269351_1.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute #> BAM file mapping stats exists '../extdata/Spyo_Broglia/mapping_results/SRR7269352_1.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute #> BAM file mapping stats exists '../extdata/Spyo_Broglia/mapping_results/SRR7269353_1.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute #> BAM file mapping stats exists '../extdata/Spyo_Broglia/mapping_results/SRR7269354_1.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute #> BAM file mapping stats exists '../extdata/Spyo_Broglia/mapping_results/SRR7269355_1.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute #> BAM file mapping stats exists '../extdata/Spyo_Broglia/mapping_results/SRR7269356_1.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute #> BAM file mapping stats exists '../extdata/Spyo_Broglia/mapping_results/SRR7269357_1.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute #> BAM file mapping stats exists '../extdata/Spyo_Broglia/mapping_results/SRR7269358_1.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute #> BAM file mapping stats exists '../extdata/Spyo_Broglia/mapping_results/SRR7269359_1.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute ``` |bam_file | mapped| unmapped| pc_mapped| |:-------------------------|--------:|--------:|---------:| |SRR7269351_1.fastq.gz.bam | 41814799| 988534| 97.69| |SRR7269352_1.fastq.gz.bam | 41739505| 938889| 97.80| |SRR7269353_1.fastq.gz.bam | 43357081| 1019019| 97.70| |SRR7269354_1.fastq.gz.bam | 43026077| 1016127| 97.69| |SRR7269355_1.fastq.gz.bam | 34567930| 650684| 98.15| |SRR7269356_1.fastq.gz.bam | 37802063| 705215| 98.17| |SRR7269357_1.fastq.gz.bam | 26960617| 641089| 97.68| |SRR7269358_1.fastq.gz.bam | 34908063| 977130| 97.28| |SRR7269359_1.fastq.gz.bam | 40858741| 759768| 98.17| ### Quantification (counts computation) In order to obtain counts per position, we use the **quantify_from_bam_files** function to parse previoulsy obtained bam_files. This function allows the quantification of either the 5'-ends (**quantify_5prime**) or the 3'-ends (**quantify_3prime**), in all cases it's an exact position quantification. Run with minimal parameters: ``` r count_table <- quantify_from_bam_files(bam_files = bf, sample_ids = c("WT1", "WT2", "WT3", "Y1", "Y2", "Y3", "suppl1", "suppl2", "suppl3"), # Set an identifier to know the source of each values fun = quantify_5prime) #> Quantification 5' output file exists, set force=T to overwrite: ../extdata/Spyo_Broglia/mapping_results/SRR7269351_1.fastq.gz.bam.csv #> Quantification 5' output file exists, set force=T to overwrite: ../extdata/Spyo_Broglia/mapping_results/SRR7269352_1.fastq.gz.bam.csv #> Quantification 5' output file exists, set force=T to overwrite: ../extdata/Spyo_Broglia/mapping_results/SRR7269353_1.fastq.gz.bam.csv #> Quantification 5' output file exists, set force=T to overwrite: ../extdata/Spyo_Broglia/mapping_results/SRR7269354_1.fastq.gz.bam.csv #> Quantification 5' output file exists, set force=T to overwrite: ../extdata/Spyo_Broglia/mapping_results/SRR7269355_1.fastq.gz.bam.csv #> Quantification 5' output file exists, set force=T to overwrite: ../extdata/Spyo_Broglia/mapping_results/SRR7269356_1.fastq.gz.bam.csv #> Quantification 5' output file exists, set force=T to overwrite: ../extdata/Spyo_Broglia/mapping_results/SRR7269357_1.fastq.gz.bam.csv #> Quantification 5' output file exists, set force=T to overwrite: ../extdata/Spyo_Broglia/mapping_results/SRR7269358_1.fastq.gz.bam.csv #> Quantification 5' output file exists, set force=T to overwrite: ../extdata/Spyo_Broglia/mapping_results/SRR7269359_1.fastq.gz.bam.csv count_table %>% head(15) %>% knitr::kable() ``` |sample_id |rname | position|strand | count| |:---------|:----------|--------:|:------|-----:| |WT1 |AE004092.2 | 38|+ | 2| |WT1 |AE004092.2 | 205|+ | 819| |WT1 |AE004092.2 | 206|+ | 13| |WT1 |AE004092.2 | 211|+ | 7| |WT1 |AE004092.2 | 215|+ | 8| |WT1 |AE004092.2 | 217|+ | 34| |WT1 |AE004092.2 | 218|+ | 2| |WT1 |AE004092.2 | 219|+ | 2| |WT1 |AE004092.2 | 219|- | 5| |WT1 |AE004092.2 | 236|+ | 15| |WT1 |AE004092.2 | 248|+ | 31| |WT1 |AE004092.2 | 253|+ | 13| |WT1 |AE004092.2 | 254|+ | 1| |WT1 |AE004092.2 | 262|+ | 14| |WT1 |AE004092.2 | 265|+ | 117| ## Downstream analysis ### Differential abundances for prediction of cleavage sites As in the article, we're going to carry out a differential gene expression analysis (DGE), but at the resolution of an exact position to detect RNase Y endonucleolytic cleavage sites by comparing 3 *S. pyogenes* strains: * Wild-type * RNase Y deletion mutant * RNAse Y deletion mutant supplemented To perform a DGE analysis the *"edgeR"* package is be used (which is not part of the requirements of the rnaends package). We use as input data the **count table** we've just produced. ``` r library(edgeR) ``` #### Pivot the count table to create a DGEList object ``` r group1 <- c("WT", "WT", "WT") group2 <- c("Y", "Y", "Y") group3 <- c("suppl", "suppl", "suppl") # Pivot count table data <- count_table %>% pivot_wider(id_cols = c(position, strand), names_from = sample_id, values_from = count, values_fill = 0) # Transform into matrix for DGElist mat <- data[,3:ncol(data)] %>% as.matrix() rownames(mat) <- paste0(data$position, data$strand) mat %>% head(15) %>% knitr::kable() ``` | | WT1| WT2| WT3| Y1| Y2| Y3| suppl1| suppl2| suppl3| |:----|---:|---:|---:|----:|---:|---:|------:|------:|------:| |38+ | 2| 0| 0| 0| 0| 0| 0| 0| 0| |205+ | 819| 491| 358| 1001| 345| 568| 212| 304| 976| |206+ | 13| 22| 5| 44| 17| 117| 3| 4| 21| |211+ | 7| 11| 5| 35| 0| 0| 1| 1| 0| |215+ | 8| 1| 10| 0| 4| 4| 1| 0| 0| |217+ | 34| 13| 7| 16| 2| 8| 15| 1| 10| |218+ | 2| 0| 0| 0| 0| 0| 0| 0| 0| |219+ | 2| 0| 1| 0| 0| 0| 0| 0| 0| |219- | 5| 0| 0| 0| 0| 0| 0| 0| 0| |236+ | 15| 0| 0| 1| 0| 1| 0| 0| 31| |248+ | 31| 15| 2| 6| 0| 9| 45| 0| 1| |253+ | 13| 0| 0| 0| 0| 1| 0| 0| 0| |254+ | 1| 0| 0| 3| 0| 0| 0| 0| 0| |262+ | 14| 0| 6| 1| 3| 14| 43| 0| 0| |265+ | 117| 33| 1| 0| 15| 11| 0| 0| 0| #### DGEList and filter count according cpm ``` r dge <- DGEList(counts = mat[,1:9], group = factor(c(group1, group2, group3)) ) # Keep position where there is at least 6 conditions with cpm>0.05 keep <- rowSums(cpm(dge) >= 0.05) >= 6 dge.filter <- dge[keep, ] # Update dge (for racalcul of cpm) dge2.filter <- DGEList(dge.filter$counts, group = factor(c(group1, group2, group3)) ) # Keep position where there is at least 3 conditions with cpm>5 keep <- rowSums(cpm(dge2.filter) >= 5) >= 3 dge3.filter <- dge2.filter[keep, ] cpm(dge3.filter) %>% head(15) %>% knitr::kable() ``` | | WT1| WT2| WT3| Y1| Y2| Y3| suppl1| suppl2| suppl3| |:------|----------:|-----------:|----------:|----------:|---------:|----------:|----------:|----------:|-----------:| |205+ | 22.170898| 13.4253747| 9.1030977| 27.186211| 10.908131| 16.3304562| 8.5810845| 9.3700179| 26.9990941| |2540+ | 5.820199| 2.1054050| 7.6028665| 1.249316| 1.359564| 2.5588215| 9.9572962| 2.3425045| 0.2489671| |3452+ | 132.483974| 189.8965934| 68.4003711| 212.057875| 77.115743| 78.1446831| 47.9650241| 67.0079572| 177.7071523| |4499+ | 6.253330| 8.1481908| 2.5173371| 8.609419| 1.960302| 1.3225369| 3.0357610| 1.1712522| 0.0276630| |4512+ | 20.790293| 7.3825889| 7.2977348| 14.828842| 13.121375| 4.6576301| 11.6573223| 7.3665602| 0.1659780| |11902+ | 12.777367| 6.4255867| 12.4341195| 28.082459| 14.259615| 15.2954273| 2.0238407| 6.8117565| 26.1138779| |12203+ | 10.124439| 8.6403634| 4.2718447| 4.725675| 6.797821| 6.6414355| 1.4571653| 7.6131396| 0.8852162| |13365+ | 8.418986| 3.1991219| 4.3481277| 5.377492| 7.493412| 3.7376044| 3.3190987| 0.7089158| 4.5367330| |13441+ | 5.874340| 6.9997880| 3.6107259| 5.784878| 2.719128| 3.3350932| 4.4119727| 0.2157570| 0.0000000| |13446+ | 12.750297| 5.7693565| 7.7554324| 9.695781| 4.584577| 8.2514805| 24.4075186| 1.5719438| 0.0276630| |13451+ | 8.121208| 0.1093717| 9.4590848| 7.197149| 1.201475| 0.5750161| 3.4405292| 0.3698691| 0.2766301| |13475+ | 12.235954| 24.9094020| 3.1530283| 8.745215| 4.963990| 5.3764002| 4.4119727| 3.6062240| 23.5412183| |13527+ | 1.840807| 0.2460863| 0.9916782| 14.204184| 5.469874| 6.6126847| 0.0404768| 0.3082243| 21.0792108| |13552+ | 17.541809| 6.7263588| 0.9662506| 8.364988| 7.019145| 5.4626526| 8.0548859| 0.4931588| 0.0276630| |13559+ | 6.172118| 3.8553520| 0.3559871| 16.376908| 1.517653| 4.1113649| 4.2905422| 0.0616449| 9.3224331| #### Normalize and estimate dispersion Normalization ``` r dge3.filter <- calcNormFactors(dge3.filter, method = "TMM") ``` Dispersion ``` r disp <- estimateDisp(dge3.filter) #> Using classic mode. ``` #### Exact tests Comparison WT vs RNase Y deletion mutant ``` r et <- exactTest(disp, pair = c("WT", "Y")) res_WT <- et$table res_WT$fdr <- p.adjust(res_WT$PValue, method = "fdr") # Adjusting p.values res_WT$Position <- as.integer(str_sub(rownames(res_WT), 1, -2)) # Retrieving positions using stocked into the rownames res_WT$strand <- str_sub(rownames(res_WT), -1) # Retrieving strand res_WT %>% head(15) %>% knitr::kable() ``` | | logFC| logCPM| PValue| fdr| Position|strand | |:------|----------:|--------:|---------:|---------:|--------:|:------| |205+ | 0.4757538| 4.097125| 0.6161737| 1.0000000| 205|+ | |2540+ | -1.2969346| 1.768362| 0.3370351| 1.0000000| 2540|+ | |3452+ | 0.0168134| 6.936504| 0.9846036| 1.0000000| 3452|+ | |4499+ | -0.4542850| 1.736353| 0.7499819| 1.0000000| 4499|+ | |4512+ | 0.0787080| 3.143991| 0.9482911| 1.0000000| 4512|+ | |11902+ | 1.0497216| 3.917431| 0.3388455| 1.0000000| 11902|+ | |12203+ | -0.0990961| 2.468814| 0.9318575| 1.0000000| 12203|+ | |13365+ | 0.2956658| 2.204602| 0.8001578| 1.0000000| 13365|+ | |13441+ | -0.3248103| 1.752619| 0.8400716| 1.0000000| 13441|+ | |13446+ | 0.0078938| 2.913511| 0.9962916| 1.0000000| 13446|+ | |13451+ | -0.8812748| 1.583629| 0.5850758| 1.0000000| 13451|+ | |13475+ | -0.9495031| 3.505009| 0.4234994| 1.0000000| 13475|+ | |13527+ | 3.3057756| 2.826576| 0.0722374| 0.5765283| 13527|+ | |13552+ | -0.0277194| 2.451620| 0.9866344| 1.0000000| 13552|+ | |13559+ | 1.1973548| 2.436253| 0.4545305| 1.0000000| 13559|+ | Comparison supplemented RNase Y deletion mutant *vs.* RNase Y deletion mutant ``` r et <- exactTest(disp, pair = c("suppl", "Y")) res_suppl <- et$table res_suppl$fdr <- p.adjust(res_suppl$PValue, method = "fdr") # Adjusting p.values res_suppl$Position <- as.integer(str_sub(rownames(res_suppl),1,-2)) # Retrieving positions using stocked into the rownames res_suppl$strand <- str_sub(rownames(res_suppl),-1) # Retrieving strand res_suppl %>% head(15) %>% knitr::kable() ``` | | logFC| logCPM| PValue| fdr| Position|strand | |:------|----------:|--------:|---------:|---------:|--------:|:------| |205+ | -0.2700981| 4.097125| 0.7756757| 0.9841472| 205|+ | |2540+ | -1.1317669| 1.768362| 0.4033664| 0.8990873| 2540|+ | |3452+ | -0.2658754| 6.936504| 0.7586034| 0.9797373| 3452|+ | |4499+ | 1.4232888| 1.736353| 0.3289756| 0.8528307| 4499|+ | |4512+ | 0.7377628| 3.143991| 0.5398056| 0.9480726| 4512|+ | |11902+ | 0.0581356| 3.917431| 0.9581940| 0.9984513| 11902|+ | |12203+ | 0.7146266| 2.468814| 0.5326933| 0.9457086| 12203|+ | |13365+ | 0.5319706| 2.204602| 0.6509544| 0.9710619| 13365|+ | |13441+ | 1.4779476| 1.752619| 0.3635666| 0.8768117| 13441|+ | |13446+ | -0.0561098| 2.913511| 0.9706936| 0.9984513| 13446|+ | |13451+ | 1.0310394| 1.583629| 0.5285491| 0.9456388| 13451|+ | |13475+ | -1.3589715| 3.505009| 0.2557154| 0.7890100| 13475|+ | |13527+ | -0.5195980| 2.826576| 0.7595278| 0.9799139| 13527|+ | |13552+ | 1.4159979| 2.451620| 0.3564504| 0.8744636| 13552|+ | |13559+ | 0.0373990| 2.436253| 0.9827840| 0.9990971| 13559|+ | #### Visualization ``` r res_WT %>% ggplot(aes(x = logFC, y = -log10(fdr), color = logCPM)) + geom_point(alpha = 0.5) + theme_bw() + geom_vline(xintercept = -1, col = "red") + geom_hline(yintercept = -log10(0.05), col = "red") ``` ![](../../vignettes/Endoribonucleolytic_sites_prediction_files/figure-html/unnamed-chunk-13-1.png) #### Merge comparisons and select significant positions To identify a position as a RNase Y endonucleolytic cleavage site, Broglia et al. keep only positions where the fdr is lower than 0.01 and logFC lower of equal to -1 in both comparison we've made. So we are going to merge our both result tables then filter following these criterias. ``` r res <- res_suppl %>% inner_join(res_WT, by = c("Position","strand"), suffix = c(".suppl", ".WT")) ``` Number of cleavage sites identified: ``` r endoY <- res %>% filter(fdr.WT<0.05 & fdr.suppl<0.05 & logFC.WT<= -1 & logFC.suppl<= -1) endoY %>% nrow #> [1] 532 ``` Number of clusters of cleavage sites identified: ``` r endoY <- endoY %>% arrange(Position) %>% mutate(ID = cumsum(c(1, diff(Position) >= 5))) %>% arrange(desc(ID)) endoY %>% group_by(ID) %>% n_groups #> [1] 381 ``` ### Comparison with Broglia results We provide a small table of RNase Y cleavage sites identifed by *Broglia et al.* using 5'-end RNA-Seq in the extdata directory our GitLab project in order to make the comparison. Number of RNase Y cleavage sites identified using 5'-end RNA-Seq: ``` r csy <- read_csv("../extdata/Spyo_Broglia/RNaseY_cleavage_site_5Prime.csv", show_col_types = F) csy %>% nrow #> [1] 190 ``` Number of RNase Y cleavage sites found in common: ``` r common <- endoY %>% inner_join(csy, by="Position") common %>% nrow #> [1] 119 ``` Number of RNase Y cleavage sites not found by us: ``` r not_found <- csy %>% anti_join(endoY, by ="Position") not_found %>% nrow #> [1] 71 ``` Number of sites found not significant: ``` r res %>% filter(Position %in% not_found$Position) %>% nrow #> [1] 32 ``` Number of sites not tested due to CPM filtering: ``` r not_found %>% anti_join(res, by = "Position") %>% nrow #> [1] 39 ```