Endonucleolytic cleavage site prediction using 5’-end RNA-Seq data

Complete source of this page is available in the notebook Endoribonucleolytic_sites_prediction.Rmd.

Based on the publication from Broglia L, Lécrivain A-L, Renault TT, Hahnke K, Ahmed-Begrich R, Le Rhun A, Charpentier E (2020) An RNA-seq based comparative approach reveals the transcriptome-wide interplay between 3′-to-5′ exoRNases and RNase Y. Nature Communications, 11, 1587. https://doi.org/10.1038/s41467-020-15387-6

Download datasets

Data from this article are publicly available on the SRA NCBI repository: https://www.ncbi.nlm.nih.gov/sra?term=SRP149896

For this notebook, we will use only 9 RNA-Seq samples corresponding to 9 datasets of Streptococcus pyogenes 5’-end RNA-Seq (SRX4173017, SRX4173016, SRX4173015, SRX4173014, SRX4173013, SRX4173012, SRX4173011, SRX4173010, SRX4173009).

As the datasets are too bug for direct download, fasterq-dump (instead of basic “curl/wget” command) from ncbi/sra-tools must be used to retrieve FASTQ files directly in the extdata directory of the GitLab project in the sub-directory called Spyo_Broglia where the reference genome is already provided when the project is clone:

for i in SRR7269351	SRR7269352 SRR7269353 SRR7269354 SRR7269355 SRR7269356 SRR7269357 SRR7269358 SRR7269359 ; do 
  fasterq-dump $i --outdir ../extdata/Spyo_Broglia ; 
done

We remove all reverse files (rm SRR7236935._2.fastq.gz) and we gzip all forward files.

library(tidyverse)
library(rnaends)

Pre-processing

Thse datasets were already pre-processed when they have been submitted, so we can directly map the FASTQ files on the reference genome.

Quantification

We use the map_fastq_files to map the 9 FASTQ files in a single command.

bf <- map_fastq_files(list_fastq_files = list.files("../extdata/Spyo_Broglia/",
                                                    pattern = ".fastq.gz",
                                                    full.names = TRUE),
                      index_name = "../extdata/Spyo_Broglia/data/spyo_SF370.bt", # Path of the genome index (created if not exists)
                      fasta_name = "../extdata/Spyo_Broglia/spyo_SF370.fasta", # Path of the reference genome file
                      bam_dir = "../extdata/Spyo_Broglia/mapping_results", # Output directory
                      max_map_mismatch = 1,
                      aligner = "bowtie", # Either bowtie or subread
                      map_unique = FALSE,
                      force = F,
                      threads = 3)
#> bam_file '../extdata/Spyo_Broglia/mapping_results/SRR7269351_1.fastq.gz.bam' already exists, set force=TRUE to recompute
#> bam_file '../extdata/Spyo_Broglia/mapping_results/SRR7269352_1.fastq.gz.bam' already exists, set force=TRUE to recompute
#> bam_file '../extdata/Spyo_Broglia/mapping_results/SRR7269353_1.fastq.gz.bam' already exists, set force=TRUE to recompute
#> bam_file '../extdata/Spyo_Broglia/mapping_results/SRR7269354_1.fastq.gz.bam' already exists, set force=TRUE to recompute
#> bam_file '../extdata/Spyo_Broglia/mapping_results/SRR7269355_1.fastq.gz.bam' already exists, set force=TRUE to recompute
#> bam_file '../extdata/Spyo_Broglia/mapping_results/SRR7269356_1.fastq.gz.bam' already exists, set force=TRUE to recompute
#> bam_file '../extdata/Spyo_Broglia/mapping_results/SRR7269357_1.fastq.gz.bam' already exists, set force=TRUE to recompute
#> bam_file '../extdata/Spyo_Broglia/mapping_results/SRR7269358_1.fastq.gz.bam' already exists, set force=TRUE to recompute
#> bam_file '../extdata/Spyo_Broglia/mapping_results/SRR7269359_1.fastq.gz.bam' already exists, set force=TRUE to recompute
bf
#> [[1]]
#> [1] "../extdata/Spyo_Broglia/mapping_results/SRR7269351_1.fastq.gz.bam"
#> 
#> [[2]]
#> [1] "../extdata/Spyo_Broglia/mapping_results/SRR7269352_1.fastq.gz.bam"
#> 
#> [[3]]
#> [1] "../extdata/Spyo_Broglia/mapping_results/SRR7269353_1.fastq.gz.bam"
#> 
#> [[4]]
#> [1] "../extdata/Spyo_Broglia/mapping_results/SRR7269354_1.fastq.gz.bam"
#> 
#> [[5]]
#> [1] "../extdata/Spyo_Broglia/mapping_results/SRR7269355_1.fastq.gz.bam"
#> 
#> [[6]]
#> [1] "../extdata/Spyo_Broglia/mapping_results/SRR7269356_1.fastq.gz.bam"
#> 
#> [[7]]
#> [1] "../extdata/Spyo_Broglia/mapping_results/SRR7269357_1.fastq.gz.bam"
#> 
#> [[8]]
#> [1] "../extdata/Spyo_Broglia/mapping_results/SRR7269358_1.fastq.gz.bam"
#> 
#> [[9]]
#> [1] "../extdata/Spyo_Broglia/mapping_results/SRR7269359_1.fastq.gz.bam"

Mapping statistics

The map_stats_bam_files function provides some basic statistics on the mapping:

map_stats_bam_files(bam_files = bf) %>%
  knitr::kable()
#> BAM file mapping stats exists '../extdata/Spyo_Broglia/mapping_results/SRR7269351_1.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute
#> BAM file mapping stats exists '../extdata/Spyo_Broglia/mapping_results/SRR7269352_1.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute
#> BAM file mapping stats exists '../extdata/Spyo_Broglia/mapping_results/SRR7269353_1.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute
#> BAM file mapping stats exists '../extdata/Spyo_Broglia/mapping_results/SRR7269354_1.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute
#> BAM file mapping stats exists '../extdata/Spyo_Broglia/mapping_results/SRR7269355_1.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute
#> BAM file mapping stats exists '../extdata/Spyo_Broglia/mapping_results/SRR7269356_1.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute
#> BAM file mapping stats exists '../extdata/Spyo_Broglia/mapping_results/SRR7269357_1.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute
#> BAM file mapping stats exists '../extdata/Spyo_Broglia/mapping_results/SRR7269358_1.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute
#> BAM file mapping stats exists '../extdata/Spyo_Broglia/mapping_results/SRR7269359_1.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute

bam_file	mapped	unmapped	pc_mapped
SRR7269351_1.fastq.gz.bam	41814799	988534	97.69
SRR7269352_1.fastq.gz.bam	41739505	938889	97.80
SRR7269353_1.fastq.gz.bam	43357081	1019019	97.70
SRR7269354_1.fastq.gz.bam	43026077	1016127	97.69
SRR7269355_1.fastq.gz.bam	34567930	650684	98.15
SRR7269356_1.fastq.gz.bam	37802063	705215	98.17
SRR7269357_1.fastq.gz.bam	26960617	641089	97.68
SRR7269358_1.fastq.gz.bam	34908063	977130	97.28
SRR7269359_1.fastq.gz.bam	40858741	759768	98.17

Quantification (counts computation)

In order to obtain counts per position, we use the quantify_from_bam_files function to parse previoulsy obtained bam_files.

This function allows the quantification of either the 5’-ends (quantify_5prime) or the 3’-ends (quantify_3prime), in all cases it’s an exact position quantification.

Run with minimal parameters:

count_table <- quantify_from_bam_files(bam_files = bf,
                                       sample_ids = c("WT1", "WT2", "WT3", 
                                                    "Y1", "Y2", "Y3", 
                                                    "suppl1", "suppl2", "suppl3"), # Set an identifier to know the source of each values
                                       fun = quantify_5prime)
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/Spyo_Broglia/mapping_results/SRR7269351_1.fastq.gz.bam.csv
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/Spyo_Broglia/mapping_results/SRR7269352_1.fastq.gz.bam.csv
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/Spyo_Broglia/mapping_results/SRR7269353_1.fastq.gz.bam.csv
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/Spyo_Broglia/mapping_results/SRR7269354_1.fastq.gz.bam.csv
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/Spyo_Broglia/mapping_results/SRR7269355_1.fastq.gz.bam.csv
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/Spyo_Broglia/mapping_results/SRR7269356_1.fastq.gz.bam.csv
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/Spyo_Broglia/mapping_results/SRR7269357_1.fastq.gz.bam.csv
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/Spyo_Broglia/mapping_results/SRR7269358_1.fastq.gz.bam.csv
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/Spyo_Broglia/mapping_results/SRR7269359_1.fastq.gz.bam.csv

count_table %>% 
  head(15) %>%
  knitr::kable()

sample_id	rname	position	strand	count
WT1	AE004092.2	38	+	2
WT1	AE004092.2	205	+	819
WT1	AE004092.2	206	+	13
WT1	AE004092.2	211	+	7
WT1	AE004092.2	215	+	8
WT1	AE004092.2	217	+	34
WT1	AE004092.2	218	+	2
WT1	AE004092.2	219	+	2
WT1	AE004092.2	219	-	5
WT1	AE004092.2	236	+	15
WT1	AE004092.2	248	+	31
WT1	AE004092.2	253	+	13
WT1	AE004092.2	254	+	1
WT1	AE004092.2	262	+	14
WT1	AE004092.2	265	+	117

Downstream analysis

Differential abundances for prediction of cleavage sites

As in the article, we’re going to carry out a differential gene expression analysis (DGE), but at the resolution of an exact position to detect RNase Y endonucleolytic cleavage sites by comparing 3 S. pyogenes strains:

Wild-type
RNase Y deletion mutant
RNAse Y deletion mutant supplemented

To perform a DGE analysis the “edgeR” package is be used (which is not part of the requirements of the rnaends package).

We use as input data the count table we’ve just produced.

library(edgeR)

Pivot the count table to create a DGEList object

group1 <- c("WT", "WT", "WT")
group2 <- c("Y", "Y", "Y")
group3 <- c("suppl", "suppl", "suppl")

# Pivot count table
data <- count_table %>% 
  pivot_wider(id_cols = c(position, strand),
              names_from = sample_id,
              values_from = count, values_fill = 0)

# Transform into matrix for DGElist
mat <- data[,3:ncol(data)] %>% 
  as.matrix()
rownames(mat) <- paste0(data$position, data$strand)
mat %>% 
  head(15) %>%
  knitr::kable()

	WT1	WT2	WT3	Y1	Y2	Y3	suppl1	suppl2	suppl3
38+	2	0	0	0	0	0	0	0	0
205+	819	491	358	1001	345	568	212	304	976
206+	13	22	5	44	17	117	3	4	21
211+	7	11	5	35	0	0	1	1	0
215+	8	1	10	0	4	4	1	0	0
217+	34	13	7	16	2	8	15	1	10
218+	2	0	0	0	0	0	0	0	0
219+	2	0	1	0	0	0	0	0	0
219-	5	0	0	0	0	0	0	0	0
236+	15	0	0	1	0	1	0	0	31
248+	31	15	2	6	0	9	45	0	1
253+	13	0	0	0	0	1	0	0	0
254+	1	0	0	3	0	0	0	0	0
262+	14	0	6	1	3	14	43	0	0
265+	117	33	1	0	15	11	0	0	0

DGEList and filter count according cpm

dge <- DGEList(counts = mat[,1:9], group = factor(c(group1, group2, group3)) )

# Keep position where there is at least 6 conditions with cpm>0.05
keep <- rowSums(cpm(dge) >= 0.05) >= 6
dge.filter <- dge[keep, ]

# Update dge (for racalcul of cpm)
dge2.filter <- DGEList(dge.filter$counts,
                       group = factor(c(group1, group2, group3)) )

# Keep position where there is at least 3 conditions with cpm>5
keep <- rowSums(cpm(dge2.filter) >= 5) >= 3
dge3.filter <- dge2.filter[keep, ] 

cpm(dge3.filter) %>% 
  head(15) %>% knitr::kable()

	WT1	WT2	WT3	Y1	Y2	Y3	suppl1	suppl2	suppl3
205+	22.170898	13.4253747	9.1030977	27.186211	10.908131	16.3304562	8.5810845	9.3700179	26.9990941
2540+	5.820199	2.1054050	7.6028665	1.249316	1.359564	2.5588215	9.9572962	2.3425045	0.2489671
3452+	132.483974	189.8965934	68.4003711	212.057875	77.115743	78.1446831	47.9650241	67.0079572	177.7071523
4499+	6.253330	8.1481908	2.5173371	8.609419	1.960302	1.3225369	3.0357610	1.1712522	0.0276630
4512+	20.790293	7.3825889	7.2977348	14.828842	13.121375	4.6576301	11.6573223	7.3665602	0.1659780
11902+	12.777367	6.4255867	12.4341195	28.082459	14.259615	15.2954273	2.0238407	6.8117565	26.1138779
12203+	10.124439	8.6403634	4.2718447	4.725675	6.797821	6.6414355	1.4571653	7.6131396	0.8852162
13365+	8.418986	3.1991219	4.3481277	5.377492	7.493412	3.7376044	3.3190987	0.7089158	4.5367330
13441+	5.874340	6.9997880	3.6107259	5.784878	2.719128	3.3350932	4.4119727	0.2157570	0.0000000
13446+	12.750297	5.7693565	7.7554324	9.695781	4.584577	8.2514805	24.4075186	1.5719438	0.0276630
13451+	8.121208	0.1093717	9.4590848	7.197149	1.201475	0.5750161	3.4405292	0.3698691	0.2766301
13475+	12.235954	24.9094020	3.1530283	8.745215	4.963990	5.3764002	4.4119727	3.6062240	23.5412183
13527+	1.840807	0.2460863	0.9916782	14.204184	5.469874	6.6126847	0.0404768	0.3082243	21.0792108
13552+	17.541809	6.7263588	0.9662506	8.364988	7.019145	5.4626526	8.0548859	0.4931588	0.0276630
13559+	6.172118	3.8553520	0.3559871	16.376908	1.517653	4.1113649	4.2905422	0.0616449	9.3224331

Normalize and estimate dispersion

Normalization

dge3.filter <- calcNormFactors(dge3.filter, method = "TMM")

Dispersion

disp <- estimateDisp(dge3.filter)
#> Using classic mode.

Exact tests

Comparison WT vs RNase Y deletion mutant

et <- exactTest(disp, pair = c("WT", "Y"))
res_WT <- et$table 
res_WT$fdr <- p.adjust(res_WT$PValue, method = "fdr") # Adjusting p.values
res_WT$Position <- as.integer(str_sub(rownames(res_WT), 1, -2)) # Retrieving positions using stocked into the rownames
res_WT$strand <- str_sub(rownames(res_WT), -1) # Retrieving strand
res_WT %>% 
  head(15) %>%
  knitr::kable()

	logFC	logCPM	PValue	fdr	Position	strand
205+	0.4757538	4.097125	0.6161737	1.0000000	205	+
2540+	-1.2969346	1.768362	0.3370351	1.0000000	2540	+
3452+	0.0168134	6.936504	0.9846036	1.0000000	3452	+
4499+	-0.4542850	1.736353	0.7499819	1.0000000	4499	+
4512+	0.0787080	3.143991	0.9482911	1.0000000	4512	+
11902+	1.0497216	3.917431	0.3388455	1.0000000	11902	+
12203+	-0.0990961	2.468814	0.9318575	1.0000000	12203	+
13365+	0.2956658	2.204602	0.8001578	1.0000000	13365	+
13441+	-0.3248103	1.752619	0.8400716	1.0000000	13441	+
13446+	0.0078938	2.913511	0.9962916	1.0000000	13446	+
13451+	-0.8812748	1.583629	0.5850758	1.0000000	13451	+
13475+	-0.9495031	3.505009	0.4234994	1.0000000	13475	+
13527+	3.3057756	2.826576	0.0722374	0.5765283	13527	+
13552+	-0.0277194	2.451620	0.9866344	1.0000000	13552	+
13559+	1.1973548	2.436253	0.4545305	1.0000000	13559	+

Comparison supplemented RNase Y deletion mutant vs. RNase Y deletion mutant

et <- exactTest(disp, pair = c("suppl", "Y"))
res_suppl <- et$table 
res_suppl$fdr <- p.adjust(res_suppl$PValue, method = "fdr") # Adjusting p.values
res_suppl$Position <- as.integer(str_sub(rownames(res_suppl),1,-2)) # Retrieving positions using stocked into the rownames
res_suppl$strand <- str_sub(rownames(res_suppl),-1) # Retrieving strand
res_suppl %>% 
  head(15) %>%
  knitr::kable()

	logFC	logCPM	PValue	fdr	Position	strand
205+	-0.2700981	4.097125	0.7756757	0.9841472	205	+
2540+	-1.1317669	1.768362	0.4033664	0.8990873	2540	+
3452+	-0.2658754	6.936504	0.7586034	0.9797373	3452	+
4499+	1.4232888	1.736353	0.3289756	0.8528307	4499	+
4512+	0.7377628	3.143991	0.5398056	0.9480726	4512	+
11902+	0.0581356	3.917431	0.9581940	0.9984513	11902	+
12203+	0.7146266	2.468814	0.5326933	0.9457086	12203	+
13365+	0.5319706	2.204602	0.6509544	0.9710619	13365	+
13441+	1.4779476	1.752619	0.3635666	0.8768117	13441	+
13446+	-0.0561098	2.913511	0.9706936	0.9984513	13446	+
13451+	1.0310394	1.583629	0.5285491	0.9456388	13451	+
13475+	-1.3589715	3.505009	0.2557154	0.7890100	13475	+
13527+	-0.5195980	2.826576	0.7595278	0.9799139	13527	+
13552+	1.4159979	2.451620	0.3564504	0.8744636	13552	+
13559+	0.0373990	2.436253	0.9827840	0.9990971	13559	+

Visualization

res_WT %>% 
  ggplot(aes(x = logFC, y = -log10(fdr), color = logCPM)) +
  geom_point(alpha = 0.5) +
  theme_bw() + 
  geom_vline(xintercept = -1, col = "red") +
  geom_hline(yintercept = -log10(0.05), col = "red")

Merge comparisons and select significant positions

To identify a position as a RNase Y endonucleolytic cleavage site, Broglia et al. keep only positions where the fdr is lower than 0.01 and logFC lower of equal to -1 in both comparison we’ve made.

So we are going to merge our both result tables then filter following these criterias.

res <- res_suppl %>% 
  inner_join(res_WT, by = c("Position","strand"), suffix = c(".suppl", ".WT")) 

Number of cleavage sites identified:

endoY <- res %>% 
  filter(fdr.WT<0.05 & fdr.suppl<0.05 & logFC.WT<= -1 & logFC.suppl<= -1)
endoY %>%
  nrow
#> [1] 532

Number of clusters of cleavage sites identified:

endoY <- endoY %>% 
 arrange(Position) %>%
 mutate(ID = cumsum(c(1, diff(Position) >= 5))) %>% 
 arrange(desc(ID)) 
endoY %>% 
  group_by(ID) %>% 
  n_groups
#> [1] 381

Comparison with Broglia results

We provide a small table of RNase Y cleavage sites identifed by Broglia et al. using 5’-end RNA-Seq in the extdata directory our GitLab project in order to make the comparison.

Number of RNase Y cleavage sites identified using 5’-end RNA-Seq:

csy <- read_csv("../extdata/Spyo_Broglia/RNaseY_cleavage_site_5Prime.csv", show_col_types = F)
csy %>% 
  nrow
#> [1] 190

Number of RNase Y cleavage sites found in common:

common <- endoY %>% 
  inner_join(csy, by="Position")
common %>%
  nrow
#> [1] 119

Number of RNase Y cleavage sites not found by us:

not_found <- csy %>% 
  anti_join(endoY, by ="Position")
not_found %>%
  nrow
#> [1] 71

Number of sites found not significant:

res %>% 
  filter(Position %in% not_found$Position) %>% 
  nrow
#> [1] 32

Number of sites not tested due to CPM filtering:

not_found %>% 
  anti_join(res, by = "Position") %>% 
  nrow 
#> [1] 39