Endonucleolytic cleavage site prediction using 5’-end RNA-Seq data

Complete source of this page is available in the notebook Endoribonucleolytic_sites_prediction.Rmd.

Based on the publication from Broglia L, Lécrivain A-L, Renault TT, Hahnke K, Ahmed-Begrich R, Le Rhun A, Charpentier E (2020) An RNA-seq based comparative approach reveals the transcriptome-wide interplay between 3′-to-5′ exoRNases and RNase Y. Nature Communications, 11, 1587. https://doi.org/10.1038/s41467-020-15387-6

Download datasets

Data from this article are publicly available on the SRA NCBI repository: https://www.ncbi.nlm.nih.gov/sra?term=SRP149896

For this notebook, we will use only 9 RNA-Seq samples corresponding to 9 datasets of Streptococcus pyogenes 5’-end RNA-Seq (SRX4173017, SRX4173016, SRX4173015, SRX4173014, SRX4173013, SRX4173012, SRX4173011, SRX4173010, SRX4173009).

As the datasets are too bug for direct download, fasterq-dump (instead of basic “curl/wget” command) from ncbi/sra-tools must be used to retrieve FASTQ files directly in the extdata directory of the GitLab project in the sub-directory called Spyo_Broglia where the reference genome is already provided when the project is clone:

for i in SRR7269351	SRR7269352 SRR7269353 SRR7269354 SRR7269355 SRR7269356 SRR7269357 SRR7269358 SRR7269359 ; do 
  fasterq-dump $i --outdir ../extdata/Spyo_Broglia ; 
done

We remove all reverse files (rm SRR7236935._2.fastq.gz) and we gzip all forward files.

library(tidyverse)
library(rnaends)

Pre-processing

Thse datasets were already pre-processed when they have been submitted, so we can directly map the FASTQ files on the reference genome.

Quantification

We use the map_fastq_files to map the 9 FASTQ files in a single command.

bf <- map_fastq_files(list_fastq_files = list.files("../extdata/Spyo_Broglia/",
                                                    pattern = ".fastq.gz",
                                                    full.names = TRUE),
                      index_name = "../extdata/Spyo_Broglia/data/spyo_SF370.bt", # Path of the genome index (created if not exists)
                      fasta_name = "../extdata/Spyo_Broglia/spyo_SF370.fasta", # Path of the reference genome file
                      bam_dir = "../extdata/Spyo_Broglia/mapping_results", # Output directory
                      max_map_mismatch = 1,
                      aligner = "bowtie", # Either bowtie or subread
                      map_unique = FALSE,
                      force = F,
                      threads = 3)
#> bam_file '../extdata/Spyo_Broglia/mapping_results/SRR7269351_1.fastq.gz.bam' already exists, set force=TRUE to recompute
#> bam_file '../extdata/Spyo_Broglia/mapping_results/SRR7269352_1.fastq.gz.bam' already exists, set force=TRUE to recompute
#> bam_file '../extdata/Spyo_Broglia/mapping_results/SRR7269353_1.fastq.gz.bam' already exists, set force=TRUE to recompute
#> bam_file '../extdata/Spyo_Broglia/mapping_results/SRR7269354_1.fastq.gz.bam' already exists, set force=TRUE to recompute
#> bam_file '../extdata/Spyo_Broglia/mapping_results/SRR7269355_1.fastq.gz.bam' already exists, set force=TRUE to recompute
#> bam_file '../extdata/Spyo_Broglia/mapping_results/SRR7269356_1.fastq.gz.bam' already exists, set force=TRUE to recompute
#> bam_file '../extdata/Spyo_Broglia/mapping_results/SRR7269357_1.fastq.gz.bam' already exists, set force=TRUE to recompute
#> bam_file '../extdata/Spyo_Broglia/mapping_results/SRR7269358_1.fastq.gz.bam' already exists, set force=TRUE to recompute
#> bam_file '../extdata/Spyo_Broglia/mapping_results/SRR7269359_1.fastq.gz.bam' already exists, set force=TRUE to recompute
bf
#> [[1]]
#> [1] "../extdata/Spyo_Broglia/mapping_results/SRR7269351_1.fastq.gz.bam"
#> 
#> [[2]]
#> [1] "../extdata/Spyo_Broglia/mapping_results/SRR7269352_1.fastq.gz.bam"
#> 
#> [[3]]
#> [1] "../extdata/Spyo_Broglia/mapping_results/SRR7269353_1.fastq.gz.bam"
#> 
#> [[4]]
#> [1] "../extdata/Spyo_Broglia/mapping_results/SRR7269354_1.fastq.gz.bam"
#> 
#> [[5]]
#> [1] "../extdata/Spyo_Broglia/mapping_results/SRR7269355_1.fastq.gz.bam"
#> 
#> [[6]]
#> [1] "../extdata/Spyo_Broglia/mapping_results/SRR7269356_1.fastq.gz.bam"
#> 
#> [[7]]
#> [1] "../extdata/Spyo_Broglia/mapping_results/SRR7269357_1.fastq.gz.bam"
#> 
#> [[8]]
#> [1] "../extdata/Spyo_Broglia/mapping_results/SRR7269358_1.fastq.gz.bam"
#> 
#> [[9]]
#> [1] "../extdata/Spyo_Broglia/mapping_results/SRR7269359_1.fastq.gz.bam"

Mapping statistics

The map_stats_bam_files function provides some basic statistics on the mapping:

map_stats_bam_files(bam_files = bf) %>%
  knitr::kable()
#> BAM file mapping stats exists '../extdata/Spyo_Broglia/mapping_results/SRR7269351_1.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute
#> BAM file mapping stats exists '../extdata/Spyo_Broglia/mapping_results/SRR7269352_1.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute
#> BAM file mapping stats exists '../extdata/Spyo_Broglia/mapping_results/SRR7269353_1.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute
#> BAM file mapping stats exists '../extdata/Spyo_Broglia/mapping_results/SRR7269354_1.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute
#> BAM file mapping stats exists '../extdata/Spyo_Broglia/mapping_results/SRR7269355_1.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute
#> BAM file mapping stats exists '../extdata/Spyo_Broglia/mapping_results/SRR7269356_1.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute
#> BAM file mapping stats exists '../extdata/Spyo_Broglia/mapping_results/SRR7269357_1.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute
#> BAM file mapping stats exists '../extdata/Spyo_Broglia/mapping_results/SRR7269358_1.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute
#> BAM file mapping stats exists '../extdata/Spyo_Broglia/mapping_results/SRR7269359_1.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute

bam_file

mapped

unmapped

pc_mapped

SRR7269351_1.fastq.gz.bam

41814799

988534

97.69

SRR7269352_1.fastq.gz.bam

41739505

938889

97.80

SRR7269353_1.fastq.gz.bam

43357081

1019019

97.70

SRR7269354_1.fastq.gz.bam

43026077

1016127

97.69

SRR7269355_1.fastq.gz.bam

34567930

650684

98.15

SRR7269356_1.fastq.gz.bam

37802063

705215

98.17

SRR7269357_1.fastq.gz.bam

26960617

641089

97.68

SRR7269358_1.fastq.gz.bam

34908063

977130

97.28

SRR7269359_1.fastq.gz.bam

40858741

759768

98.17

Quantification (counts computation)

In order to obtain counts per position, we use the quantify_from_bam_files function to parse previoulsy obtained bam_files.

This function allows the quantification of either the 5’-ends (quantify_5prime) or the 3’-ends (quantify_3prime), in all cases it’s an exact position quantification.

Run with minimal parameters:

count_table <- quantify_from_bam_files(bam_files = bf,
                                       sample_ids = c("WT1", "WT2", "WT3", 
                                                    "Y1", "Y2", "Y3", 
                                                    "suppl1", "suppl2", "suppl3"), # Set an identifier to know the source of each values
                                       fun = quantify_5prime)
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/Spyo_Broglia/mapping_results/SRR7269351_1.fastq.gz.bam.csv
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/Spyo_Broglia/mapping_results/SRR7269352_1.fastq.gz.bam.csv
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/Spyo_Broglia/mapping_results/SRR7269353_1.fastq.gz.bam.csv
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/Spyo_Broglia/mapping_results/SRR7269354_1.fastq.gz.bam.csv
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/Spyo_Broglia/mapping_results/SRR7269355_1.fastq.gz.bam.csv
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/Spyo_Broglia/mapping_results/SRR7269356_1.fastq.gz.bam.csv
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/Spyo_Broglia/mapping_results/SRR7269357_1.fastq.gz.bam.csv
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/Spyo_Broglia/mapping_results/SRR7269358_1.fastq.gz.bam.csv
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/Spyo_Broglia/mapping_results/SRR7269359_1.fastq.gz.bam.csv

count_table %>% 
  head(15) %>%
  knitr::kable()

sample_id

rname

position

strand

count

WT1

AE004092.2

38

+

2

WT1

AE004092.2

205

+

819

WT1

AE004092.2

206

+

13

WT1

AE004092.2

211

+

7

WT1

AE004092.2

215

+

8

WT1

AE004092.2

217

+

34

WT1

AE004092.2

218

+

2

WT1

AE004092.2

219

+

2

WT1

AE004092.2

219

-

5

WT1

AE004092.2

236

+

15

WT1

AE004092.2

248

+

31

WT1

AE004092.2

253

+

13

WT1

AE004092.2

254

+

1

WT1

AE004092.2

262

+

14

WT1

AE004092.2

265

+

117

Downstream analysis

Differential abundances for prediction of cleavage sites

As in the article, we’re going to carry out a differential gene expression analysis (DGE), but at the resolution of an exact position to detect RNase Y endonucleolytic cleavage sites by comparing 3 S. pyogenes strains:

  • Wild-type

  • RNase Y deletion mutant

  • RNAse Y deletion mutant supplemented

To perform a DGE analysis the “edgeR” package is be used (which is not part of the requirements of the rnaends package).

We use as input data the count table we’ve just produced.

library(edgeR)

Pivot the count table to create a DGEList object

group1 <- c("WT", "WT", "WT")
group2 <- c("Y", "Y", "Y")
group3 <- c("suppl", "suppl", "suppl")

# Pivot count table
data <- count_table %>% 
  pivot_wider(id_cols = c(position, strand),
              names_from = sample_id,
              values_from = count, values_fill = 0)

# Transform into matrix for DGElist
mat <- data[,3:ncol(data)] %>% 
  as.matrix()
rownames(mat) <- paste0(data$position, data$strand)
mat %>% 
  head(15) %>%
  knitr::kable()

WT1

WT2

WT3

Y1

Y2

Y3

suppl1

suppl2

suppl3

38+

2

0

0

0

0

0

0

0

0

205+

819

491

358

1001

345

568

212

304

976

206+

13

22

5

44

17

117

3

4

21

211+

7

11

5

35

0

0

1

1

0

215+

8

1

10

0

4

4

1

0

0

217+

34

13

7

16

2

8

15

1

10

218+

2

0

0

0

0

0

0

0

0

219+

2

0

1

0

0

0

0

0

0

219-

5

0

0

0

0

0

0

0

0

236+

15

0

0

1

0

1

0

0

31

248+

31

15

2

6

0

9

45

0

1

253+

13

0

0

0

0

1

0

0

0

254+

1

0

0

3

0

0

0

0

0

262+

14

0

6

1

3

14

43

0

0

265+

117

33

1

0

15

11

0

0

0

DGEList and filter count according cpm

dge <- DGEList(counts = mat[,1:9], group = factor(c(group1, group2, group3)) )

# Keep position where there is at least 6 conditions with cpm>0.05
keep <- rowSums(cpm(dge) >= 0.05) >= 6
dge.filter <- dge[keep, ]

# Update dge (for racalcul of cpm)
dge2.filter <- DGEList(dge.filter$counts,
                       group = factor(c(group1, group2, group3)) )

# Keep position where there is at least 3 conditions with cpm>5
keep <- rowSums(cpm(dge2.filter) >= 5) >= 3
dge3.filter <- dge2.filter[keep, ] 

cpm(dge3.filter) %>% 
  head(15) %>% knitr::kable()

WT1

WT2

WT3

Y1

Y2

Y3

suppl1

suppl2

suppl3

205+

22.170898

13.4253747

9.1030977

27.186211

10.908131

16.3304562

8.5810845

9.3700179

26.9990941

2540+

5.820199

2.1054050

7.6028665

1.249316

1.359564

2.5588215

9.9572962

2.3425045

0.2489671

3452+

132.483974

189.8965934

68.4003711

212.057875

77.115743

78.1446831

47.9650241

67.0079572

177.7071523

4499+

6.253330

8.1481908

2.5173371

8.609419

1.960302

1.3225369

3.0357610

1.1712522

0.0276630

4512+

20.790293

7.3825889

7.2977348

14.828842

13.121375

4.6576301

11.6573223

7.3665602

0.1659780

11902+

12.777367

6.4255867

12.4341195

28.082459

14.259615

15.2954273

2.0238407

6.8117565

26.1138779

12203+

10.124439

8.6403634

4.2718447

4.725675

6.797821

6.6414355

1.4571653

7.6131396

0.8852162

13365+

8.418986

3.1991219

4.3481277

5.377492

7.493412

3.7376044

3.3190987

0.7089158

4.5367330

13441+

5.874340

6.9997880

3.6107259

5.784878

2.719128

3.3350932

4.4119727

0.2157570

0.0000000

13446+

12.750297

5.7693565

7.7554324

9.695781

4.584577

8.2514805

24.4075186

1.5719438

0.0276630

13451+

8.121208

0.1093717

9.4590848

7.197149

1.201475

0.5750161

3.4405292

0.3698691

0.2766301

13475+

12.235954

24.9094020

3.1530283

8.745215

4.963990

5.3764002

4.4119727

3.6062240

23.5412183

13527+

1.840807

0.2460863

0.9916782

14.204184

5.469874

6.6126847

0.0404768

0.3082243

21.0792108

13552+

17.541809

6.7263588

0.9662506

8.364988

7.019145

5.4626526

8.0548859

0.4931588

0.0276630

13559+

6.172118

3.8553520

0.3559871

16.376908

1.517653

4.1113649

4.2905422

0.0616449

9.3224331

Normalize and estimate dispersion

Normalization

dge3.filter <- calcNormFactors(dge3.filter, method = "TMM")

Dispersion

disp <- estimateDisp(dge3.filter)
#> Using classic mode.

Exact tests

Comparison WT vs RNase Y deletion mutant

et <- exactTest(disp, pair = c("WT", "Y"))
res_WT <- et$table 
res_WT$fdr <- p.adjust(res_WT$PValue, method = "fdr") # Adjusting p.values
res_WT$Position <- as.integer(str_sub(rownames(res_WT), 1, -2)) # Retrieving positions using stocked into the rownames
res_WT$strand <- str_sub(rownames(res_WT), -1) # Retrieving strand
res_WT %>% 
  head(15) %>%
  knitr::kable()

logFC

logCPM

PValue

fdr

Position

strand

205+

0.4757538

4.097125

0.6161737

1.0000000

205

+

2540+

-1.2969346

1.768362

0.3370351

1.0000000

2540

+

3452+

0.0168134

6.936504

0.9846036

1.0000000

3452

+

4499+

-0.4542850

1.736353

0.7499819

1.0000000

4499

+

4512+

0.0787080

3.143991

0.9482911

1.0000000

4512

+

11902+

1.0497216

3.917431

0.3388455

1.0000000

11902

+

12203+

-0.0990961

2.468814

0.9318575

1.0000000

12203

+

13365+

0.2956658

2.204602

0.8001578

1.0000000

13365

+

13441+

-0.3248103

1.752619

0.8400716

1.0000000

13441

+

13446+

0.0078938

2.913511

0.9962916

1.0000000

13446

+

13451+

-0.8812748

1.583629

0.5850758

1.0000000

13451

+

13475+

-0.9495031

3.505009

0.4234994

1.0000000

13475

+

13527+

3.3057756

2.826576

0.0722374

0.5765283

13527

+

13552+

-0.0277194

2.451620

0.9866344

1.0000000

13552

+

13559+

1.1973548

2.436253

0.4545305

1.0000000

13559

+

Comparison supplemented RNase Y deletion mutant vs. RNase Y deletion mutant

et <- exactTest(disp, pair = c("suppl", "Y"))
res_suppl <- et$table 
res_suppl$fdr <- p.adjust(res_suppl$PValue, method = "fdr") # Adjusting p.values
res_suppl$Position <- as.integer(str_sub(rownames(res_suppl),1,-2)) # Retrieving positions using stocked into the rownames
res_suppl$strand <- str_sub(rownames(res_suppl),-1) # Retrieving strand
res_suppl %>% 
  head(15) %>%
  knitr::kable()

logFC

logCPM

PValue

fdr

Position

strand

205+

-0.2700981

4.097125

0.7756757

0.9841472

205

+

2540+

-1.1317669

1.768362

0.4033664

0.8990873

2540

+

3452+

-0.2658754

6.936504

0.7586034

0.9797373

3452

+

4499+

1.4232888

1.736353

0.3289756

0.8528307

4499

+

4512+

0.7377628

3.143991

0.5398056

0.9480726

4512

+

11902+

0.0581356

3.917431

0.9581940

0.9984513

11902

+

12203+

0.7146266

2.468814

0.5326933

0.9457086

12203

+

13365+

0.5319706

2.204602

0.6509544

0.9710619

13365

+

13441+

1.4779476

1.752619

0.3635666

0.8768117

13441

+

13446+

-0.0561098

2.913511

0.9706936

0.9984513

13446

+

13451+

1.0310394

1.583629

0.5285491

0.9456388

13451

+

13475+

-1.3589715

3.505009

0.2557154

0.7890100

13475

+

13527+

-0.5195980

2.826576

0.7595278

0.9799139

13527

+

13552+

1.4159979

2.451620

0.3564504

0.8744636

13552

+

13559+

0.0373990

2.436253

0.9827840

0.9990971

13559

+

Visualization

res_WT %>% 
  ggplot(aes(x = logFC, y = -log10(fdr), color = logCPM)) +
  geom_point(alpha = 0.5) +
  theme_bw() + 
  geom_vline(xintercept = -1, col = "red") +
  geom_hline(yintercept = -log10(0.05), col = "red")

Merge comparisons and select significant positions

To identify a position as a RNase Y endonucleolytic cleavage site, Broglia et al. keep only positions where the fdr is lower than 0.01 and logFC lower of equal to -1 in both comparison we’ve made.

So we are going to merge our both result tables then filter following these criterias.

res <- res_suppl %>% 
  inner_join(res_WT, by = c("Position","strand"), suffix = c(".suppl", ".WT")) 

Number of cleavage sites identified:

endoY <- res %>% 
  filter(fdr.WT<0.05 & fdr.suppl<0.05 & logFC.WT<= -1 & logFC.suppl<= -1)
endoY %>%
  nrow
#> [1] 532

Number of clusters of cleavage sites identified:

endoY <- endoY %>% 
 arrange(Position) %>%
 mutate(ID = cumsum(c(1, diff(Position) >= 5))) %>% 
 arrange(desc(ID)) 
endoY %>% 
  group_by(ID) %>% 
  n_groups
#> [1] 381

Comparison with Broglia results

We provide a small table of RNase Y cleavage sites identifed by Broglia et al. using 5’-end RNA-Seq in the extdata directory our GitLab project in order to make the comparison.

Number of RNase Y cleavage sites identified using 5’-end RNA-Seq:

csy <- read_csv("../extdata/Spyo_Broglia/RNaseY_cleavage_site_5Prime.csv", show_col_types = F)
csy %>% 
  nrow
#> [1] 190

Number of RNase Y cleavage sites found in common:

common <- endoY %>% 
  inner_join(csy, by="Position")
common %>%
  nrow
#> [1] 119

Number of RNase Y cleavage sites not found by us:

not_found <- csy %>% 
  anti_join(endoY, by ="Position")
not_found %>%
  nrow
#> [1] 71

Number of sites found not significant:

res %>% 
  filter(Position %in% not_found$Position) %>% 
  nrow
#> [1] 32

Number of sites not tested due to CPM filtering:

not_found %>% 
  anti_join(res, by = "Position") %>% 
  nrow 
#> [1] 39