TSS identification using 5’-end RNA-Seq data

Complete source of this page is available in the notebook TSS_identification.Rmd.

library(tidyverse)
library(rnaends)

Based on the publication from Prados J, Linder P, Redder P (2016) TSS-EMOTE, a refined protocol for a more complete and less biased global mapping of transcription start sites in bacterial pathogens. BMC Genomics, 17, 849. https://doi.org/10.1186/s12864-016-3211-3.

The library preparation protocol is as follows. Briefly, in this protocol, total RNA is purified and digested with a 5’-exonuclease which removes the vast majority of 5’ mono-phosphorylated RNAs. The RNA is mixed with a synthetic RNA oligo (specific adapter) and split into two pools. In the first pool (noted +RppH), both a ligase and an RNA 5’ pyrophosphohydrolase RppH enzymes are added, while in the second pool (labeled -RppH) only the ligase is added. RppH converts the 5’PPP to 5’P that can be used by the ligase to add the adapter oligo to the RNA sequences (for detailed protocol see (Prados et al., 2016)). The “-RppH” pool serves as control to measure the level of background ligation and avoid false positive TSS detection. Thus, a given genome position will be predicted as a TSS if the number of reads from the “+RppH” pool mapping at that position is significantly higher than the number of reads from the “-RppH” pool by applying a beta-binomial test as described in (Prados et al., 2016)

Download datasets

Data from this article are available on Gene Expression Omnibus (GEO) database: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85110

For this notebook, only 4 samples will be used that correspond to 5’-end RNA-Seq data from Staphylococcus aureus (MW2 strain) grown on RPMI liquid at 37°C (GSM2257851, GSM2257852, GSM2257859, GSM2257860).

We download data directly in a the data directory of the GitLab project in the sub-directory called TSS-EMOTE_Prados where the reference genome is already provided when the project is cloned.

options(timeout = 3600) # for download time, set more if timeout occurs

list_SRR <- c("SRR3994381", "SRR3994382", "SRR3994390", "SRR3994389")

for (i in list_SRR){
    raw_fastq <- paste0("../extdata/TSS-EMOTE_Prados/", i, ".fastq.gz")
    if (!file.exists(raw_fastq))
        download.file(paste0("https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=",i),
                      destfile = raw_fastq)
}

Pre-processing of 5’-end RNA-Seq reads

Parse reads (+ read features)

Reads are not ready-to-map yet since they still have an adapter (the RP6 oligonucleotide) located at the 5’-end of each read.

We use parse_reads, in order to check that each read contains the expected elements in the adapter, the remove this oligonucleotide from the reads (see RP6 oligo TSS-EMOTE).

parse_reads requires a read_features table which defines the expected elements on the read.

Following the Materials & Methods section of the article, the structure of the reads can be schematized as follows (X is the mappable part):

      1       2            5      6            20     21    27      28 30     31                  50
      N     - (EMOTE barcode) -   CGGCACCAACCGAGG    - VVVVVVV -     CGC     - XXXXXXXXXXXXXXXXXXXXX
1 random nt -  EMOTE barcode  -   recognition.seq    -   UMI   - control.seq -     RNA 5'-end

We use init_read_features and add_read_feature functions to create the read_features table following the above structure.

The initialization of the read_features table is made by init_read_features since the only required feature is the readseq feature (the mappable part to keep in the FASTQ file).

According to our scheme, we know that the readseq part begins at location 31 and are 20 nucleotides long (illumina 50bp).

rf <- init_read_features(start = 31, width = 20)

We add other features such as the barcode which begins at position 2 and are 4 bases long with the following specifications:

  • pattern : enables to specify which barcode sequence(s) are allowed, here we allowed 4 different sequences

  • pattern_type: is used to know which type of pattern it is, “constant” means that we check the exact pattern sequences allowing N number of mismatches (specified by the max_mismatch parameter)

  • readid_prepend: is used to store the retrieved feature in the read identifier. It will be useful to keep a track of features since they are removed from the reads and therefore absent from the FASTQ output file.

rf <- add_read_feature(rf,
                       name = "barcode",
                       start = 2,
                       width = 4,
                       pattern = c("TCGG","CAAG","CGTC","AAGT"),
                       pattern_type = "constant",
                       max_mismatch = 0,
                       readid_prepend = FALSE)

The recognition sequence feature begins at position 6 and is 15 bases long, and we expect a single unique sequence with maximum 1 mismatch:

rf <- add_read_feature(rf,
                       name = "RS",
                       start = 6,
                       width = 15,
                       pattern = "CGGCACCAACCGAGG",
                       pattern_type = "constant",
                       max_mismatch = 1,
                       readid_prepend = FALSE)

The UMI feature begins at position 21 and is 7 bases long. Its pattern type is different from the previous ones because: as UMIs are random sequences, we do not expect a particular sequence (nor list of sequences) but a sequence of random characters (A, C or G) hence the patern_type = “nucleotides”. We set readid_prepend = TRUE because we will need the UMIs to remove PCR duplicates introduced by the library preparation protocol.

Note: the readseq feature has the same pattern_type than UMI and is set by default to A,C,T or G.

rf <- add_read_feature(rf,
                       name = "UMI",
                       start = 21,
                       width = 7,
                       pattern = "ACG",
                       pattern_type = "nucleotides",
                       max_mismatch = 0,
                       readid_prepend = TRUE)

The control sequence feature begins at position 28 and is 3 bases long, and we expect a specific and unique sequence:

rf <- add_read_feature(rf,
                       name = "CS",
                       start = 28,
                       width = 3,
                       pattern = "CGC",
                       pattern_type = "constant",
                       max_mismatch = 0,
                       readid_prepend = FALSE)

The resulting read_features table is:

rf %>%
  knitr::kable()

name

start

width

pattern_type

pattern

max_mismatch

readid_prepend

readseq

31

20

nucleotides

ACTG

0

FALSE

barcode

2

4

constant

TCGG, CA….

0

FALSE

RS

6

15

constant

CGGCACCA….

1

FALSE

UMI

21

7

nucleotides

ACG

0

TRUE

CS

28

3

constant

CGC

0

FALSE

Now that we have built the read_features table, we can parse the 4 FASTQ files with the parse_reads function.

Note: mclapply function from the parallel package is used to perform all the 4 processing in parallel but lapply can be used instead.

parsing_report <- lapply(
    list.files("../extdata/TSS-EMOTE_Prados/", pattern = ".fastq.gz$", full.names = TRUE),
    FUN = parse_reads,
    features = rf, # The read_features table
    keep_invalid_reads = FALSE, # do not store invalid reads
    out_dir = "../extdata/TSS-EMOTE_Prados/parse_results", # Path of the out directory (optional)
    force = FALSE
)
#> Demux report already exists set force=T to overwrite: ../extdata/TSS-EMOTE_Prados/parse_results/SRR3994381_parse_report.csv
#> Demux report already exists set force=T to overwrite: ../extdata/TSS-EMOTE_Prados/parse_results/SRR3994382_parse_report.csv
#> Demux report already exists set force=T to overwrite: ../extdata/TSS-EMOTE_Prados/parse_results/SRR3994389_parse_report.csv
#> Demux report already exists set force=T to overwrite: ../extdata/TSS-EMOTE_Prados/parse_results/SRR3994390_parse_report.csv
parsing_report %>%
  bind_rows %>%
  knitr::kable()

is_valid_readseq

is_valid_barcode

is_valid_RS

is_valid_UMI

is_valid_CS

count

FALSE

FALSE

TRUE

TRUE

FALSE

1

FALSE

TRUE

TRUE

TRUE

TRUE

1444

TRUE

FALSE

FALSE

TRUE

FALSE

272

TRUE

FALSE

TRUE

TRUE

FALSE

2668

TRUE

TRUE

TRUE

TRUE

TRUE

5189855

FALSE

FALSE

FALSE

TRUE

FALSE

3

FALSE

FALSE

TRUE

TRUE

FALSE

3

FALSE

TRUE

TRUE

TRUE

TRUE

2565

TRUE

FALSE

FALSE

TRUE

FALSE

3166

TRUE

FALSE

TRUE

TRUE

FALSE

21274

TRUE

TRUE

TRUE

TRUE

FALSE

1320

TRUE

TRUE

TRUE

TRUE

TRUE

8561052

FALSE

FALSE

FALSE

TRUE

FALSE

4

FALSE

TRUE

TRUE

TRUE

TRUE

2378

TRUE

FALSE

FALSE

TRUE

FALSE

2789

TRUE

FALSE

TRUE

TRUE

FALSE

1308

TRUE

TRUE

TRUE

TRUE

TRUE

3177298

FALSE

FALSE

FALSE

TRUE

FALSE

1

FALSE

FALSE

TRUE

TRUE

FALSE

1

FALSE

TRUE

TRUE

TRUE

TRUE

2876

TRUE

FALSE

FALSE

TRUE

FALSE

2924

TRUE

FALSE

TRUE

TRUE

FALSE

1237

TRUE

TRUE

TRUE

TRUE

TRUE

3774778

The parsing_report provides information on the validity of reads in each FASTQ file.

We can aggregate and visualize the global statititics for all reports with an upset plot:

parsing_report %>%
  bind_rows %>%
  group_by(across(starts_with("is_valid_"))) %>%
  summarise(count = sum(count), .groups = "drop") %>%
  reads_parse_report_ggupset(angle = 45, margin=0.3)

Quantification

Mapping

Reads are now ready-to-map on the reference genome using the map_fastq_files function.

Note: here, we use the function that allow the mapping of multiple FASTQ files in parallel which is a wrapper that calls the map_fastq function for each of the FASTQ files provided.

bf <- map_fastq_files(
  list_fastq_files = list.files("../extdata/TSS-EMOTE_Prados/parse_results", 
                                pattern = "_valid.fastq.gz$", 
                                full.names = TRUE),
  index_name = "../extdata/TSS-EMOTE_Prados/MW2.sr", # Path of the genome index (created if not exists)
  fasta_name = "../extdata/TSS-EMOTE_Prados/MW2.fasta", # Path of the reference genome file
  bam_dir = "../extdata/TSS-EMOTE_Prados/mapping_results_subread", # Output directory
  max_map_mismatch = 1,
  aligner = "subread", # either bowtie or subread
  map_unique = FALSE,
  force = FALSE,
  threads = 3)
#> bam_file '../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994381_valid.fastq.gz.bam' already exists, set force=TRUE to recompute
#> bam_file '../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994382_valid.fastq.gz.bam' already exists, set force=TRUE to recompute
#> bam_file '../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994389_valid.fastq.gz.bam' already exists, set force=TRUE to recompute
#> bam_file '../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994390_valid.fastq.gz.bam' already exists, set force=TRUE to recompute
bf
#> [[1]]
#> [1] "../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994381_valid.fastq.gz.bam"
#> 
#> [[2]]
#> [1] "../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994382_valid.fastq.gz.bam"
#> 
#> [[3]]
#> [1] "../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994389_valid.fastq.gz.bam"
#> 
#> [[4]]
#> [1] "../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994390_valid.fastq.gz.bam"

Mapping statistics

Use the map_stats_bam_files function to get some basic statistics on the mapping we’ve just performed:

map_stats_bam_files(bam_files = bf) %>%
  knitr::kable()
#> BAM file mapping stats exists '../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994381_valid.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute
#> BAM file mapping stats exists '../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994382_valid.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute
#> BAM file mapping stats exists '../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994389_valid.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute
#> BAM file mapping stats exists '../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994390_valid.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute

bam_file

mapped

unmapped

pc_mapped

SRR3994381_valid.fastq.gz.bam

2001012

3188843

38.56

SRR3994382_valid.fastq.gz.bam

3436440

5124612

40.14

SRR3994389_valid.fastq.gz.bam

781940

2395358

24.61

SRR3994390_valid.fastq.gz.bam

1433740

2341038

37.98

Quantification (counts computation)

In order to obtain counts per position, we use the quantify_from_bam_files function to parse previously obtained bam_files.

This function allows the quantification of either the 5’-ends (quantify_5prime) or the 3’-ends (quantify_3prime), in both cases, it is an exact position quantification.

For UMIs to be taken into account for the quantification, we provide the read_features table with the “features” parameter. The keep_UMI parameter is used to record the number of copy of each UMI by exact position. This level of information is not needed for this analysis and we use it as default (FALSE).

count_table <- quantify_from_bam_files(bam_files = bf,
                                       sample_ids = c("rep1-", "rep1+", "rep2-", "rep2+"),
                                       fun = quantify_5prime, # Either quantify_5prime or quantify_3prime
                                       features = rf,
                                       keep_UMI = FALSE)
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994381_valid.fastq.gz.bam.csv
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994382_valid.fastq.gz.bam.csv
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994389_valid.fastq.gz.bam.csv
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994390_valid.fastq.gz.bam.csv
count_table %>% 
  head(15) %>%
  knitr::kable()

sample_id

rname

position

strand

reads

count

rep1-

NC_003923.1

313

+

2

2

rep1-

NC_003923.1

425

+

34

9

rep1-

NC_003923.1

535

+

11

1

rep1-

NC_003923.1

637

+

1

1

rep1-

NC_003923.1

689

+

1

1

rep1-

NC_003923.1

791

+

2

1

rep1-

NC_003923.1

868

+

2

2

rep1-

NC_003923.1

911

+

2

1

rep1-

NC_003923.1

918

+

6

1

rep1-

NC_003923.1

1009

+

3

1

rep1-

NC_003923.1

1166

+

2

1

rep1-

NC_003923.1

1309

+

10

1

rep1-

NC_003923.1

1310

+

3

1

rep1-

NC_003923.1

1357

-

1

1

rep1-

NC_003923.1

1890

+

1

1

The above count table will allow us to perform some downstream analysis, i.e. TSS identification.

Note: here, the count column corresponds to the number of different UMIs mapped at this exact position. This is why it is lower or equal to the reads values (due to PCR duplicates).

Downstream analysis: TSS identification

Design table

The identification of TSS requires the comparison of 5’P and 5’P+5’PPP samples with multiple replicates. We thus build a design table to identify which samples is what.

The content of the design table is free as it includes the colnames:

  • sample_id (same as in the count table)

  • pool (library preparation for sequencing either “mono” for 5’P or “total” for 5’P+5’PPP)

  • replicate (p-values are combined by replicate)

design_tb <- tibble(
  sample_id = c("rep1-", "rep1+", "rep2-", "rep2+"),
  count_of = c("fraction", "total", "fraction", "total" ),
  replicate = c(1, 1, 2, 2)
)
design_tb %>%
  knitr::kable()

sample_id

count_of

replicate

rep1-

fraction

1

rep1+

total

1

rep2-

fraction

2

rep2+

total

2

Identification of TSS

We finally use the betabinom_TSS function to identify TSSs with the count table and the design table freshly built as inputs.

Note: the filtering of data is made by the function (and not before) because it needs the total number of counts before filtering.

bbin <- TSS(count_table = count_table,
            design_table = design_tb,
            threshold = 5, # Number minimum reads allowed in the total sample
            combine_method = "fisher", # Method for the p-values combination
            fdr_method = "fdr" # Method for adjusting p-value (see p.adjust)
            )
bbin %>% 
  head(15) %>%
  knitr::kable()

rname

position

strand

pvalue

pvalue_adj

NC_003923.1

312

+

0.0017109

0.0071046

NC_003923.1

313

+

0.0000000

0.0000000

NC_003923.1

314

+

0.0000001

0.0000008

NC_003923.1

1642

+

0.0725677

0.1682600

NC_003923.1

2131

+

0.0001241

0.0006330

NC_003923.1

3649

+

0.0000009

0.0000058

NC_003923.1

3650

+

0.0000000

0.0000000

NC_003923.1

3652

+

0.0000001

0.0000006

NC_003923.1

4128

+

0.0025859

0.0103542

NC_003923.1

4159

-

0.0000194

0.0001102

NC_003923.1

4301

+

0.1746218

0.3408392

NC_003923.1

6734

+

0.0182064

0.0576161

NC_003923.1

10687

-

0.0000000

0.0000000

NC_003923.1

10855

+

0.0000004

0.0000030

NC_003923.1

10856

+

0.1536700

0.3042540

Number of TSSs identified with an adjusted p-value (FDR) < 0.01: 2677.

Clusters of TSS

As in the Prados et al. paper we can clusterise TSSs by regrouping those less than 5 nucleotides apart.

bbin <- bbin %>% 
  filter(pvalue_adj < 0.01) %>% 
  arrange(position) %>%
   mutate(cluster_id = cumsum(c(1, diff(position) >= 5)))
bbin %>%
  head(15) %>%
  knitr::kable()

rname

position

strand

pvalue

pvalue_adj

cluster_id

NC_003923.1

312

+

0.0017109

0.0071046

1

NC_003923.1

313

+

0.0000000

0.0000000

1

NC_003923.1

314

+

0.0000001

0.0000008

1

NC_003923.1

2131

+

0.0001241

0.0006330

2

NC_003923.1

3649

+

0.0000009

0.0000058

3

NC_003923.1

3650

+

0.0000000

0.0000000

3

NC_003923.1

3652

+

0.0000001

0.0000006

3

NC_003923.1

4159

-

0.0000194

0.0001102

4

NC_003923.1

10687

-

0.0000000

0.0000000

5

NC_003923.1

10855

+

0.0000004

0.0000030

6

NC_003923.1

12473

+

0.0000000

0.0000000

7

NC_003923.1

14672

+

0.0000634

0.0003335

8

NC_003923.1

15797

+

0.0000000

0.0000000

9

NC_003923.1

17356

+

0.0000000

0.0000000

10

NC_003923.1

17357

+

0.0000000

0.0000000

10

Number of cluster of TSSs identified with an adjusted p-value (fdr) < 0.01: 1923.