TSS identification using 5’-end RNA-Seq data

Complete source of this page is available in the notebook TSS_identification.Rmd.

library(tidyverse)

library(rnaends)

Based on the publication from Prados J, Linder P, Redder P (2016) TSS-EMOTE, a refined protocol for a more complete and less biased global mapping of transcription start sites in bacterial pathogens. BMC Genomics, 17, 849. https://doi.org/10.1186/s12864-016-3211-3.

The library preparation protocol is as follows. Briefly, in this protocol, total RNA is purified and digested with a 5’-exonuclease which removes the vast majority of 5’ mono-phosphorylated RNAs. The RNA is mixed with a synthetic RNA oligo (specific adapter) and split into two pools. In the first pool (noted +RppH), both a ligase and an RNA 5’ pyrophosphohydrolase RppH enzymes are added, while in the second pool (labeled -RppH) only the ligase is added. RppH converts the 5’PPP to 5’P that can be used by the ligase to add the adapter oligo to the RNA sequences (for detailed protocol see (Prados et al., 2016)). The “-RppH” pool serves as control to measure the level of background ligation and avoid false positive TSS detection. Thus, a given genome position will be predicted as a TSS if the number of reads from the “+RppH” pool mapping at that position is significantly higher than the number of reads from the “-RppH” pool by applying a beta-binomial test as described in (Prados et al., 2016)

Download datasets

Data from this article are available on Gene Expression Omnibus (GEO) database: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85110

For this notebook, only 4 samples will be used that correspond to 5’-end RNA-Seq data from Staphylococcus aureus (MW2 strain) grown on RPMI liquid at 37°C (GSM2257851, GSM2257852, GSM2257859, GSM2257860).

We download data directly in a the data directory of the GitLab project in the sub-directory called TSS-EMOTE_Prados where the reference genome is already provided when the project is cloned.

options(timeout = 3600) # for download time, set more if timeout occurs

list_SRR <- c("SRR3994381", "SRR3994382", "SRR3994390", "SRR3994389")

for (i in list_SRR){
    raw_fastq <- paste0("../extdata/TSS-EMOTE_Prados/", i, ".fastq.gz")
    if (!file.exists(raw_fastq))
        download.file(paste0("https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=",i),
                      destfile = raw_fastq)
}

Pre-processing of 5’-end RNA-Seq reads

Parse reads (+ read features)

Reads are not ready-to-map yet since they still have an adapter (the RP6 oligonucleotide) located at the 5’-end of each read.

We use parse_reads, in order to check that each read contains the expected elements in the adapter, the remove this oligonucleotide from the reads (see RP6 oligo TSS-EMOTE).

parse_reads requires a read_features table which defines the expected elements on the read.

Following the Materials & Methods section of the article, the structure of the reads can be schematized as follows (X is the mappable part):

      1       2            5      6            20     21    27      28 30     31                  50
      N     - (EMOTE barcode) -   CGGCACCAACCGAGG    - VVVVVVV -     CGC     - XXXXXXXXXXXXXXXXXXXXX
1 random nt -  EMOTE barcode  -   recognition.seq    -   UMI   - control.seq -     RNA 5'-end

We use init_read_features and add_read_feature functions to create the read_features table following the above structure.

The initialization of the read_features table is made by init_read_features since the only required feature is the readseq feature (the mappable part to keep in the FASTQ file).

According to our scheme, we know that the readseq part begins at location 31 and are 20 nucleotides long (illumina 50bp).

rf <- init_read_features(start = 31, width = 20)

We add other features such as the barcode which begins at position 2 and are 4 bases long with the following specifications:

pattern : enables to specify which barcode sequence(s) are allowed, here we allowed 4 different sequences
pattern_type: is used to know which type of pattern it is, “constant” means that we check the exact pattern sequences allowing N number of mismatches (specified by the max_mismatch parameter)
readid_prepend: is used to store the retrieved feature in the read identifier. It will be useful to keep a track of features since they are removed from the reads and therefore absent from the FASTQ output file.

rf <- add_read_feature(rf,
                       name = "barcode",
                       start = 2,
                       width = 4,
                       pattern = c("TCGG","CAAG","CGTC","AAGT"),
                       pattern_type = "constant",
                       max_mismatch = 0,
                       readid_prepend = FALSE)

The recognition sequence feature begins at position 6 and is 15 bases long, and we expect a single unique sequence with maximum 1 mismatch:

rf <- add_read_feature(rf,
                       name = "RS",
                       start = 6,
                       width = 15,
                       pattern = "CGGCACCAACCGAGG",
                       pattern_type = "constant",
                       max_mismatch = 1,
                       readid_prepend = FALSE)

The UMI feature begins at position 21 and is 7 bases long. Its pattern type is different from the previous ones because: as UMIs are random sequences, we do not expect a particular sequence (nor list of sequences) but a sequence of random characters (A, C or G) hence the patern_type = “nucleotides”. We set readid_prepend = TRUE because we will need the UMIs to remove PCR duplicates introduced by the library preparation protocol.

Note: the readseq feature has the same pattern_type than UMI and is set by default to A,C,T or G.

rf <- add_read_feature(rf,
                       name = "UMI",
                       start = 21,
                       width = 7,
                       pattern = "ACG",
                       pattern_type = "nucleotides",
                       max_mismatch = 0,
                       readid_prepend = TRUE)

The control sequence feature begins at position 28 and is 3 bases long, and we expect a specific and unique sequence:

rf <- add_read_feature(rf,
                       name = "CS",
                       start = 28,
                       width = 3,
                       pattern = "CGC",
                       pattern_type = "constant",
                       max_mismatch = 0,
                       readid_prepend = FALSE)

The resulting read_features table is:

rf %>%
  knitr::kable()

name	start	width	pattern_type	pattern	max_mismatch	readid_prepend
readseq	31	20	nucleotides	ACTG	0	FALSE
barcode	2	4	constant	TCGG, CA….	0	FALSE
RS	6	15	constant	CGGCACCA….	1	FALSE
UMI	21	7	nucleotides	ACG	0	TRUE
CS	28	3	constant	CGC	0	FALSE

Now that we have built the read_features table, we can parse the 4 FASTQ files with the parse_reads function.

Note: mclapply function from the parallel package is used to perform all the 4 processing in parallel but lapply can be used instead.

parsing_report <- lapply(
    list.files("../extdata/TSS-EMOTE_Prados/", pattern = ".fastq.gz$", full.names = TRUE),
    FUN = parse_reads,
    features = rf, # The read_features table
    keep_invalid_reads = FALSE, # do not store invalid reads
    out_dir = "../extdata/TSS-EMOTE_Prados/parse_results", # Path of the out directory (optional)
    force = FALSE
)
#> Demux report already exists set force=T to overwrite: ../extdata/TSS-EMOTE_Prados/parse_results/SRR3994381_parse_report.csv
#> Demux report already exists set force=T to overwrite: ../extdata/TSS-EMOTE_Prados/parse_results/SRR3994382_parse_report.csv
#> Demux report already exists set force=T to overwrite: ../extdata/TSS-EMOTE_Prados/parse_results/SRR3994389_parse_report.csv
#> Demux report already exists set force=T to overwrite: ../extdata/TSS-EMOTE_Prados/parse_results/SRR3994390_parse_report.csv

parsing_report %>%
  bind_rows %>%
  knitr::kable()

is_valid_readseq	is_valid_barcode	is_valid_RS	is_valid_UMI	is_valid_CS	count
FALSE	FALSE	TRUE	TRUE	FALSE	1
FALSE	TRUE	TRUE	TRUE	TRUE	1444
TRUE	FALSE	FALSE	TRUE	FALSE	272
TRUE	FALSE	TRUE	TRUE	FALSE	2668
TRUE	TRUE	TRUE	TRUE	TRUE	5189855
FALSE	FALSE	FALSE	TRUE	FALSE	3
FALSE	FALSE	TRUE	TRUE	FALSE	3
FALSE	TRUE	TRUE	TRUE	TRUE	2565
TRUE	FALSE	FALSE	TRUE	FALSE	3166
TRUE	FALSE	TRUE	TRUE	FALSE	21274
TRUE	TRUE	TRUE	TRUE	FALSE	1320
TRUE	TRUE	TRUE	TRUE	TRUE	8561052
FALSE	FALSE	FALSE	TRUE	FALSE	4
FALSE	TRUE	TRUE	TRUE	TRUE	2378
TRUE	FALSE	FALSE	TRUE	FALSE	2789
TRUE	FALSE	TRUE	TRUE	FALSE	1308
TRUE	TRUE	TRUE	TRUE	TRUE	3177298
FALSE	FALSE	FALSE	TRUE	FALSE	1
FALSE	FALSE	TRUE	TRUE	FALSE	1
FALSE	TRUE	TRUE	TRUE	TRUE	2876
TRUE	FALSE	FALSE	TRUE	FALSE	2924
TRUE	FALSE	TRUE	TRUE	FALSE	1237
TRUE	TRUE	TRUE	TRUE	TRUE	3774778

The parsing_report provides information on the validity of reads in each FASTQ file.

We can aggregate and visualize the global statititics for all reports with an upset plot:

parsing_report %>%
  bind_rows %>%
  group_by(across(starts_with("is_valid_"))) %>%
  summarise(count = sum(count), .groups = "drop") %>%
  reads_parse_report_ggupset(angle = 45, margin=0.3)

Quantification

Mapping

Reads are now ready-to-map on the reference genome using the map_fastq_files function.

Note: here, we use the function that allow the mapping of multiple FASTQ files in parallel which is a wrapper that calls the map_fastq function for each of the FASTQ files provided.

bf <- map_fastq_files(
  list_fastq_files = list.files("../extdata/TSS-EMOTE_Prados/parse_results", 
                                pattern = "_valid.fastq.gz$", 
                                full.names = TRUE),
  index_name = "../extdata/TSS-EMOTE_Prados/MW2.sr", # Path of the genome index (created if not exists)
  fasta_name = "../extdata/TSS-EMOTE_Prados/MW2.fasta", # Path of the reference genome file
  bam_dir = "../extdata/TSS-EMOTE_Prados/mapping_results_subread", # Output directory
  max_map_mismatch = 1,
  aligner = "subread", # either bowtie or subread
  map_unique = FALSE,
  force = FALSE,
  threads = 3)
#> bam_file '../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994381_valid.fastq.gz.bam' already exists, set force=TRUE to recompute
#> bam_file '../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994382_valid.fastq.gz.bam' already exists, set force=TRUE to recompute
#> bam_file '../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994389_valid.fastq.gz.bam' already exists, set force=TRUE to recompute
#> bam_file '../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994390_valid.fastq.gz.bam' already exists, set force=TRUE to recompute
bf
#> [[1]]
#> [1] "../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994381_valid.fastq.gz.bam"
#> 
#> [[2]]
#> [1] "../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994382_valid.fastq.gz.bam"
#> 
#> [[3]]
#> [1] "../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994389_valid.fastq.gz.bam"
#> 
#> [[4]]
#> [1] "../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994390_valid.fastq.gz.bam"

Mapping statistics

Use the map_stats_bam_files function to get some basic statistics on the mapping we’ve just performed:

map_stats_bam_files(bam_files = bf) %>%
  knitr::kable()
#> BAM file mapping stats exists '../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994381_valid.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute
#> BAM file mapping stats exists '../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994382_valid.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute
#> BAM file mapping stats exists '../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994389_valid.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute
#> BAM file mapping stats exists '../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994390_valid.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute

bam_file	mapped	unmapped	pc_mapped
SRR3994381_valid.fastq.gz.bam	2001012	3188843	38.56
SRR3994382_valid.fastq.gz.bam	3436440	5124612	40.14
SRR3994389_valid.fastq.gz.bam	781940	2395358	24.61
SRR3994390_valid.fastq.gz.bam	1433740	2341038	37.98

Quantification (counts computation)

In order to obtain counts per position, we use the quantify_from_bam_files function to parse previously obtained bam_files.

This function allows the quantification of either the 5’-ends (quantify_5prime) or the 3’-ends (quantify_3prime), in both cases, it is an exact position quantification.

For UMIs to be taken into account for the quantification, we provide the read_features table with the “features” parameter. The keep_UMI parameter is used to record the number of copy of each UMI by exact position. This level of information is not needed for this analysis and we use it as default (FALSE).

count_table <- quantify_from_bam_files(bam_files = bf,
                                       sample_ids = c("rep1-", "rep1+", "rep2-", "rep2+"),
                                       fun = quantify_5prime, # Either quantify_5prime or quantify_3prime
                                       features = rf,
                                       keep_UMI = FALSE)
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994381_valid.fastq.gz.bam.csv
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994382_valid.fastq.gz.bam.csv
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994389_valid.fastq.gz.bam.csv
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994390_valid.fastq.gz.bam.csv
count_table %>% 
  head(15) %>%
  knitr::kable()

sample_id	rname	position	strand	reads	count
rep1-	NC_003923.1	313	+	2	2
rep1-	NC_003923.1	425	+	34	9
rep1-	NC_003923.1	535	+	11	1
rep1-	NC_003923.1	637	+	1	1
rep1-	NC_003923.1	689	+	1	1
rep1-	NC_003923.1	791	+	2	1
rep1-	NC_003923.1	868	+	2	2
rep1-	NC_003923.1	911	+	2	1
rep1-	NC_003923.1	918	+	6	1
rep1-	NC_003923.1	1009	+	3	1
rep1-	NC_003923.1	1166	+	2	1
rep1-	NC_003923.1	1309	+	10	1
rep1-	NC_003923.1	1310	+	3	1
rep1-	NC_003923.1	1357	-	1	1
rep1-	NC_003923.1	1890	+	1	1

The above count table will allow us to perform some downstream analysis, i.e. TSS identification.

Note: here, the count column corresponds to the number of different UMIs mapped at this exact position. This is why it is lower or equal to the reads values (due to PCR duplicates).

Downstream analysis: TSS identification

Design table

The identification of TSS requires the comparison of 5’P and 5’P+5’PPP samples with multiple replicates. We thus build a design table to identify which samples is what.

The content of the design table is free as it includes the colnames:

sample_id (same as in the count table)
pool (library preparation for sequencing either “mono” for 5’P or “total” for 5’P+5’PPP)
replicate (p-values are combined by replicate)

design_tb <- tibble(
  sample_id = c("rep1-", "rep1+", "rep2-", "rep2+"),
  count_of = c("fraction", "total", "fraction", "total" ),
  replicate = c(1, 1, 2, 2)
)
design_tb %>%
  knitr::kable()

sample_id	count_of	replicate
rep1-	fraction	1
rep1+	total	1
rep2-	fraction	2
rep2+	total	2

Identification of TSS

We finally use the betabinom_TSS function to identify TSSs with the count table and the design table freshly built as inputs.

Note: the filtering of data is made by the function (and not before) because it needs the total number of counts before filtering.

bbin <- TSS(count_table = count_table,
            design_table = design_tb,
            threshold = 5, # Number minimum reads allowed in the total sample
            combine_method = "fisher", # Method for the p-values combination
            fdr_method = "fdr" # Method for adjusting p-value (see p.adjust)
            )
bbin %>% 
  head(15) %>%
  knitr::kable()

rname	position	strand	pvalue	pvalue_adj
NC_003923.1	312	+	0.0017109	0.0071046
NC_003923.1	313	+	0.0000000	0.0000000
NC_003923.1	314	+	0.0000001	0.0000008
NC_003923.1	1642	+	0.0725677	0.1682600
NC_003923.1	2131	+	0.0001241	0.0006330
NC_003923.1	3649	+	0.0000009	0.0000058
NC_003923.1	3650	+	0.0000000	0.0000000
NC_003923.1	3652	+	0.0000001	0.0000006
NC_003923.1	4128	+	0.0025859	0.0103542
NC_003923.1	4159	-	0.0000194	0.0001102
NC_003923.1	4301	+	0.1746218	0.3408392
NC_003923.1	6734	+	0.0182064	0.0576161
NC_003923.1	10687	-	0.0000000	0.0000000
NC_003923.1	10855	+	0.0000004	0.0000030
NC_003923.1	10856	+	0.1536700	0.3042540

Number of TSSs identified with an adjusted p-value (FDR) < 0.01: 2677.

Clusters of TSS

As in the Prados et al. paper we can clusterise TSSs by regrouping those less than 5 nucleotides apart.

bbin <- bbin %>% 
  filter(pvalue_adj < 0.01) %>% 
  arrange(position) %>%
   mutate(cluster_id = cumsum(c(1, diff(position) >= 5)))
bbin %>%
  head(15) %>%
  knitr::kable()

rname	position	strand	pvalue	pvalue_adj	cluster_id
NC_003923.1	312	+	0.0017109	0.0071046	1
NC_003923.1	313	+	0.0000000	0.0000000	1
NC_003923.1	314	+	0.0000001	0.0000008	1
NC_003923.1	2131	+	0.0001241	0.0006330	2
NC_003923.1	3649	+	0.0000009	0.0000058	3
NC_003923.1	3650	+	0.0000000	0.0000000	3
NC_003923.1	3652	+	0.0000001	0.0000006	3
NC_003923.1	4159	-	0.0000194	0.0001102	4
NC_003923.1	10687	-	0.0000000	0.0000000	5
NC_003923.1	10855	+	0.0000004	0.0000030	6
NC_003923.1	12473	+	0.0000000	0.0000000	7
NC_003923.1	14672	+	0.0000634	0.0003335	8
NC_003923.1	15797	+	0.0000000	0.0000000	9
NC_003923.1	17356	+	0.0000000	0.0000000	10
NC_003923.1	17357	+	0.0000000	0.0000000	10

Number of cluster of TSSs identified with an adjusted p-value (fdr) < 0.01: 1923.