---
title: "TSS identification using 5'-end RNA-Seq data"
output: 
  html_document:
    keep_md: true
vignette: >
  %\VignetteIndexEntry{TSS_identification}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
editor_options: 
  chunk_output_type: inline
---


# TSS identification using 5'-end RNA-Seq data

Complete source of this page is available in the notebook [TSS_identification.Rmd](https://gitlab.com/rnaends/rnaends/-/blob/main/vignettes/TSS_identification.Rmd).


``` r
library(tidyverse)
```


``` r
library(rnaends)
```

Based on the publication from Prados J, Linder P, Redder P (2016) TSS-EMOTE, a refined protocol for a more complete and less biased global mapping of transcription start sites in bacterial pathogens. BMC Genomics, 17, 849. [https://doi.org/10.1186/s12864-016-3211-3](https://doi.org/10.1186/s12864-016-3211-3).

The library preparation protocol is as follows. Briefly, in this protocol, total RNA is purified and digested with a 5’-exonuclease which removes the vast majority of 5’ mono-phosphorylated RNAs. The RNA is mixed with a synthetic RNA oligo (specific adapter) and split into two pools. In the first pool (noted +RppH), both a ligase and an RNA 5’ pyrophosphohydrolase RppH enzymes are added, while in the second pool (labeled -RppH) only the ligase is added. RppH converts the 5’PPP to 5’P that can be used by the ligase to add the adapter oligo to the RNA sequences (for detailed protocol see ([Prados et al., 2016](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-3211-3))). The “-RppH” pool serves as control to measure the level of background ligation and avoid false positive TSS detection. Thus, a given genome position will be predicted as a TSS if the number of reads from the “+RppH” pool mapping at that position is significantly higher than the number of reads from the “-RppH” pool by applying a beta-binomial test as described in (Prados et al., 2016)

## Download datasets

Data from this article are available on Gene Expression Omnibus (GEO) database: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85110

For this notebook, only 4 samples will be used that correspond to 5'-end RNA-Seq data from *Staphylococcus aureus* (MW2 strain) grown on RPMI liquid at 37°C (GSM2257851, GSM2257852, GSM2257859, GSM2257860).

We download data directly in a the *data* directory of the GitLab project in the sub-directory called *TSS-EMOTE_Prados* where the reference genome is already provided when the project is cloned. 


``` r
options(timeout = 3600) # for download time, set more if timeout occurs

list_SRR <- c("SRR3994381", "SRR3994382", "SRR3994390", "SRR3994389")

for (i in list_SRR){
    raw_fastq <- paste0("../extdata/TSS-EMOTE_Prados/", i, ".fastq.gz")
    if (!file.exists(raw_fastq))
        download.file(paste0("https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=",i),
                      destfile = raw_fastq)
}
```

## Pre-processing of 5'-end RNA-Seq reads

### Parse reads (+ read features)

Reads are not ready-to-map yet since they still have an adapter (the RP6 oligonucleotide) located at the 5'-end of each read.

We use `parse_reads`, in order to check that each read contains the expected elements in the adapter, the remove this oligonucleotide from the reads (see RP6 oligo TSS-EMOTE). 

`parse_reads` requires a `read_features` table which defines the expected elements on the read. 

Following the Materials & Methods section of the article, the structure of the reads can be schematized as follows (X is the mappable part):
```
      1       2            5      6            20     21    27      28 30     31                  50
      N     - (EMOTE barcode) -   CGGCACCAACCGAGG    - VVVVVVV -     CGC     - XXXXXXXXXXXXXXXXXXXXX
1 random nt -  EMOTE barcode  -   recognition.seq    -   UMI   - control.seq -     RNA 5'-end
```

We use `init_read_features` and `add_read_feature` functions to create the `read_features` table following the above structure.

The initialization of the `read_features` table is made by `init_read_features` since the only required feature is the *readseq* feature (the mappable part to keep in the FASTQ file).

According to our scheme, we know that the *readseq* part begins at location 31 and are 20 nucleotides long (illumina 50bp).

``` r
rf <- init_read_features(start = 31, width = 20)
```

We add other features such as the *barcode* which begins at position 2 and are 4 bases long with the following specifications:

- *pattern* : enables to specify which barcode sequence(s) are allowed, here we allowed 4 different sequences
- *pattern_type*: is used to know which type of pattern it is, "constant" means that we check the exact *pattern* sequences allowing *N* number of mismatches (specified by the *max_mismatch* parameter)
- *readid_prepend*: is used to store the retrieved feature in the read identifier. It will be useful to keep a track of features since they are removed from the reads and therefore absent from the FASTQ output file.


``` r
rf <- add_read_feature(rf,
                       name = "barcode",
                       start = 2,
                       width = 4,
                       pattern = c("TCGG","CAAG","CGTC","AAGT"),
                       pattern_type = "constant",
                       max_mismatch = 0,
                       readid_prepend = FALSE)
```

The *recognition sequence* feature begins at position 6 and is 15 bases long, and we expect a single unique sequence with maximum 1 mismatch:

``` r
rf <- add_read_feature(rf,
                       name = "RS",
                       start = 6,
                       width = 15,
                       pattern = "CGGCACCAACCGAGG",
                       pattern_type = "constant",
                       max_mismatch = 1,
                       readid_prepend = FALSE)
```

The *UMI* feature begins at position 21 and is 7 bases long. Its pattern type is different from the previous ones because: as UMIs are random sequences, we do not expect a particular sequence (nor list of sequences) but a sequence of random characters (A, C or G) hence the *patern_type = "nucleotides"*. We set *readid_prepend = TRUE* because we will need the UMIs to remove PCR duplicates introduced by the library preparation protocol.

**Note:** the *readseq* feature has the same *pattern_type* than UMI and is set by default to A,C,T or G.


``` r
rf <- add_read_feature(rf,
                       name = "UMI",
                       start = 21,
                       width = 7,
                       pattern = "ACG",
                       pattern_type = "nucleotides",
                       max_mismatch = 0,
                       readid_prepend = TRUE)
```

The *control sequence* feature begins at position 28 and is 3 bases long, and we expect a specific and unique sequence:

``` r
rf <- add_read_feature(rf,
                       name = "CS",
                       start = 28,
                       width = 3,
                       pattern = "CGC",
                       pattern_type = "constant",
                       max_mismatch = 0,
                       readid_prepend = FALSE)
```

The resulting *read_features* table is:

``` r
rf %>%
  knitr::kable()
```


|name    | start| width|pattern_type |pattern      | max_mismatch|readid_prepend |
|:-------|-----:|-----:|:------------|:------------|------------:|:--------------|
|readseq |    31|    20|nucleotides  |ACTG         |            0|FALSE          |
|barcode |     2|     4|constant     |TCGG, CA.... |            0|FALSE          |
|RS      |     6|    15|constant     |CGGCACCA.... |            1|FALSE          |
|UMI     |    21|     7|nucleotides  |ACG          |            0|TRUE           |
|CS      |    28|     3|constant     |CGC          |            0|FALSE          |


Now that we have built the *read_features* table, we can parse the 4 FASTQ files with the `parse_reads` function.

**Note:** `mclapply` function from the `parallel` package is used to perform all the 4 processing in parallel but `lapply` can be used instead.


``` r
parsing_report <- lapply(
    list.files("../extdata/TSS-EMOTE_Prados/", pattern = ".fastq.gz$", full.names = TRUE),
    FUN = parse_reads,
    features = rf, # The read_features table
    keep_invalid_reads = FALSE, # do not store invalid reads
    out_dir = "../extdata/TSS-EMOTE_Prados/parse_results", # Path of the out directory (optional)
    force = FALSE
)
#> Demux report already exists set force=T to overwrite: ../extdata/TSS-EMOTE_Prados/parse_results/SRR3994381_parse_report.csv
#> Demux report already exists set force=T to overwrite: ../extdata/TSS-EMOTE_Prados/parse_results/SRR3994382_parse_report.csv
#> Demux report already exists set force=T to overwrite: ../extdata/TSS-EMOTE_Prados/parse_results/SRR3994389_parse_report.csv
#> Demux report already exists set force=T to overwrite: ../extdata/TSS-EMOTE_Prados/parse_results/SRR3994390_parse_report.csv
```


``` r
parsing_report %>%
  bind_rows %>%
  knitr::kable()
```


|is_valid_readseq |is_valid_barcode |is_valid_RS |is_valid_UMI |is_valid_CS |   count|
|:----------------|:----------------|:-----------|:------------|:-----------|-------:|
|FALSE            |FALSE            |TRUE        |TRUE         |FALSE       |       1|
|FALSE            |TRUE             |TRUE        |TRUE         |TRUE        |    1444|
|TRUE             |FALSE            |FALSE       |TRUE         |FALSE       |     272|
|TRUE             |FALSE            |TRUE        |TRUE         |FALSE       |    2668|
|TRUE             |TRUE             |TRUE        |TRUE         |TRUE        | 5189855|
|FALSE            |FALSE            |FALSE       |TRUE         |FALSE       |       3|
|FALSE            |FALSE            |TRUE        |TRUE         |FALSE       |       3|
|FALSE            |TRUE             |TRUE        |TRUE         |TRUE        |    2565|
|TRUE             |FALSE            |FALSE       |TRUE         |FALSE       |    3166|
|TRUE             |FALSE            |TRUE        |TRUE         |FALSE       |   21274|
|TRUE             |TRUE             |TRUE        |TRUE         |FALSE       |    1320|
|TRUE             |TRUE             |TRUE        |TRUE         |TRUE        | 8561052|
|FALSE            |FALSE            |FALSE       |TRUE         |FALSE       |       4|
|FALSE            |TRUE             |TRUE        |TRUE         |TRUE        |    2378|
|TRUE             |FALSE            |FALSE       |TRUE         |FALSE       |    2789|
|TRUE             |FALSE            |TRUE        |TRUE         |FALSE       |    1308|
|TRUE             |TRUE             |TRUE        |TRUE         |TRUE        | 3177298|
|FALSE            |FALSE            |FALSE       |TRUE         |FALSE       |       1|
|FALSE            |FALSE            |TRUE        |TRUE         |FALSE       |       1|
|FALSE            |TRUE             |TRUE        |TRUE         |TRUE        |    2876|
|TRUE             |FALSE            |FALSE       |TRUE         |FALSE       |    2924|
|TRUE             |FALSE            |TRUE        |TRUE         |FALSE       |    1237|
|TRUE             |TRUE             |TRUE        |TRUE         |TRUE        | 3774778|


The parsing_report provides information on the validity of reads in each FASTQ file.


We can aggregate and visualize the global statititics for all reports with an upset plot:

``` r
parsing_report %>%
  bind_rows %>%
  group_by(across(starts_with("is_valid_"))) %>%
  summarise(count = sum(count), .groups = "drop") %>%
  reads_parse_report_ggupset(angle = 45, margin=0.3)
```

![](../../vignettes/TSS_identification_files/figure-html/unnamed-chunk-12-1.png)<!-- -->


## Quantification

### Mapping

Reads are now ready-to-map on the reference genome using the `map_fastq_files` function.

**Note:** here, we use the function that allow the mapping of multiple FASTQ files in parallel which is a wrapper that calls the `map_fastq` function for each of the FASTQ files provided.


``` r
bf <- map_fastq_files(
  list_fastq_files = list.files("../extdata/TSS-EMOTE_Prados/parse_results", 
                                pattern = "_valid.fastq.gz$", 
                                full.names = TRUE),
  index_name = "../extdata/TSS-EMOTE_Prados/MW2.sr", # Path of the genome index (created if not exists)
  fasta_name = "../extdata/TSS-EMOTE_Prados/MW2.fasta", # Path of the reference genome file
  bam_dir = "../extdata/TSS-EMOTE_Prados/mapping_results_subread", # Output directory
  max_map_mismatch = 1,
  aligner = "subread", # either bowtie or subread
  map_unique = FALSE,
  force = FALSE,
  threads = 3)
#> bam_file '../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994381_valid.fastq.gz.bam' already exists, set force=TRUE to recompute
#> bam_file '../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994382_valid.fastq.gz.bam' already exists, set force=TRUE to recompute
#> bam_file '../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994389_valid.fastq.gz.bam' already exists, set force=TRUE to recompute
#> bam_file '../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994390_valid.fastq.gz.bam' already exists, set force=TRUE to recompute
bf
#> [[1]]
#> [1] "../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994381_valid.fastq.gz.bam"
#> 
#> [[2]]
#> [1] "../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994382_valid.fastq.gz.bam"
#> 
#> [[3]]
#> [1] "../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994389_valid.fastq.gz.bam"
#> 
#> [[4]]
#> [1] "../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994390_valid.fastq.gz.bam"
```

#### Mapping statistics

Use the `map_stats_bam_files` function to get some basic statistics on the mapping we've just performed:

``` r
map_stats_bam_files(bam_files = bf) %>%
  knitr::kable()
#> BAM file mapping stats exists '../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994381_valid.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute
#> BAM file mapping stats exists '../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994382_valid.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute
#> BAM file mapping stats exists '../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994389_valid.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute
#> BAM file mapping stats exists '../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994390_valid.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute
```


|bam_file                      |  mapped| unmapped| pc_mapped|
|:-----------------------------|-------:|--------:|---------:|
|SRR3994381_valid.fastq.gz.bam | 2001012|  3188843|     38.56|
|SRR3994382_valid.fastq.gz.bam | 3436440|  5124612|     40.14|
|SRR3994389_valid.fastq.gz.bam |  781940|  2395358|     24.61|
|SRR3994390_valid.fastq.gz.bam | 1433740|  2341038|     37.98|


### Quantification (counts computation)

In order to obtain counts per position, we use the `quantify_from_bam_files` function to parse previously obtained `bam_files`.

This function allows the quantification of either the 5'-ends (`quantify_5prime`) or the 3'-ends (`quantify_3prime`), in both cases, it is an exact position quantification.

For UMIs to be taken into account for the quantification, we provide the `read_features` table with the *"features"* parameter. 
The *keep_UMI* parameter is used to record the number of copy of each UMI by exact position. This level of information is not needed for this analysis and we use it as default (FALSE). 


``` r
count_table <- quantify_from_bam_files(bam_files = bf,
                                       sample_ids = c("rep1-", "rep1+", "rep2-", "rep2+"),
                                       fun = quantify_5prime, # Either quantify_5prime or quantify_3prime
                                       features = rf,
                                       keep_UMI = FALSE)
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994381_valid.fastq.gz.bam.csv
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994382_valid.fastq.gz.bam.csv
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994389_valid.fastq.gz.bam.csv
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/TSS-EMOTE_Prados/mapping_results_subread/SRR3994390_valid.fastq.gz.bam.csv
count_table %>% 
  head(15) %>%
  knitr::kable()
```


|sample_id |rname       | position|strand | reads| count|
|:---------|:-----------|--------:|:------|-----:|-----:|
|rep1-     |NC_003923.1 |      313|+      |     2|     2|
|rep1-     |NC_003923.1 |      425|+      |    34|     9|
|rep1-     |NC_003923.1 |      535|+      |    11|     1|
|rep1-     |NC_003923.1 |      637|+      |     1|     1|
|rep1-     |NC_003923.1 |      689|+      |     1|     1|
|rep1-     |NC_003923.1 |      791|+      |     2|     1|
|rep1-     |NC_003923.1 |      868|+      |     2|     2|
|rep1-     |NC_003923.1 |      911|+      |     2|     1|
|rep1-     |NC_003923.1 |      918|+      |     6|     1|
|rep1-     |NC_003923.1 |     1009|+      |     3|     1|
|rep1-     |NC_003923.1 |     1166|+      |     2|     1|
|rep1-     |NC_003923.1 |     1309|+      |    10|     1|
|rep1-     |NC_003923.1 |     1310|+      |     3|     1|
|rep1-     |NC_003923.1 |     1357|-      |     1|     1|
|rep1-     |NC_003923.1 |     1890|+      |     1|     1|


The above *count table* will allow us to perform some downstream analysis, *i.e.* TSS identification.

**Note:** here, the *count* column corresponds to the number of different UMIs mapped at this exact position. This is why it is lower or equal to the *reads* values (due to PCR duplicates).

## Downstream analysis: TSS identification

### Design table

The identification of TSS requires the comparison of 5'P and 5'P+5'PPP samples with multiple replicates. We thus build a `design table` to identify which samples is what.

The content of the design table is free as it includes the colnames:

- *sample_id* (same as in the count table) 
- *pool* (library preparation for sequencing either "*mono*" for 5'P or "*total*" for 5'P+5'PPP)
- *replicate* (p-values are combined by replicate) <br>


``` r
design_tb <- tibble(
  sample_id = c("rep1-", "rep1+", "rep2-", "rep2+"),
  count_of = c("fraction", "total", "fraction", "total" ),
  replicate = c(1, 1, 2, 2)
)
design_tb %>%
  knitr::kable()
```


|sample_id |count_of | replicate|
|:---------|:--------|---------:|
|rep1-     |fraction |         1|
|rep1+     |total    |         1|
|rep2-     |fraction |         2|
|rep2+     |total    |         2|


## Identification of TSS 

We finally use the `betabinom_TSS` function to identify TSSs with the *count table* and the *design table* freshly built as inputs.

**Note:** the filtering of data is made by the function (and not before) because it needs the total number of counts before filtering.


``` r
bbin <- TSS(count_table = count_table,
            design_table = design_tb,
            threshold = 5, # Number minimum reads allowed in the total sample
            combine_method = "fisher", # Method for the p-values combination
            fdr_method = "fdr" # Method for adjusting p-value (see p.adjust)
            )
bbin %>% 
  head(15) %>%
  knitr::kable()
```


|rname       | position|strand |    pvalue| pvalue_adj|
|:-----------|--------:|:------|---------:|----------:|
|NC_003923.1 |      312|+      | 0.0017109|  0.0071046|
|NC_003923.1 |      313|+      | 0.0000000|  0.0000000|
|NC_003923.1 |      314|+      | 0.0000001|  0.0000008|
|NC_003923.1 |     1642|+      | 0.0725677|  0.1682600|
|NC_003923.1 |     2131|+      | 0.0001241|  0.0006330|
|NC_003923.1 |     3649|+      | 0.0000009|  0.0000058|
|NC_003923.1 |     3650|+      | 0.0000000|  0.0000000|
|NC_003923.1 |     3652|+      | 0.0000001|  0.0000006|
|NC_003923.1 |     4128|+      | 0.0025859|  0.0103542|
|NC_003923.1 |     4159|-      | 0.0000194|  0.0001102|
|NC_003923.1 |     4301|+      | 0.1746218|  0.3408392|
|NC_003923.1 |     6734|+      | 0.0182064|  0.0576161|
|NC_003923.1 |    10687|-      | 0.0000000|  0.0000000|
|NC_003923.1 |    10855|+      | 0.0000004|  0.0000030|
|NC_003923.1 |    10856|+      | 0.1536700|  0.3042540|


Number of TSSs identified with an adjusted p-value (FDR) < 0.01: 2677.


### Clusters of TSS

As in the *Prados et al.* paper we can clusterise TSSs by regrouping those less than 5 nucleotides apart.


``` r
bbin <- bbin %>% 
  filter(pvalue_adj < 0.01) %>% 
  arrange(position) %>%
   mutate(cluster_id = cumsum(c(1, diff(position) >= 5)))
bbin %>%
  head(15) %>%
  knitr::kable()
```


|rname       | position|strand |    pvalue| pvalue_adj| cluster_id|
|:-----------|--------:|:------|---------:|----------:|----------:|
|NC_003923.1 |      312|+      | 0.0017109|  0.0071046|          1|
|NC_003923.1 |      313|+      | 0.0000000|  0.0000000|          1|
|NC_003923.1 |      314|+      | 0.0000001|  0.0000008|          1|
|NC_003923.1 |     2131|+      | 0.0001241|  0.0006330|          2|
|NC_003923.1 |     3649|+      | 0.0000009|  0.0000058|          3|
|NC_003923.1 |     3650|+      | 0.0000000|  0.0000000|          3|
|NC_003923.1 |     3652|+      | 0.0000001|  0.0000006|          3|
|NC_003923.1 |     4159|-      | 0.0000194|  0.0001102|          4|
|NC_003923.1 |    10687|-      | 0.0000000|  0.0000000|          5|
|NC_003923.1 |    10855|+      | 0.0000004|  0.0000030|          6|
|NC_003923.1 |    12473|+      | 0.0000000|  0.0000000|          7|
|NC_003923.1 |    14672|+      | 0.0000634|  0.0003335|          8|
|NC_003923.1 |    15797|+      | 0.0000000|  0.0000000|          9|
|NC_003923.1 |    17356|+      | 0.0000000|  0.0000000|         10|
|NC_003923.1 |    17357|+      | 0.0000000|  0.0000000|         10|


Number of cluster of TSSs identified with an adjusted p-value (fdr) < 0.01: 1923.