---
title: "Co-translational exoribonucleolytic degradation profiling in A. thaliana using 5' end RNA-Seq data"
output: 
  html_document:
    keep_md: true
vignette: >
  %\VignetteIndexEntry{Degradation_profiling_At}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---


# Co-translational exoribonucleolytic degradation profiling in *A. thaliana* using 5' end RNA-Seq data

Complete source of this page is available in the notebook [Degradation_profiling_At.Rmd](https://gitlab.com/rnaends/rnaends/-/blob/main/vignettes/Degradation_profiling_At.Rmd).


``` r
library(tidyverse)
time_start <- now()
library(ggpubr)
```


``` r
library(rnaends)
```

## Co-translational exoribonucleolytic degradation analysis using 5’-end RNA-Seq data

Based on the publication from Cheng-Yu Hou, Wen-Chi Lee, Hsiao-Chun Chou, Ai-Ping Chen, Shu-Jen Chou, and Ho-Ming Chen (2016) Global Analysis of Truncated RNA Ends Reveals New Insights into Ribosome Stalling in Plants, Plant Cell, [https://doi.org/10.1105/tpc.16.00295](https://doi.org/10.1105/tpc.16.00295).

## Download datasets

Data generated from this article are available on Gene Expression Omnibus (GEO) database: [GSE77549](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE77549)

For this notebook, we will only use one sample corresonding to inflorescences of wild-type Arabidopsis from which total RNA was extacted and isolated for PARE library construction and sequencing ([GSM2253889](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2253889)). 
The raw sequencing data are downloaded in the "extdata" directory of the GitLab project in the subdirectory "Athaliana_Hou" using the NCBI `fasterq-dump` tool from NCBI:


``` bash
fasterq-dump SRR3143655 --outdir ../extdata/Athaliana_Hou
```


<!--

Plant cell: https://academic.oup.com/plcell/article/28/10/2398/6098378

BioProject: 

GSE: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE77549

GSM: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2253889

SRX: https://www.ncbi.nlm.nih.gov/sra?term=SRX1559792

SRR: https://trace.ncbi.nlm.nih.gov/Traces?run=SRR3143655

-->


## Pre-processing

### Parse reads (+ read features)

In order to make reads mappable, we must only keep the first 20 nucleotides as mentionned in the Material&Methods of the original article.

This is done with the `parse_read` function that validates the sequence composition (only A, T, C and G), and extracts the mappable part into a new FASTQ file for mapping to the reference sequences.

`parse_reads` utilizes a `read_features` table that describes each element that should be found in the reads. In this case, there is only one read feature: the first 20 nucleotides of the reads.


``` r
rf <- init_read_features(start = 1, width = 20)
rf %>%
  knitr::kable()
```


|name    | start| width|pattern_type |pattern | max_mismatch|readid_prepend |
|:-------|-----:|-----:|:------------|:-------|------------:|:--------------|
|readseq |     1|    20|nucleotides  |ACTG    |            0|FALSE          |


Now, we can use the `parse_reads` function on the raw reads FASTQ file using the `read_features` table as parameter:


``` r
parse_reads("../extdata/Athaliana_Hou/SRR3143655.fastq.gz",
            yieldsize = 1e6, # Number of reads parsed at each iteration
            features = rf, # The features table
            out_dir = "../extdata/Athaliana_Hou/parsed_reads",
            keep_invalid_reads = FALSE, # We don't want to store invalid reads
            force = FALSE,
) %>%
  knitr::kable()
```


|is_valid_readseq |    count|
|:----------------|--------:|
|FALSE            |    35864|
|TRUE             | 56567785|


We observe that, from the 56,603,649 total reads, only 35,864 were discarded, and we are left with more than 56 millions reads that can be used further.

## Quantification

Reads are now ready to be aligned on the reference sequences using the `map_fastq` function.

### Reference sequences


For exoribonucleolytic degradation analysis, and particularly for periodicity analysis, we need to map the read sequences to protein coding sequences. Moreover, to study the 5'ends of transcripts around the START and STOP codon, we also need the 5'UTR and the 3'UTR.

To perform this, we will use the data from the [TAIR10 database](https://www.arabidopsis.org/), and more precisely the following files avaliable for [download](https://www.arabidopsis.org/download/list?dir=Sequences%2FTAIR10_blastsets):

  - [TAIR10_cdna_20101214](https://www.arabidopsis.org/download/file?path=Sequences/TAIR10_blastsets/TAIR10_cdna_20101214): all TAIR10 cDNA sequences
  - [TAIR10_5_utr_20101028](https://www.arabidopsis.org/download/file?path=Sequences/TAIR10_blastsets/TAIR10_5_utr_20101028) and [TAIR10_3_utr_20101028](https://www.arabidopsis.org/download/file?path=Sequences/TAIR10_blastsets/TAIR10_3_utr_20101028): 5' and 3' UTR of each gene.

These files are availabe (gzipped) in the `extdata/Athaliana_Hou/reference_sequences` directory of the GitLab project.


## Mapping to reference sequences

The `map_fastq` function is called with the reads preprocessed previously in `SRR3143655_valid.fastq.gz` and with the reference sequences (cDNAs). The `bowtie` aligner is specified. For the mapping, we specify that one mixmatch is allowed in the alignments, and that reads should map to a unique location (`map_unique = TRUE`).


``` r
bf <- map_fastq(
  "../extdata/Athaliana_Hou/parsed_reads/SRR3143655_valid.fastq.gz", 
  index_name = "../extdata/Athaliana_Hou/reference_sequences/TAIR10_cdna_20101214.bt", # Path of the genome index (created if not exists)
  fasta_name = "../extdata/Athaliana_Hou/reference_sequences/TAIR10_cdna_20101214.gz", # Path of the reference genome file
  bam_dir = "../extdata/Athaliana_Hou/mapping_results", # Output directory
  max_map_mismatch = 1,
  aligner = "bowtie", # Either bowtie or subread
  map_unique = TRUE,
  force = FALSE,
  threads = 3
)
#> Warning in FUN(X[[i]], ...): Index directory already exists remove it if you
#> want to recreate it
```


## Quantification

In order to obtain counts by position, we use the `quantify_from_bam_files` function to parse previously obtained BAM file in `bf`.

This function allows the quantification of either the 5’-ends (`quantify_5prime`) or the 3’-ends (`quantify_3prime`), in both cases, it is an exact position quantification.


``` r
count_table <- quantify_from_bam_files(
    bam_files = bf,
    sample_ids = c("inflorescence"),
    fun = quantify_5prime, # Either quantify_5prime or quantify_3prime
    features = rf,
    force = FALSE) 
count_table %>% 
  head(15) %>%
  knitr::kable()
```


|sample_id     |rname       | position|strand | count|
|:-------------|:-----------|--------:|:------|-----:|
|inflorescence |AT1G50920.1 |        1|+      |     1|
|inflorescence |AT1G50920.1 |        2|+      |     2|
|inflorescence |AT1G50920.1 |        7|+      |     1|
|inflorescence |AT1G50920.1 |        8|+      |     2|
|inflorescence |AT1G50920.1 |       12|+      |     2|
|inflorescence |AT1G50920.1 |       22|+      |     1|
|inflorescence |AT1G50920.1 |       27|+      |     1|
|inflorescence |AT1G50920.1 |       30|+      |     1|
|inflorescence |AT1G50920.1 |       36|+      |     1|
|inflorescence |AT1G50920.1 |       37|+      |     1|
|inflorescence |AT1G50920.1 |       46|+      |     1|
|inflorescence |AT1G50920.1 |       47|+      |     1|
|inflorescence |AT1G50920.1 |       59|+      |     1|
|inflorescence |AT1G50920.1 |       60|+      |     1|
|inflorescence |AT1G50920.1 |       68|+      |     1|


The above `count table` will allow us to perform some downstream analysis.

## Downstream analysis

## Annotation table preparation

In order to retrieve the number of reads mapped on the CDSs, we need to have access to the position (start and stop) of these CDSs.

We will use the BioStrings library to load the TAIR10 FASTQ files, and determine the lengths of the 5' and 3' UTRs of the cDNAs.


``` r
library(Biostrings)
```

Load all reference sequences (cDNAs) from FASTA:

``` r
cdna_ref <- readDNAStringSet("../extdata/Athaliana_Hou/reference_sequences/TAIR10_cdna_20101214.gz")
```

Extract names from FASTA descriptions:

``` r
cdna <- cdna_ref %>%
  names %>%
  tibble(fa=.) %>%
  separate(fa, sep=" \\| ", into=c("rname", "symbols", "desc", "loc")) %>%
  mutate(len = str_extract(loc, "LENGTH=\\d+"),
         len = as.numeric(str_extract(len, "\\d+$")))
```


Then, we use 5' and 3' UTR FASTA files to determine their length, and thus the location of START and STOP codons on cDNAs:

``` r
utr5 <- readDNAStringSet("../extdata/Athaliana_Hou/reference_sequences/TAIR10_5_utr_20101028.gz")
utr5 <- utr5 %>%
  names %>%
  tibble(fa=.) %>%
  separate(fa, sep=" \\| ", into=c("rname", "ranges", "loc")) %>%
  mutate(utr5len = str_extract(loc, "LENGTH=\\d+"),
         utr5len = as.numeric(str_extract(utr5len, "\\d+$"))) %>%
  select(rname, utr5len)
```

Same on the 3'UTR:

``` r
utr3 <- readDNAStringSet("../extdata/Athaliana_Hou/reference_sequences/TAIR10_3_utr_20101028.gz")
utr3 <- utr3 %>%
  names %>%
  tibble(fa=.) %>%
  separate(fa, sep=" \\| ", into=c("rname", "ranges", "loc")) %>%
  mutate(utr3len = str_extract(loc, "LENGTH=\\d+"),
         utr3len = as.numeric(str_extract(utr3len, "\\d+$"))) %>%
  select(rname, utr3len)
```

We merge all the information in the `cdna` tibble:

``` r
cdna <- cdna %>%
  left_join(utr5, by = join_by(rname)) %>%
  left_join(utr3, by = join_by(rname)) %>%
  replace_na(list(utr5len = 0, utr3len=0))
```

And, we reformat these data into the `annot` table that will be used for analysing regions around features (START and STOP codons), and periodicity on the CDSs.

``` r
annot <- tibble(
  rname = cdna$rname,
  feature = "CDS",
  strand = "+",
  start = 1 + cdna$utr5len,
  end = cdna$len   - cdna$utr3len,
  annotation = names(cdna_ref)
)
annot %>%
  head(15) %>%
  knitr::kable()
```


|rname       |feature |strand | start|  end|annotation                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
|:-----------|:-------|:------|-----:|----:|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|AT1G51370.2 |CDS     |+      |     1| 1041|AT1G51370.2 &#124; Symbols:  &#124; F-box/RNI-like/FBD-like domains-containing protein &#124; chr1:19045615-19046825 FORWARD LENGTH=1118                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
|AT1G50920.1 |CDS     |+      |   119| 2134|AT1G50920.1 &#124; Symbols:  &#124; Nucleolar GTP-binding protein &#124; chr1:18870139-18872930 FORWARD LENGTH=2394                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
|AT1G36960.1 |CDS     |+      |     1|  546|AT1G36960.1 &#124; Symbols:  &#124; unknown protein; BEST Arabidopsis thaliana protein match is: unknown protein (TAIR:AT1G48095.1); Has 54 Blast hits to 54 proteins in 2 species: Archae - 0; Bacteria - 0; Metazoa - 0; Fungi - 0; Plants - 54; Viruses - 0; Other Eukaryotes - 0 (source: NCBI BLink). &#124; chr1:14014796-14015508 FORWARD LENGTH=546                                                                                                                                                                                                                                               |
|AT1G44020.1 |CDS     |+      |     1| 1734|AT1G44020.1 &#124; Symbols:  &#124; Cysteine/Histidine-rich C1 domain family protein &#124; chr1:16716112-16718656 REVERSE LENGTH=2314                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
|AT1G15970.1 |CDS     |+      |   375| 1433|AT1G15970.1 &#124; Symbols:  &#124; DNA glycosylase superfamily protein &#124; chr1:5486319-5488868 REVERSE LENGTH=1658                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
|AT1G73440.1 |CDS     |+      |    56|  820|AT1G73440.1 &#124; Symbols:  &#124; calmodulin-related &#124; chr1:27611363-27612815 FORWARD LENGTH=1453                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
|AT1G75120.1 |CDS     |+      |   147| 1355|AT1G75120.1 &#124; Symbols: RRA1 &#124; Nucleotide-diphospho-sugar transferase family protein &#124; chr1:28196857-28198802 REVERSE LENGTH=1520                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
|AT1G17600.1 |CDS     |+      |    93| 3242|AT1G17600.1 &#124; Symbols:  &#124; Disease resistance protein (TIR-NBS-LRR class) family &#124; chr1:6052936-6056664 REVERSE LENGTH=3332                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
|AT1G50860.1 |CDS     |+      |     1| 3846|AT1G50860.1 &#124; Symbols:  &#124; transposable element gene &#124; chr1:18850511-18854356 REVERSE LENGTH=3846                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
|AT1G51380.1 |CDS     |+      |    79| 1257|AT1G51380.1 &#124; Symbols:  &#124; DEA(D/H)-box RNA helicase family protein &#124; chr1:19047882-19050162 FORWARD LENGTH=1452                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
|AT1G77370.1 |CDS     |+      |    74|  466|AT1G77370.1 &#124; Symbols:  &#124; Glutaredoxin family protein &#124; chr1:29073843-29074796 FORWARD LENGTH=620                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
|AT1G44090.1 |CDS     |+      |     1| 1158|AT1G44090.1 &#124; Symbols: ATGA20OX5, GA20OX5 &#124; gibberellin 20-oxidase 5 &#124; chr1:16760677-16762486 REVERSE LENGTH=1158                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
|AT1G10950.1 |CDS     |+      |   114| 1883|AT1G10950.1 &#124; Symbols: TMN1, AtTMN1 &#124; transmembrane nine 1 &#124; chr1:3659209-3663984 FORWARD LENGTH=2245                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
|AT1G31870.1 |CDS     |+      |   161| 1846|AT1G31870.1 &#124; Symbols:  &#124; unknown protein; FUNCTIONS IN: molecular_function unknown; INVOLVED IN: biological_process unknown; LOCATED IN: cytosol, nucleus; EXPRESSED IN: 25 plant structures; EXPRESSED DURING: 15 growth stages; CONTAINS InterPro DOMAIN/s: Protein of unknown function DUF2050, pre-mRNA-splicing factor (InterPro:IPR018609); Has 13303 Blast hits to 9618 proteins in 489 species: Archae - 8; Bacteria - 444; Metazoa - 7153; Fungi - 2294; Plants - 942; Viruses - 128; Other Eukaryotes - 2334 (source: NCBI BLink). &#124; chr1:11436747-11440109 FORWARD LENGTH=2134 |
|AT1G51360.1 |CDS     |+      |    25|  657|AT1G51360.1 &#124; Symbols: ATDABB1, DABB1 &#124; dimeric A/B barrel domainS-protein 1 &#124; chr1:19044184-19044900 FORWARD LENGTH=717                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |


## Metagene counts


Now that we have the count table and the annotation table, we can compute the metacount profiles around the first nucleotide of the START or STOP codon using the `metacounts_feature` function.

First, we extract the regions spanning around the feature (START or STOP codon) with `metacounts_feature`:


``` r
start_tb <- annot %>%
  select(rname, strand, position=start)
start_metacounts <- metacounts_feature(count_table, start_tb, width=100) 
start_metacounts %>% 
  head(15) %>%
  knitr::kable()
```


|    x|     count|
|----:|---------:|
| -100| 0.0057777|
|  -99| 0.0058652|
|  -98| 0.0058652|
|  -97| 0.0060961|
|  -96| 0.0060172|
|  -95| 0.0056927|
|  -94| 0.0056102|
|  -93| 0.0056927|
|  -92| 0.0062604|
|  -91| 0.0057777|
|  -90| 0.0059554|
|  -89| 0.0055860|
|  -88| 0.0057777|
|  -87| 0.0059554|
|  -86| 0.0054522|


Then, we can visualize the metacounts profile:

``` r
start_metacounts %>% 
  plot(from = -30, to = 50,
       translation_frame = TRUE,
       geom = "bar",
       x_lab = "Distance to the first nucleotide of start codon")
```

![](../../vignettes/Degradation_profiling_At_files/figure-html/plot_around_start_At-1.png)<!-- -->


We perform the same for the STOP codon (adjusting for the position of the 1st nucleotide):


``` r
stop_tb <- annot %>%
  select(rname=rname, strand, position=end) %>%
  mutate(position = ifelse(strand=='+', position-2, position+2))
stop_metacounts <- metacounts_feature(count_table, stop_tb, width=100) 
stop_metacounts %>% 
  head(15) %>%
  knitr::kable()
```


|    x|     count|
|----:|---------:|
| -100| 0.0049187|
|  -99| 0.0050808|
|  -98| 0.0057159|
|  -97| 0.0050104|
|  -96| 0.0048753|
|  -95| 0.0055718|
|  -94| 0.0050231|
|  -93| 0.0050231|
|  -92| 0.0057656|
|  -91| 0.0050692|
|  -90| 0.0049882|
|  -89| 0.0055718|
|  -88| 0.0050734|
|  -87| 0.0050614|
|  -86| 0.0056914|


Metacount profile obtained:

``` r
p_stop_metacounts <- stop_metacounts %>% 
  plot(from = -60, to = 10,
       translation_frame = TRUE,
       geom = "bar",
       x_lab = "Distance to the first nucleotide of stop codon")
p_stop_metacounts
```

![](../../vignettes/Degradation_profiling_At_files/figure-html/plot_around_stop_At-1.png)<!-- -->


## Periodicity

With the annotation table `annot`, we can compute the metacounts profile for all the CDS regions, and use FFT to analyse periodicity in the counts.


``` r
cds_tb <- annot %>% 
  select(rname, strand, begin=start, end)
stop_tb <- annot %>%
  select(rname=rname, strand, position=end) %>%
  mutate(position = ifelse(strand=='+', position-2, position+2))
fft_tb <- FFT(count_table = count_table,
              region_tb = cds_tb)
fft_tb %>% 
  head(15) %>%
  knitr::kable()
```


|    signal| freq|    period|
|---------:|----:|---------:|
| 0.0221807|  199|  3.005025|
| 0.0049231|   25| 23.920000|
| 0.0046483|   24| 24.916667|
| 0.0045493|   27| 22.148148|
| 0.0041448|   28| 21.357143|
| 0.0038850|   30| 19.933333|
| 0.0038517|   26| 23.000000|
| 0.0035768|   32| 18.687500|
| 0.0035188|   29| 20.620690|
| 0.0035124|   34| 17.588235|
| 0.0035022|   31| 19.290323|
| 0.0034541|   55| 10.872727|
| 0.0032312|   39| 15.333333|
| 0.0031251|   38| 15.736842|
| 0.0031005|   37| 16.162162|


``` r
p_fft <- fft_tb %>% 
  filter(period < 7) %>% 
  plot()
p_fft
```

![](../../vignettes/Degradation_profiling_At_files/figure-html/plot_fft_At-1.png)<!-- -->


## Frame preference

The third analysis provided in this downstream analysis is the count of reads falling into each translation frame (1, 2 or 3). The `frame_preference` function takes the same inputs as the 2 previous functions.


``` r
frame_pref_tb <- frame_preference(count_table = count_table,
                                  region_tb = cds_tb)
frame_pref_tb %>%
  knitr::kable()
```


| frame|       pc|
|-----:|--------:|
|     1| 31.38806|
|     2| 35.97309|
|     3| 32.63885|


Visualisation

``` r
p_frame_pref <- plot(frame_pref_tb) +
  xlab("Nucleotide in codon")
p_frame_pref
```

![](../../vignettes/Degradation_profiling_At_files/figure-html/plot_frame_pref_At-1.png)<!-- -->


Final plot:

``` r
library(ggpubr)
p1 <- p_stop_metacounts + theme(legend.position = "none")
p3 <- p_frame_pref + theme(legend.position = "none")
ggarrange(p1, 
          ggarrange(p_fft, p3, nrow=2, labels = c("B","C")),
          widths = c(3,1),
          labels = c("A","")
)
```

![](../../vignettes/Degradation_profiling_At_files/figure-html/combined_plot_metacounts_At-1.png)<!-- -->


_____


Notebook rendering time and memory:

``` r
time_end <- now()
rendering_time <- time_end - time_start
rendering_time
#> Time difference of 15.94422 hours
pryr::mem_used()
#> 1.36 GB
```