Co-translational exoribonucleolytic degradation profiling in A. thaliana using 5’ end RNA-Seq data

Complete source of this page is available in the notebook Degradation_profiling_At.Rmd.

library(tidyverse)
time_start <- now()
library(ggpubr)
library(rnaends)

Co-translational exoribonucleolytic degradation analysis using 5’-end RNA-Seq data

Based on the publication from Cheng-Yu Hou, Wen-Chi Lee, Hsiao-Chun Chou, Ai-Ping Chen, Shu-Jen Chou, and Ho-Ming Chen (2016) Global Analysis of Truncated RNA Ends Reveals New Insights into Ribosome Stalling in Plants, Plant Cell, https://doi.org/10.1105/tpc.16.00295.

Download datasets

Data generated from this article are available on Gene Expression Omnibus (GEO) database: GSE77549

For this notebook, we will only use one sample corresonding to inflorescences of wild-type Arabidopsis from which total RNA was extacted and isolated for PARE library construction and sequencing (GSM2253889). The raw sequencing data are downloaded in the “extdata” directory of the GitLab project in the subdirectory “Athaliana_Hou” using the NCBI fasterq-dump tool from NCBI:

fasterq-dump SRR3143655 --outdir ../extdata/Athaliana_Hou

Pre-processing

Parse reads (+ read features)

In order to make reads mappable, we must only keep the first 20 nucleotides as mentionned in the Material&Methods of the original article.

This is done with the parse_read function that validates the sequence composition (only A, T, C and G), and extracts the mappable part into a new FASTQ file for mapping to the reference sequences.

parse_reads utilizes a read_features table that describes each element that should be found in the reads. In this case, there is only one read feature: the first 20 nucleotides of the reads.

rf <- init_read_features(start = 1, width = 20)
rf %>%
  knitr::kable()

name

start

width

pattern_type

pattern

max_mismatch

readid_prepend

readseq

1

20

nucleotides

ACTG

0

FALSE

Now, we can use the parse_reads function on the raw reads FASTQ file using the read_features table as parameter:

parse_reads("../extdata/Athaliana_Hou/SRR3143655.fastq.gz",
            yieldsize = 1e6, # Number of reads parsed at each iteration
            features = rf, # The features table
            out_dir = "../extdata/Athaliana_Hou/parsed_reads",
            keep_invalid_reads = FALSE, # We don't want to store invalid reads
            force = FALSE,
) %>%
  knitr::kable()

is_valid_readseq

count

FALSE

35864

TRUE

56567785

We observe that, from the 56,603,649 total reads, only 35,864 were discarded, and we are left with more than 56 millions reads that can be used further.

Quantification

Reads are now ready to be aligned on the reference sequences using the map_fastq function.

Reference sequences

For exoribonucleolytic degradation analysis, and particularly for periodicity analysis, we need to map the read sequences to protein coding sequences. Moreover, to study the 5’ends of transcripts around the START and STOP codon, we also need the 5’UTR and the 3’UTR.

To perform this, we will use the data from the TAIR10 database, and more precisely the following files avaliable for download:

These files are availabe (gzipped) in the extdata/Athaliana_Hou/reference_sequences directory of the GitLab project.

Mapping to reference sequences

The map_fastq function is called with the reads preprocessed previously in SRR3143655_valid.fastq.gz and with the reference sequences (cDNAs). The bowtie aligner is specified. For the mapping, we specify that one mixmatch is allowed in the alignments, and that reads should map to a unique location (map_unique = TRUE).

bf <- map_fastq(
  "../extdata/Athaliana_Hou/parsed_reads/SRR3143655_valid.fastq.gz", 
  index_name = "../extdata/Athaliana_Hou/reference_sequences/TAIR10_cdna_20101214.bt", # Path of the genome index (created if not exists)
  fasta_name = "../extdata/Athaliana_Hou/reference_sequences/TAIR10_cdna_20101214.gz", # Path of the reference genome file
  bam_dir = "../extdata/Athaliana_Hou/mapping_results", # Output directory
  max_map_mismatch = 1,
  aligner = "bowtie", # Either bowtie or subread
  map_unique = TRUE,
  force = FALSE,
  threads = 3
)
#> Warning in FUN(X[[i]], ...): Index directory already exists remove it if you
#> want to recreate it

Quantification

In order to obtain counts by position, we use the quantify_from_bam_files function to parse previously obtained BAM file in bf.

This function allows the quantification of either the 5’-ends (quantify_5prime) or the 3’-ends (quantify_3prime), in both cases, it is an exact position quantification.

count_table <- quantify_from_bam_files(
    bam_files = bf,
    sample_ids = c("inflorescence"),
    fun = quantify_5prime, # Either quantify_5prime or quantify_3prime
    features = rf,
    force = FALSE) 
count_table %>% 
  head(15) %>%
  knitr::kable()

sample_id

rname

position

strand

count

inflorescence

AT1G50920.1

1

+

1

inflorescence

AT1G50920.1

2

+

2

inflorescence

AT1G50920.1

7

+

1

inflorescence

AT1G50920.1

8

+

2

inflorescence

AT1G50920.1

12

+

2

inflorescence

AT1G50920.1

22

+

1

inflorescence

AT1G50920.1

27

+

1

inflorescence

AT1G50920.1

30

+

1

inflorescence

AT1G50920.1

36

+

1

inflorescence

AT1G50920.1

37

+

1

inflorescence

AT1G50920.1

46

+

1

inflorescence

AT1G50920.1

47

+

1

inflorescence

AT1G50920.1

59

+

1

inflorescence

AT1G50920.1

60

+

1

inflorescence

AT1G50920.1

68

+

1

The above count table will allow us to perform some downstream analysis.

Downstream analysis

Annotation table preparation

In order to retrieve the number of reads mapped on the CDSs, we need to have access to the position (start and stop) of these CDSs.

We will use the BioStrings library to load the TAIR10 FASTQ files, and determine the lengths of the 5’ and 3’ UTRs of the cDNAs.

library(Biostrings)

Load all reference sequences (cDNAs) from FASTA:

cdna_ref <- readDNAStringSet("../extdata/Athaliana_Hou/reference_sequences/TAIR10_cdna_20101214.gz")

Extract names from FASTA descriptions:

cdna <- cdna_ref %>%
  names %>%
  tibble(fa=.) %>%
  separate(fa, sep=" \\| ", into=c("rname", "symbols", "desc", "loc")) %>%
  mutate(len = str_extract(loc, "LENGTH=\\d+"),
         len = as.numeric(str_extract(len, "\\d+$")))

Then, we use 5’ and 3’ UTR FASTA files to determine their length, and thus the location of START and STOP codons on cDNAs:

utr5 <- readDNAStringSet("../extdata/Athaliana_Hou/reference_sequences/TAIR10_5_utr_20101028.gz")
utr5 <- utr5 %>%
  names %>%
  tibble(fa=.) %>%
  separate(fa, sep=" \\| ", into=c("rname", "ranges", "loc")) %>%
  mutate(utr5len = str_extract(loc, "LENGTH=\\d+"),
         utr5len = as.numeric(str_extract(utr5len, "\\d+$"))) %>%
  select(rname, utr5len)

Same on the 3’UTR:

utr3 <- readDNAStringSet("../extdata/Athaliana_Hou/reference_sequences/TAIR10_3_utr_20101028.gz")
utr3 <- utr3 %>%
  names %>%
  tibble(fa=.) %>%
  separate(fa, sep=" \\| ", into=c("rname", "ranges", "loc")) %>%
  mutate(utr3len = str_extract(loc, "LENGTH=\\d+"),
         utr3len = as.numeric(str_extract(utr3len, "\\d+$"))) %>%
  select(rname, utr3len)

We merge all the information in the cdna tibble:

cdna <- cdna %>%
  left_join(utr5, by = join_by(rname)) %>%
  left_join(utr3, by = join_by(rname)) %>%
  replace_na(list(utr5len = 0, utr3len=0))

And, we reformat these data into the annot table that will be used for analysing regions around features (START and STOP codons), and periodicity on the CDSs.

annot <- tibble(
  rname = cdna$rname,
  feature = "CDS",
  strand = "+",
  start = 1 + cdna$utr5len,
  end = cdna$len   - cdna$utr3len,
  annotation = names(cdna_ref)
)
annot %>%
  head(15) %>%
  knitr::kable()

rname

feature

strand

start

end

annotation

AT1G51370.2

CDS

+

1

1041

AT1G51370.2 | Symbols: | F-box/RNI-like/FBD-like domains-containing protein | chr1:19045615-19046825 FORWARD LENGTH=1118

AT1G50920.1

CDS

+

119

2134

AT1G50920.1 | Symbols: | Nucleolar GTP-binding protein | chr1:18870139-18872930 FORWARD LENGTH=2394

AT1G36960.1

CDS

+

1

546

AT1G36960.1 | Symbols: | unknown protein; BEST Arabidopsis thaliana protein match is: unknown protein (TAIR:AT1G48095.1); Has 54 Blast hits to 54 proteins in 2 species: Archae - 0; Bacteria - 0; Metazoa - 0; Fungi - 0; Plants - 54; Viruses - 0; Other Eukaryotes - 0 (source: NCBI BLink). | chr1:14014796-14015508 FORWARD LENGTH=546

AT1G44020.1

CDS

+

1

1734

AT1G44020.1 | Symbols: | Cysteine/Histidine-rich C1 domain family protein | chr1:16716112-16718656 REVERSE LENGTH=2314

AT1G15970.1

CDS

+

375

1433

AT1G15970.1 | Symbols: | DNA glycosylase superfamily protein | chr1:5486319-5488868 REVERSE LENGTH=1658

AT1G73440.1

CDS

+

56

820

AT1G73440.1 | Symbols: | calmodulin-related | chr1:27611363-27612815 FORWARD LENGTH=1453

AT1G75120.1

CDS

+

147

1355

AT1G75120.1 | Symbols: RRA1 | Nucleotide-diphospho-sugar transferase family protein | chr1:28196857-28198802 REVERSE LENGTH=1520

AT1G17600.1

CDS

+

93

3242

AT1G17600.1 | Symbols: | Disease resistance protein (TIR-NBS-LRR class) family | chr1:6052936-6056664 REVERSE LENGTH=3332

AT1G50860.1

CDS

+

1

3846

AT1G50860.1 | Symbols: | transposable element gene | chr1:18850511-18854356 REVERSE LENGTH=3846

AT1G51380.1

CDS

+

79

1257

AT1G51380.1 | Symbols: | DEA(D/H)-box RNA helicase family protein | chr1:19047882-19050162 FORWARD LENGTH=1452

AT1G77370.1

CDS

+

74

466

AT1G77370.1 | Symbols: | Glutaredoxin family protein | chr1:29073843-29074796 FORWARD LENGTH=620

AT1G44090.1

CDS

+

1

1158

AT1G44090.1 | Symbols: ATGA20OX5, GA20OX5 | gibberellin 20-oxidase 5 | chr1:16760677-16762486 REVERSE LENGTH=1158

AT1G10950.1

CDS

+

114

1883

AT1G10950.1 | Symbols: TMN1, AtTMN1 | transmembrane nine 1 | chr1:3659209-3663984 FORWARD LENGTH=2245

AT1G31870.1

CDS

+

161

1846

AT1G31870.1 | Symbols: | unknown protein; FUNCTIONS IN: molecular_function unknown; INVOLVED IN: biological_process unknown; LOCATED IN: cytosol, nucleus; EXPRESSED IN: 25 plant structures; EXPRESSED DURING: 15 growth stages; CONTAINS InterPro DOMAIN/s: Protein of unknown function DUF2050, pre-mRNA-splicing factor (InterPro:IPR018609); Has 13303 Blast hits to 9618 proteins in 489 species: Archae - 8; Bacteria - 444; Metazoa - 7153; Fungi - 2294; Plants - 942; Viruses - 128; Other Eukaryotes - 2334 (source: NCBI BLink). | chr1:11436747-11440109 FORWARD LENGTH=2134

AT1G51360.1

CDS

+

25

657

AT1G51360.1 | Symbols: ATDABB1, DABB1 | dimeric A/B barrel domainS-protein 1 | chr1:19044184-19044900 FORWARD LENGTH=717

Metagene counts

Now that we have the count table and the annotation table, we can compute the metacount profiles around the first nucleotide of the START or STOP codon using the metacounts_feature function.

First, we extract the regions spanning around the feature (START or STOP codon) with metacounts_feature:

start_tb <- annot %>%
  select(rname, strand, position=start)
start_metacounts <- metacounts_feature(count_table, start_tb, width=100) 
start_metacounts %>% 
  head(15) %>%
  knitr::kable()

x

count

-100

0.0057777

-99

0.0058652

-98

0.0058652

-97

0.0060961

-96

0.0060172

-95

0.0056927

-94

0.0056102

-93

0.0056927

-92

0.0062604

-91

0.0057777

-90

0.0059554

-89

0.0055860

-88

0.0057777

-87

0.0059554

-86

0.0054522

Then, we can visualize the metacounts profile:

start_metacounts %>% 
  plot(from = -30, to = 50,
       translation_frame = TRUE,
       geom = "bar",
       x_lab = "Distance to the first nucleotide of start codon")

We perform the same for the STOP codon (adjusting for the position of the 1st nucleotide):

stop_tb <- annot %>%
  select(rname=rname, strand, position=end) %>%
  mutate(position = ifelse(strand=='+', position-2, position+2))
stop_metacounts <- metacounts_feature(count_table, stop_tb, width=100) 
stop_metacounts %>% 
  head(15) %>%
  knitr::kable()

x

count

-100

0.0049187

-99

0.0050808

-98

0.0057159

-97

0.0050104

-96

0.0048753

-95

0.0055718

-94

0.0050231

-93

0.0050231

-92

0.0057656

-91

0.0050692

-90

0.0049882

-89

0.0055718

-88

0.0050734

-87

0.0050614

-86

0.0056914

Metacount profile obtained:

p_stop_metacounts <- stop_metacounts %>% 
  plot(from = -60, to = 10,
       translation_frame = TRUE,
       geom = "bar",
       x_lab = "Distance to the first nucleotide of stop codon")
p_stop_metacounts

Periodicity

With the annotation table annot, we can compute the metacounts profile for all the CDS regions, and use FFT to analyse periodicity in the counts.

cds_tb <- annot %>% 
  select(rname, strand, begin=start, end)
stop_tb <- annot %>%
  select(rname=rname, strand, position=end) %>%
  mutate(position = ifelse(strand=='+', position-2, position+2))
fft_tb <- FFT(count_table = count_table,
              region_tb = cds_tb)
fft_tb %>% 
  head(15) %>%
  knitr::kable()

signal

freq

period

0.0221807

199

3.005025

0.0049231

25

23.920000

0.0046483

24

24.916667

0.0045493

27

22.148148

0.0041448

28

21.357143

0.0038850

30

19.933333

0.0038517

26

23.000000

0.0035768

32

18.687500

0.0035188

29

20.620690

0.0035124

34

17.588235

0.0035022

31

19.290323

0.0034541

55

10.872727

0.0032312

39

15.333333

0.0031251

38

15.736842

0.0031005

37

16.162162

p_fft <- fft_tb %>% 
  filter(period < 7) %>% 
  plot()
p_fft

Frame preference

The third analysis provided in this downstream analysis is the count of reads falling into each translation frame (1, 2 or 3). The frame_preference function takes the same inputs as the 2 previous functions.

frame_pref_tb <- frame_preference(count_table = count_table,
                                  region_tb = cds_tb)
frame_pref_tb %>%
  knitr::kable()

frame

pc

1

31.38806

2

35.97309

3

32.63885

Visualisation

p_frame_pref <- plot(frame_pref_tb) +
  xlab("Nucleotide in codon")
p_frame_pref

Final plot:

library(ggpubr)
p1 <- p_stop_metacounts + theme(legend.position = "none")
p3 <- p_frame_pref + theme(legend.position = "none")
ggarrange(p1, 
          ggarrange(p_fft, p3, nrow=2, labels = c("B","C")),
          widths = c(3,1),
          labels = c("A","")
)


Notebook rendering time and memory:

time_end <- now()
rendering_time <- time_end - time_start
rendering_time
#> Time difference of 15.94422 hours
pryr::mem_used()
#> 1.36 GB