Co-translational exoribonucleolytic degradation profiling using 5’-end RNA-Seq data

Complete source of this page is available in the notebook Degradation_profiling.Rmd.

library(tidyverse)
library(ggpubr)
library(rnaends)

Co-translational exoribonucleolytic degradation analysis using 5’-end RNA-Seq data

Based on the publication from Khemici V, Prados J, Linder P, Redder P (2015) Decay-Initiating Endoribonucleolytic Cleavage by RNase Y Is Kept under Tight Control via Sequence Preference and Sub-cellular Localisation. PLoS Genetics, 11. https://doi.org/10.1371/journal.pgen.1005577.

Download datasets

Data from this article are available on Gene Expression Omnibus (GEO) database: GSE68811

For this notebook, we will only use 4 RNA-Seq samples corresponding to 4 EMOTE assays on Staphylococcus aureus SA564 strain (GSM1682239, GSM1682240, GSM1682241, GSM1682242).

We download data directly in a the “extdata” directory of the GitLab project in the sub-directory called “RNaseY_Khemici” where the reference genome and the genome annotation files are already provided when the project is cloned.

options(timeout = 3600)

list_SRR = c("SRR2017583", "SRR2017584", "SRR2017585", "SRR2017586")

for (i in list_SRR){
    raw_fastq = paste0("../extdata/RNaseY_Khemici/", i, ".fastq.gz")
    if (!file.exists(raw_fastq))
        download.file(paste0("https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=",i),
                      destfile = raw_fastq)
}

Pre-processing

Parse reads (+ read features)

In order to make reads mappable, we must remove the oligonucleotide that is still present at the 5’-end of the reads in the FASTQ files.

We use the function parse_read to remove this oligonucleotide and filter out reads for which the oligo does not contain all the expected elements. These reads have the same structure as in Prados et al. 2016 (https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-3211-3).

parse_reads requires a read_features table which describes each element that should be found in the reads.

Following the Material & Methods section of the article, reads are not trimmed and should look like this (X is the sequence of the 5’end of RNA to be mapped to the reference):

      1       2            5      6            20     21    27      28 30     31                  50
      N     - (EMOTE barcode) -   CGGCACCAACCGAGG    - VVVVVVV -     CGC     - XXXXXXXXXXXXXXXXXXXXX
1 random nt -  EMOTE barcode  -   recognition.seq    -   UMI   - control.seq -     RNA 5'-end

We need to keep only the “RNA 5’-end” part that should map on the reference genome.

init_read_features and add_read_feature functions are used to create a read_features table which specifies the read structure.

The initialization of the read_features table is initialized with the init_read_features function because the minimum read_features table contains the readseq feature (i.e. the mappable part to keep in the FASTQ file).

According to our scheme, the readseq part begins at nucleotide 31 and is 20 nucleotides long (Illumina50).

rf <- init_read_features(start = 31, width = 20)

We add other features such as barcode which begins at position 2 and is 4 bases long.

For the pattern parameter, we specify the authorized barcode sequences to validate the feature (here we allowed 1 out of 4 sequences).

We use the pattern_type parameter to specify the type of pattern we seek: “constant” means that we query the exact character string provided in the pattern parameter allowing n number of mismatches (specified by the max_mismatch parameter).

Finally, the readid_prepend parameter is used to specify if the feature retrieved should be stored in the read identifier (1st line of a read records in the FASTQ file). It may be useful to keep the information of the sequence found of a feature since they are removed and therefore absent from the FASTQ output file.

rf <- add_read_feature(rf,
                       name = "barcode",
                       start = 2,
                       width = 4,
                       pattern_type = "constant",
                       pattern = c("GTAT","CAAG","TACA","TCGG"),
                       readid_prepend = FALSE)

The recognition sequence feature begins at position 6 and is 15 bases long. We expect a single unique sequence with maximum 1 mismatch:

rf <- add_read_feature(rf,
                       name = "Recog. seq",
                       start = 6,
                       width = 15,
                       pattern_type = "constant",
                       pattern = "CGGCACCAACCGAGG",
                       max_mismatch = 1,
                       readid_prepend = FALSE)

The UMI feature begins at position 21 and is 7 bases long.

UMIs are random nucleotide sequences. Thus, the patern_type is set to “nucleotides”. It it possible to provide a list of expected characters (here A, C or G).

We set readid_prepend = TRUE. This is mandatory as we will need the UMI associated to the 5’-end RNA during the quantification to remove PCR duplicates.

Note: the readseq feature has the same pattern_type than UMI which is set by default to A, T, C or G random sequence.

rf <- add_read_feature(rf,
                       name = "UMI",
                       start = 21,
                       width = 7,
                       pattern_type = "nucleotides",
                       pattern = "ACG",
                       max_mismatch = 0,
                       readid_prepend = TRUE)

The control sequence feature begins at position 28 and is 3 bases long. We expect a specific sequence (pattern_type = "constant").

rf <- add_read_feature(rf,
                       name = "Control seq.",
                       start = 28,
                       width = 3,
                       pattern_type = "constant",
                       pattern = "CGC",
                       max_mismatch = 0,
                       readid_prepend = FALSE)

The final read_features table describing the read structure is as follows:

rf %>%
  arrange(start) %>%
  knitr::kable()

name

start

width

pattern_type

pattern

max_mismatch

readid_prepend

barcode

2

4

constant

GTAT, CA….

0

FALSE

Recog. seq

6

15

constant

CGGCACCA….

1

FALSE

UMI

21

7

nucleotides

ACG

0

TRUE

Control seq.

28

3

constant

CGC

0

FALSE

readseq

31

20

nucleotides

ACTG

0

FALSE

Now, we can use the parse_reads function on the 4 FASTQ files using the read_features table as parameter.

lapply(
    list.files("../extdata/RNaseY_Khemici", pattern = ".fastq.gz$", full.names = TRUE),
    FUN = parse_reads,
    yieldsize = 5e6, # Number of reads parsed at each iteration
    features = rf, # The features table
    out_dir = "../extdata/RNaseY_Khemici/parse_results",
    keep_invalid_reads = FALSE, # We don't want to store invalid reads
    force = FALSE,
) %>%
  bind_rows() %>%
  head(15) %>%
  knitr::kable()
#> Demux report already exists set force=T to overwrite: ../extdata/RNaseY_Khemici/parse_results/SRR2017583_parse_report.csv
#> Demux report already exists set force=T to overwrite: ../extdata/RNaseY_Khemici/parse_results/SRR2017584_parse_report.csv
#> Demux report already exists set force=T to overwrite: ../extdata/RNaseY_Khemici/parse_results/SRR2017585_parse_report.csv
#> Demux report already exists set force=T to overwrite: ../extdata/RNaseY_Khemici/parse_results/SRR2017586_parse_report.csv

is_valid_readseq

is_valid_barcode

is_valid_Recog. seq.

is_valid_UMI

is_valid_Control seq.

count

FALSE

FALSE

FALSE

FALSE

FALSE

7

FALSE

FALSE

FALSE

TRUE

FALSE

25

FALSE

FALSE

FALSE

TRUE

TRUE

2

FALSE

FALSE

TRUE

FALSE

FALSE

12

FALSE

FALSE

TRUE

TRUE

FALSE

43

FALSE

TRUE

TRUE

FALSE

FALSE

5404

FALSE

TRUE

TRUE

FALSE

TRUE

696

FALSE

TRUE

TRUE

TRUE

FALSE

16155

FALSE

TRUE

TRUE

TRUE

TRUE

17331

TRUE

FALSE

FALSE

FALSE

FALSE

3294

TRUE

FALSE

FALSE

FALSE

TRUE

50

TRUE

FALSE

FALSE

TRUE

FALSE

16606

TRUE

FALSE

FALSE

TRUE

TRUE

154

TRUE

FALSE

TRUE

FALSE

FALSE

4571

TRUE

FALSE

TRUE

FALSE

TRUE

61

parse_reads returned a table which contains some information on the features validity.

Upset plots for each raw original FASTQ file.

report_SRR2017583 <- read_csv("../extdata/RNaseY_Khemici/parse_results/SRR2017583_parse_report.csv", show_col_types = FALSE)

report_SRR2017583 %>%
  filter(count >=100) %>%
  reads_parse_report_ggupset(margin=0.2)

report_SRR2017584 <- read_csv("../extdata/RNaseY_Khemici/parse_results/SRR2017584_parse_report.csv", show_col_types = FALSE)

report_SRR2017584 %>%
  filter(count >=10) %>%
  reads_parse_report_ggupset(margin=0.2) 

report_SRR2017585 <- read_csv("../extdata/RNaseY_Khemici/parse_results/SRR2017585_parse_report.csv", show_col_types = FALSE)

report_SRR2017585 %>%  
  filter(count >=100) %>%
  reads_parse_report_ggupset(margin = 0.15)

report_SRR2017586 <- read_csv("../extdata/RNaseY_Khemici/parse_results/SRR2017586_parse_report.csv", show_col_types = FALSE)

report_SRR2017586 %>%
  filter(count >=100) %>%
  reads_parse_report_ggupset(margin = 0.15)

All repotrs combined:

report_SRR2017583 %>%
  bind_rows(report_SRR2017584) %>%
  bind_rows(report_SRR2017585) %>%
  bind_rows(report_SRR2017586) %>%
  group_by(across(starts_with("is_valid_"))) %>%
  summarise(count = sum(count), .groups = "drop") %>% 
  filter(count>5000) %>%
  reads_parse_report_ggupset(angle = 45, margin=0.35)

Quantification

Mapping

Reads are now ready to be aligned on the reference genome using the map_fastq_files function.

Note: a single FASTQ file can be mapped with the map_fastq function.

bf <- map_fastq_files(
  list_fastq_files = list.files("../extdata/RNaseY_Khemici/parse_results", 
                                pattern = "_valid.fastq.gz$", 
                                full.names = TRUE) ,
  index_name = "../extdata/RNaseY_Khemici/CP010890.bt", # Path of the genome index (created if not exists)
  fasta_name = "../extdata/RNaseY_Khemici/CP010890.fasta", # Path of the reference genome file
  bam_dir = "../extdata/RNaseY_Khemici/mapping_results", # Output directory
  max_map_mismatch = 1,
  aligner = "bowtie", # Either bowtie or subread
  map_unique = TRUE,
  force = FALSE,
  threads = 3
)
#> bam_file '../extdata/RNaseY_Khemici/mapping_results/SRR2017583_valid.fastq.gz.bam' already exists, set force=TRUE to recompute
#> bam_file '../extdata/RNaseY_Khemici/mapping_results/SRR2017584_valid.fastq.gz.bam' already exists, set force=TRUE to recompute
#> bam_file '../extdata/RNaseY_Khemici/mapping_results/SRR2017585_valid.fastq.gz.bam' already exists, set force=TRUE to recompute
#> bam_file '../extdata/RNaseY_Khemici/mapping_results/SRR2017586_valid.fastq.gz.bam' already exists, set force=TRUE to recompute
bf
#> [[1]]
#> [1] "../extdata/RNaseY_Khemici/mapping_results/SRR2017583_valid.fastq.gz.bam"
#> 
#> [[2]]
#> [1] "../extdata/RNaseY_Khemici/mapping_results/SRR2017584_valid.fastq.gz.bam"
#> 
#> [[3]]
#> [1] "../extdata/RNaseY_Khemici/mapping_results/SRR2017585_valid.fastq.gz.bam"
#> 
#> [[4]]
#> [1] "../extdata/RNaseY_Khemici/mapping_results/SRR2017586_valid.fastq.gz.bam"

Mapping statistics

Use the map_stats_bam_files function to get some basic statistics on the mapping we’ve just made:

map_stats_bam_files(bam_files = bf) %>%
  knitr::kable()
#> BAM file mapping stats exists '../extdata/RNaseY_Khemici/mapping_results/SRR2017583_valid.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute
#> BAM file mapping stats exists '../extdata/RNaseY_Khemici/mapping_results/SRR2017584_valid.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute
#> BAM file mapping stats exists '../extdata/RNaseY_Khemici/mapping_results/SRR2017585_valid.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute
#> BAM file mapping stats exists '../extdata/RNaseY_Khemici/mapping_results/SRR2017586_valid.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute

bam_file

mapped

unmapped

pc_mapped

SRR2017583_valid.fastq.gz.bam

1785951

1147874

60.87

SRR2017584_valid.fastq.gz.bam

585774

210122

73.60

SRR2017585_valid.fastq.gz.bam

999717

477088

67.69

SRR2017586_valid.fastq.gz.bam

1074276

494215

68.49

Quantification (counts computation)

In order to obtain counts by position, we use the quantify_from_bam_files function to parse previously obtained BAM files in bf.

This function allows the quantification of either the 5’-ends (quantify_5prime) or the 3’-ends (quantify_3prime), in both cases, it is an exact position quantification.

For UMIs to be taken into account for the quantification, we provide the read_features table with the “features” parameter.

The keep_UMI parameter is used to record the number of copy of each UMI by exact position. This level of information is not nedded for this analysis and we use it as default (FALSE).

count_table <- quantify_from_bam_files(
    bam_files = bf,
    sample_ids = c("WT_rep1","WT_rep2","DY_rep1","DY_rep2"),
    fun = quantify_5prime, # Either quantify_5prime or quantify_3prime
    features = rf,
    keep_UMI = FALSE) 
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/RNaseY_Khemici/mapping_results/SRR2017583_valid.fastq.gz.bam.csv
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/RNaseY_Khemici/mapping_results/SRR2017584_valid.fastq.gz.bam.csv
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/RNaseY_Khemici/mapping_results/SRR2017585_valid.fastq.gz.bam.csv
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/RNaseY_Khemici/mapping_results/SRR2017586_valid.fastq.gz.bam.csv
count_table %>% 
  head(15) %>%
  knitr::kable()

sample_id

rname

position

strand

reads

count

WT_rep1

CP010890.1

313

+

67

34

WT_rep1

CP010890.1

314

+

7

3

WT_rep1

CP010890.1

316

+

5

3

WT_rep1

CP010890.1

362

+

4

3

WT_rep1

CP010890.1

370

+

1

1

WT_rep1

CP010890.1

374

+

1

1

WT_rep1

CP010890.1

377

+

1

1

WT_rep1

CP010890.1

412

+

1

1

WT_rep1

CP010890.1

413

+

1

1

WT_rep1

CP010890.1

437

+

1

1

WT_rep1

CP010890.1

445

+

1

1

WT_rep1

CP010890.1

467

+

1

1

WT_rep1

CP010890.1

491

+

1

1

WT_rep1

CP010890.1

493

+

9

3

WT_rep1

CP010890.1

513

+

1

1

The above count table will allow us to perform some downstream analysis.

Note: here, the count column corresponds to the number of different UMIs mapped at this exact position. This is why it is lower or equal to the reads values (due to PCR duplicates).

Downstream analysis

Ribosome profilling

Note: the following functions can take BAM files as input instead of count table. In that case, additional parameters must be provided that correspond to the quantify_from_bam_files function parameters.

Annotation table preparation

In order to retrieve the number of reads mapped on the CDSs, we need to have access to the position (start and stop) of these CDSs.

We simply retrieved the annotation file of the reference strain SA564 from the NCBI that we filter to get only the feature corresponding to “CDS”.

Note: we also switch the positions for CDS on the reverse strand start and stop as their genomic positions are inverted.

annot <- read_tsv("../extdata/RNaseY_Khemici/CP010890.gff3", 
                  show_col_types = FALSE,
                  col_names = c("chr",
                                "type",
                                "feature",
                                "begin",
                                "end",
                                "V1",
                                "strand",
                                "V2",
                                "annotation")) %>% 
  filter(feature=="CDS") %>%
  mutate(start = ifelse(strand=="-" & begin<end, end, begin),
         end = ifelse(strand=="-" & begin<end, begin, end)) %>%
  select(chr, feature, strand, start, end, annotation)

annot %>% 
  head(15) %>%
  knitr::kable()

chr

feature

strand

start

end

annotation

CP010890.1

CDS

+

517

1878

ID=cds-ALD79596.1;Parent=gene-RT87_00005;Dbxref=NCBI_GP:ALD79596.1;Name=ALD79596.1;gbkey=CDS;inference=EXISTENCE: similar to AA sequence:RefSeq:WP_001290432.1;locus_tag=RT87_00005;product=chromosomal replication initiation protein;protein_id=ALD79596.1;transl_table=11

CP010890.1

CDS

+

2156

3289

ID=cds-ALD79597.1;Parent=gene-RT87_00010;Dbxref=NCBI_GP:ALD79597.1;Name=ALD79597.1;gbkey=CDS;inference=EXISTENCE: similar to AA sequence:RefSeq:WP_000969809.1;locus_tag=RT87_00010;product=DNA polymerase III subunit beta;protein_id=ALD79597.1;transl_table=11

CP010890.1

CDS

+

3679

3915

ID=cds-ALD79598.1;Parent=gene-RT87_00015;Dbxref=NCBI_GP:ALD79598.1;Name=ALD79598.1;gbkey=CDS;inference=EXISTENCE: similar to AA sequence:RefSeq:WP_000249646.1;locus_tag=RT87_00015;product=RNA-binding protein;protein_id=ALD79598.1;transl_table=11

CP010890.1

CDS

+

3912

5024

ID=cds-ALD79599.1;Parent=gene-RT87_00020;Dbxref=NCBI_GP:ALD79599.1;Name=ALD79599.1;gbkey=CDS;inference=EXISTENCE: similar to AA sequence:RefSeq:WP_000774109.1;locus_tag=RT87_00020;product=recombinase RecF;protein_id=ALD79599.1;transl_table=11

CP010890.1

CDS

+

5034

6968

ID=cds-ALD79600.1;Parent=gene-RT87_00025;Dbxref=NCBI_GP:ALD79600.1;Name=ALD79600.1;Note=negatively supercoils closed circular double-stranded DNA;gbkey=CDS;gene=gyrB;inference=EXISTENCE: similar to AA sequence:SwissProt:Q6GD85.3;locus_tag=RT87_00025;product=DNA gyrase subunit B;protein_id=ALD79600.1;transl_table=11

CP010890.1

CDS

+

7005

9674

ID=cds-ALD79601.1;Parent=gene-RT87_00030;Dbxref=NCBI_GP:ALD79601.1;Name=ALD79601.1;gbkey=CDS;inference=EXISTENCE: similar to AA sequence:RefSeq:WP_001561428.1;locus_tag=RT87_00030;product=DNA gyrase subunit A;protein_id=ALD79601.1;transl_table=11

CP010890.1

CDS

-

10574

9762

ID=cds-ALD79602.1;Parent=gene-RT87_00035;Dbxref=NCBI_GP:ALD79602.1;Name=ALD79602.1;gbkey=CDS;inference=EXISTENCE: similar to AA sequence:SwissProt:Q2G2P8.2;locus_tag=RT87_00035;product=carbohydrate kinase;protein_id=ALD79602.1;transl_table=11

CP010890.1

CDS

+

10900

12414

ID=cds-ALD79603.1;Parent=gene-RT87_00040;Dbxref=NCBI_GP:ALD79603.1;Name=ALD79603.1;gbkey=CDS;inference=EXISTENCE: similar to AA sequence:SwissProt:Q6GKT7.1;locus_tag=RT87_00040;product=histidine ammonia-lyase;protein_id=ALD79603.1;transl_table=11

CP010890.1

CDS

+

12793

14079

ID=cds-ALD79604.1;Parent=gene-RT87_00045;Dbxref=NCBI_GP:ALD79604.1;Name=ALD79604.1;gbkey=CDS;inference=EXISTENCE: similar to AA sequence:SwissProt:Q2YUR2.1;locus_tag=RT87_00045;product=seryl-tRNA synthetase;protein_id=ALD79604.1;transl_table=11

CP010890.1

CDS

+

14730

15425

ID=cds-ALD79605.1;Parent=gene-RT87_00050;Dbxref=NCBI_GP:ALD79605.1;Name=ALD79605.1;gbkey=CDS;inference=EXISTENCE: similar to AA sequence:RefSeq:WP_000190725.1;locus_tag=RT87_00050;product=azaleucine resistance protein AzlC;protein_id=ALD79605.1;transl_table=11

CP010890.1

CDS

+

15422

15751

ID=cds-ALD79606.1;Parent=gene-RT87_00055;Dbxref=NCBI_GP:ALD79606.1;Name=ALD79606.1;gbkey=CDS;inference=EXISTENCE: similar to AA sequence:RefSeq:WP_000628045.1;locus_tag=RT87_00055;product=membrane protein;protein_id=ALD79606.1;transl_table=11

CP010890.1

CDS

+

16114

17082

ID=cds-ALD79607.1;Parent=gene-RT87_00060;Dbxref=NCBI_GP:ALD79607.1;Name=ALD79607.1;gbkey=CDS;inference=EXISTENCE: similar to AA sequence:RefSeq:WP_001796480.1;locus_tag=RT87_00060;product=homoserine acetyltransferase;protein_id=ALD79607.1;transl_table=11

CP010890.1

CDS

+

17390

18313

ID=cds-ALD79608.1;Parent=gene-RT87_00065;Dbxref=NCBI_GP:ALD79608.1;Name=ALD79608.1;gbkey=CDS;inference=EXISTENCE: similar to AA sequence:RefSeq:WP_000491736.1;locus_tag=RT87_00065;product=membrane protein;protein_id=ALD79608.1;transl_table=11

CP010890.1

CDS

+

18328

20295

ID=cds-ALD79609.1;Parent=gene-RT87_00070;Dbxref=NCBI_GP:ALD79609.1;Name=ALD79609.1;gbkey=CDS;inference=EXISTENCE: similar to AA sequence:RefSeq:WP_001081633.1;locus_tag=RT87_00070;product=hypothetical protein;protein_id=ALD79609.1;transl_table=11

CP010890.1

CDS

+

20292

20738

ID=cds-ALD79610.1;Parent=gene-RT87_00075;Dbxref=NCBI_GP:ALD79610.1;Name=ALD79610.1;gbkey=CDS;inference=EXISTENCE: similar to AA sequence:SwissProt:Q2G2T3.1;locus_tag=RT87_00075;product=50S ribosomal protein L9;protein_id=ALD79610.1;transl_table=11

Metagene counts

Now that we have the count table and the annotation table, we can compute the metacount profiles around the first nucleotide of the start or stop codon using the metacounts_profile_feature function.

First, we extract the regions spanning around the feature (START or STOP codon) with metacounts_profile_feature:

start_tb <- annot %>%
  select(rname=chr, strand, position=start)
start_metacounts <- metacounts_feature(count_table, start_tb, width=100)

start_metacounts %>% 
  head(15) %>%
  knitr::kable()

x

count

-100

0.0059492

-99

0.0057643

-98

0.0052539

-97

0.0053888

-96

0.0060536

-95

0.0055115

-94

0.0050401

-93

0.0056643

-92

0.0050730

-91

0.0044526

-90

0.0050401

-89

0.0057813

-88

0.0059047

-87

0.0050401

-86

0.0049141

Then, we can visualize the metacounts profile:

p_start_metacounts <- start_metacounts %>% 
  filter(between(x, -42, 75)) %>% 
  mutate(`Translation frame` = as.factor(x%%3+1)) %>%
  ggplot(aes(x = x, y = count, fill = `Translation frame`)) + 
    geom_bar(stat = "identity", alpha=0.8) + 
    labs(x = " Distance to the first nucleotide of start codon", 
         y = "Metacounts") +
    scale_x_continuous(breaks = seq(-40, 75, by=10)) +
    scale_fill_manual(values = c("#ff5050","#3366ff", "#ffcc00")) +
    theme_bw() + 
    theme(legend.position = "top")
p_start_metacounts  

We perform the same for the STOP codon (adjusting for the position of the 1st nucleotide):

stop_tb <- annot %>%
  select(rname=chr, strand, position=end) %>%
  mutate(position = ifelse(strand=='+', position-2, position+2))
stop_metacounts <- metacounts_feature(count_table, stop_tb, width=100)
stop_metacounts %>% 
  head(15) %>%
  knitr::kable()

x

count

-100

0.0055727

-99

0.0044619

-98

0.0068743

-97

0.0061145

-96

0.0050028

-95

0.0060474

-94

0.0051904

-93

0.0035142

-92

0.0063497

-91

0.0061145

-90

0.0048556

-89

0.0064173

-88

0.0056928

-87

0.0048775

-86

0.0066952

Metacount profile obtained:

stop_metacounts %>% 
  filter(between(x, -81, 20)) %>% 
  mutate(`Translation frame` = as.factor(x%%3+1)) %>%
  ggplot(aes(x = x, y = count, fill = `Translation frame`)) + 
  geom_bar(stat = "identity", alpha=0.8) + 
  labs(x = " Distance to the first nucleotide of stop codon", 
       y = "Metacounts") +
  scale_fill_manual(values = c("#ff5050","#3366ff", "#ffcc00")) +
  scale_x_continuous(breaks = seq(-80,20, by=10)) +
  theme_bw() +
  theme(legend.position = "top")

Periodicity

From the genome annotation, we can extract regions corresponding to protein coding genes, compute the metacounts profile for all these regions and use FFT to analyse periodicity in the counts.

cds_tb <- annot %>% 
  select(rname = chr, strand, begin=start, end)
fft_tb <- FFT(count_table = count_table,
              region_tb = cds_tb)
fft_tb %>% 
  head(15) %>%
  knitr::kable()

signal

freq

period

0.1097155

137

3.007299

0.0087828

136

3.029412

0.0079080

201

2.049751

0.0060486

197

2.091371

0.0059655

138

2.985507

0.0056546

35

11.771429

0.0055253

57

7.228070

0.0054040

135

3.051852

0.0053650

187

2.203209

0.0053174

53

7.773585

0.0052531

155

2.658065

0.0052046

50

8.240000

0.0051031

32

12.875000

0.0050365

173

2.381503

0.0049383

81

5.086420

Visualization with ggplot2

p_fft <- fft_tb %>% 
  filter(period < 7) %>% 
  ggplot(aes(x = period, y = signal)) + 
  geom_line(aes()) + 
  labs(x = "Period", y = "Signal") +
  scale_color_brewer(palette="Set1") +
  theme_bw()
p_fft

Frame preference

The third analysis provided in this downstream analysis is the count of reads falling into each translation frame (1, 2 or 3). The frame_preference function takes the same inputs as the 2 previous functions.

frame_pref_tb <- frame_preference(count_table = count_table,
                                  region_tb = cds_tb)
frame_pref_tb %>%
  knitr::kable()

frame

pc

1

24.54694

2

41.39385

3

34.05922

Visualization with ggplot2

p_frame_pref <- frame_pref_tb %>% 
  ggplot(aes(x = frame, y = pc, fill = as.factor(frame))) +
    geom_col(position = "dodge")  + 
    labs(x = " Translation frame", y = "Percentage") +
    theme_bw() + 
    scale_fill_manual(name = "Translation frame", values = c("#ff5050", "#3366ff", "#ffcc00")) +
  theme(legend.position = "top")
p_frame_pref

Final plot for the paper:

library(ggpubr)
p1 <- p_start_metacounts + theme(legend.position = "none")
p3 <- p_frame_pref + theme(legend.position = "none")
ggarrange(p1, 
          ggarrange(p_fft, p3, nrow=2, labels = c("B","C")),
          widths = c(3,1),
          labels = c("A","")
)