Co-translational exoribonucleolytic degradation profiling using 5’-end RNA-Seq data
Complete source of this page is available in the notebook Degradation_profiling.Rmd.
library(tidyverse)
library(ggpubr)
library(rnaends)
Co-translational exoribonucleolytic degradation analysis using 5’-end RNA-Seq data
Based on the publication from Khemici V, Prados J, Linder P, Redder P (2015) Decay-Initiating Endoribonucleolytic Cleavage by RNase Y Is Kept under Tight Control via Sequence Preference and Sub-cellular Localisation. PLoS Genetics, 11. https://doi.org/10.1371/journal.pgen.1005577.
Download datasets
Data from this article are available on Gene Expression Omnibus (GEO) database: GSE68811
For this notebook, we will only use 4 RNA-Seq samples corresponding to 4 EMOTE assays on Staphylococcus aureus SA564 strain (GSM1682239, GSM1682240, GSM1682241, GSM1682242).
We download data directly in a the “extdata” directory of the GitLab project in the sub-directory called “RNaseY_Khemici” where the reference genome and the genome annotation files are already provided when the project is cloned.
options(timeout = 3600)
list_SRR = c("SRR2017583", "SRR2017584", "SRR2017585", "SRR2017586")
for (i in list_SRR){
raw_fastq = paste0("../extdata/RNaseY_Khemici/", i, ".fastq.gz")
if (!file.exists(raw_fastq))
download.file(paste0("https://trace.ncbi.nlm.nih.gov/Traces/sra-reads-be/fastq?acc=",i),
destfile = raw_fastq)
}
Pre-processing
Parse reads (+ read features)
In order to make reads mappable, we must remove the oligonucleotide that is still present at the 5’-end of the reads in the FASTQ files.
We use the function parse_read
to remove this oligonucleotide and filter out reads for which the oligo does not contain all the expected elements. These reads have the same structure as in Prados et al. 2016 (https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-3211-3).
parse_reads
requires a read_features
table which describes each element that should be found in the reads.
Following the Material & Methods section of the article, reads are not trimmed and should look like this (X is the sequence of the 5’end of RNA to be mapped to the reference):
1 2 5 6 20 21 27 28 30 31 50
N - (EMOTE barcode) - CGGCACCAACCGAGG - VVVVVVV - CGC - XXXXXXXXXXXXXXXXXXXXX
1 random nt - EMOTE barcode - recognition.seq - UMI - control.seq - RNA 5'-end
We need to keep only the “RNA 5’-end” part that should map on the reference genome.
init_read_features
and add_read_feature
functions are used to create a read_features
table which specifies the read structure.
The initialization of the read_features
table is initialized with the init_read_features
function because the minimum read_features
table contains the readseq feature (i.e. the mappable part to keep in the FASTQ file).
According to our scheme, the readseq part begins at nucleotide 31 and is 20 nucleotides long (Illumina50).
rf <- init_read_features(start = 31, width = 20)
We add other features such as barcode which begins at position 2 and is 4 bases long.
For the pattern
parameter, we specify the authorized barcode sequences to validate the feature (here we allowed 1 out of 4 sequences).
We use the pattern_type
parameter to specify the type of pattern we seek: “constant” means that we query the exact character string provided in the pattern parameter allowing n number of mismatches (specified by the max_mismatch
parameter).
Finally, the readid_prepend
parameter is used to specify if the feature retrieved should be stored in the read identifier (1st line of a read records in the FASTQ file). It may be useful to keep the information of the sequence found of a feature since they are removed and therefore absent from the FASTQ output file.
rf <- add_read_feature(rf,
name = "barcode",
start = 2,
width = 4,
pattern_type = "constant",
pattern = c("GTAT","CAAG","TACA","TCGG"),
readid_prepend = FALSE)
The recognition sequence feature begins at position 6 and is 15 bases long. We expect a single unique sequence with maximum 1 mismatch:
rf <- add_read_feature(rf,
name = "Recog. seq",
start = 6,
width = 15,
pattern_type = "constant",
pattern = "CGGCACCAACCGAGG",
max_mismatch = 1,
readid_prepend = FALSE)
The UMI feature begins at position 21 and is 7 bases long.
UMIs are random nucleotide sequences. Thus, the patern_type
is set to “nucleotides”. It it possible to provide a list of expected characters (here A, C or G).
We set readid_prepend = TRUE
. This is mandatory as we will need the UMI associated to the 5’-end RNA during the quantification to remove PCR duplicates.
Note: the readseq feature has the same pattern_type
than UMI which is set by default to A, T, C or G random sequence.
rf <- add_read_feature(rf,
name = "UMI",
start = 21,
width = 7,
pattern_type = "nucleotides",
pattern = "ACG",
max_mismatch = 0,
readid_prepend = TRUE)
The control sequence feature begins at position 28 and is 3 bases long. We expect a specific sequence (pattern_type = "constant"
).
rf <- add_read_feature(rf,
name = "Control seq.",
start = 28,
width = 3,
pattern_type = "constant",
pattern = "CGC",
max_mismatch = 0,
readid_prepend = FALSE)
The final read_features
table describing the read structure is as follows:
rf %>%
arrange(start) %>%
knitr::kable()
name |
start |
width |
pattern_type |
pattern |
max_mismatch |
readid_prepend |
---|---|---|---|---|---|---|
barcode |
2 |
4 |
constant |
GTAT, CA…. |
0 |
FALSE |
Recog. seq |
6 |
15 |
constant |
CGGCACCA…. |
1 |
FALSE |
UMI |
21 |
7 |
nucleotides |
ACG |
0 |
TRUE |
Control seq. |
28 |
3 |
constant |
CGC |
0 |
FALSE |
readseq |
31 |
20 |
nucleotides |
ACTG |
0 |
FALSE |
Now, we can use the parse_reads
function on the 4 FASTQ files using the read_features
table as parameter.
lapply(
list.files("../extdata/RNaseY_Khemici", pattern = ".fastq.gz$", full.names = TRUE),
FUN = parse_reads,
yieldsize = 5e6, # Number of reads parsed at each iteration
features = rf, # The features table
out_dir = "../extdata/RNaseY_Khemici/parse_results",
keep_invalid_reads = FALSE, # We don't want to store invalid reads
force = FALSE,
) %>%
bind_rows() %>%
head(15) %>%
knitr::kable()
#> Demux report already exists set force=T to overwrite: ../extdata/RNaseY_Khemici/parse_results/SRR2017583_parse_report.csv
#> Demux report already exists set force=T to overwrite: ../extdata/RNaseY_Khemici/parse_results/SRR2017584_parse_report.csv
#> Demux report already exists set force=T to overwrite: ../extdata/RNaseY_Khemici/parse_results/SRR2017585_parse_report.csv
#> Demux report already exists set force=T to overwrite: ../extdata/RNaseY_Khemici/parse_results/SRR2017586_parse_report.csv
is_valid_readseq |
is_valid_barcode |
is_valid_Recog. seq. |
is_valid_UMI |
is_valid_Control seq. |
count |
---|---|---|---|---|---|
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
7 |
FALSE |
FALSE |
FALSE |
TRUE |
FALSE |
25 |
FALSE |
FALSE |
FALSE |
TRUE |
TRUE |
2 |
FALSE |
FALSE |
TRUE |
FALSE |
FALSE |
12 |
FALSE |
FALSE |
TRUE |
TRUE |
FALSE |
43 |
FALSE |
TRUE |
TRUE |
FALSE |
FALSE |
5404 |
FALSE |
TRUE |
TRUE |
FALSE |
TRUE |
696 |
FALSE |
TRUE |
TRUE |
TRUE |
FALSE |
16155 |
FALSE |
TRUE |
TRUE |
TRUE |
TRUE |
17331 |
TRUE |
FALSE |
FALSE |
FALSE |
FALSE |
3294 |
TRUE |
FALSE |
FALSE |
FALSE |
TRUE |
50 |
TRUE |
FALSE |
FALSE |
TRUE |
FALSE |
16606 |
TRUE |
FALSE |
FALSE |
TRUE |
TRUE |
154 |
TRUE |
FALSE |
TRUE |
FALSE |
FALSE |
4571 |
TRUE |
FALSE |
TRUE |
FALSE |
TRUE |
61 |
parse_reads
returned a table which contains some information on the features validity.
Upset plots for each raw original FASTQ file.
report_SRR2017583 <- read_csv("../extdata/RNaseY_Khemici/parse_results/SRR2017583_parse_report.csv", show_col_types = FALSE)
report_SRR2017583 %>%
filter(count >=100) %>%
reads_parse_report_ggupset(margin=0.2)
report_SRR2017584 <- read_csv("../extdata/RNaseY_Khemici/parse_results/SRR2017584_parse_report.csv", show_col_types = FALSE)
report_SRR2017584 %>%
filter(count >=10) %>%
reads_parse_report_ggupset(margin=0.2)
report_SRR2017585 <- read_csv("../extdata/RNaseY_Khemici/parse_results/SRR2017585_parse_report.csv", show_col_types = FALSE)
report_SRR2017585 %>%
filter(count >=100) %>%
reads_parse_report_ggupset(margin = 0.15)
report_SRR2017586 <- read_csv("../extdata/RNaseY_Khemici/parse_results/SRR2017586_parse_report.csv", show_col_types = FALSE)
report_SRR2017586 %>%
filter(count >=100) %>%
reads_parse_report_ggupset(margin = 0.15)
All repotrs combined:
report_SRR2017583 %>%
bind_rows(report_SRR2017584) %>%
bind_rows(report_SRR2017585) %>%
bind_rows(report_SRR2017586) %>%
group_by(across(starts_with("is_valid_"))) %>%
summarise(count = sum(count), .groups = "drop") %>%
filter(count>5000) %>%
reads_parse_report_ggupset(angle = 45, margin=0.35)
Quantification
Mapping
Reads are now ready to be aligned on the reference genome using the map_fastq_files
function.
Note: a single FASTQ file can be mapped with the map_fastq
function.
bf <- map_fastq_files(
list_fastq_files = list.files("../extdata/RNaseY_Khemici/parse_results",
pattern = "_valid.fastq.gz$",
full.names = TRUE) ,
index_name = "../extdata/RNaseY_Khemici/CP010890.bt", # Path of the genome index (created if not exists)
fasta_name = "../extdata/RNaseY_Khemici/CP010890.fasta", # Path of the reference genome file
bam_dir = "../extdata/RNaseY_Khemici/mapping_results", # Output directory
max_map_mismatch = 1,
aligner = "bowtie", # Either bowtie or subread
map_unique = TRUE,
force = FALSE,
threads = 3
)
#> bam_file '../extdata/RNaseY_Khemici/mapping_results/SRR2017583_valid.fastq.gz.bam' already exists, set force=TRUE to recompute
#> bam_file '../extdata/RNaseY_Khemici/mapping_results/SRR2017584_valid.fastq.gz.bam' already exists, set force=TRUE to recompute
#> bam_file '../extdata/RNaseY_Khemici/mapping_results/SRR2017585_valid.fastq.gz.bam' already exists, set force=TRUE to recompute
#> bam_file '../extdata/RNaseY_Khemici/mapping_results/SRR2017586_valid.fastq.gz.bam' already exists, set force=TRUE to recompute
bf
#> [[1]]
#> [1] "../extdata/RNaseY_Khemici/mapping_results/SRR2017583_valid.fastq.gz.bam"
#>
#> [[2]]
#> [1] "../extdata/RNaseY_Khemici/mapping_results/SRR2017584_valid.fastq.gz.bam"
#>
#> [[3]]
#> [1] "../extdata/RNaseY_Khemici/mapping_results/SRR2017585_valid.fastq.gz.bam"
#>
#> [[4]]
#> [1] "../extdata/RNaseY_Khemici/mapping_results/SRR2017586_valid.fastq.gz.bam"
Mapping statistics
Use the map_stats_bam_files
function to get some basic statistics on the mapping we’ve just made:
map_stats_bam_files(bam_files = bf) %>%
knitr::kable()
#> BAM file mapping stats exists '../extdata/RNaseY_Khemici/mapping_results/SRR2017583_valid.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute
#> BAM file mapping stats exists '../extdata/RNaseY_Khemici/mapping_results/SRR2017584_valid.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute
#> BAM file mapping stats exists '../extdata/RNaseY_Khemici/mapping_results/SRR2017585_valid.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute
#> BAM file mapping stats exists '../extdata/RNaseY_Khemici/mapping_results/SRR2017586_valid.fastq.gz.bam.map_stats.csv' already exists, set force=TRUE to recompute
bam_file |
mapped |
unmapped |
pc_mapped |
---|---|---|---|
SRR2017583_valid.fastq.gz.bam |
1785951 |
1147874 |
60.87 |
SRR2017584_valid.fastq.gz.bam |
585774 |
210122 |
73.60 |
SRR2017585_valid.fastq.gz.bam |
999717 |
477088 |
67.69 |
SRR2017586_valid.fastq.gz.bam |
1074276 |
494215 |
68.49 |
Quantification (counts computation)
In order to obtain counts by position, we use the quantify_from_bam_files
function to parse previously obtained BAM files in bf
.
This function allows the quantification of either the 5’-ends (quantify_5prime
) or the 3’-ends (quantify_3prime
), in both cases, it is an exact position quantification.
For UMIs to be taken into account for the quantification, we provide the read_features
table with the “features” parameter.
The keep_UMI parameter is used to record the number of copy of each UMI by exact position. This level of information is not nedded for this analysis and we use it as default (FALSE).
count_table <- quantify_from_bam_files(
bam_files = bf,
sample_ids = c("WT_rep1","WT_rep2","DY_rep1","DY_rep2"),
fun = quantify_5prime, # Either quantify_5prime or quantify_3prime
features = rf,
keep_UMI = FALSE)
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/RNaseY_Khemici/mapping_results/SRR2017583_valid.fastq.gz.bam.csv
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/RNaseY_Khemici/mapping_results/SRR2017584_valid.fastq.gz.bam.csv
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/RNaseY_Khemici/mapping_results/SRR2017585_valid.fastq.gz.bam.csv
#> Quantification 5' output file exists, set force=T to overwrite: ../extdata/RNaseY_Khemici/mapping_results/SRR2017586_valid.fastq.gz.bam.csv
count_table %>%
head(15) %>%
knitr::kable()
sample_id |
rname |
position |
strand |
reads |
count |
---|---|---|---|---|---|
WT_rep1 |
CP010890.1 |
313 |
+ |
67 |
34 |
WT_rep1 |
CP010890.1 |
314 |
+ |
7 |
3 |
WT_rep1 |
CP010890.1 |
316 |
+ |
5 |
3 |
WT_rep1 |
CP010890.1 |
362 |
+ |
4 |
3 |
WT_rep1 |
CP010890.1 |
370 |
+ |
1 |
1 |
WT_rep1 |
CP010890.1 |
374 |
+ |
1 |
1 |
WT_rep1 |
CP010890.1 |
377 |
+ |
1 |
1 |
WT_rep1 |
CP010890.1 |
412 |
+ |
1 |
1 |
WT_rep1 |
CP010890.1 |
413 |
+ |
1 |
1 |
WT_rep1 |
CP010890.1 |
437 |
+ |
1 |
1 |
WT_rep1 |
CP010890.1 |
445 |
+ |
1 |
1 |
WT_rep1 |
CP010890.1 |
467 |
+ |
1 |
1 |
WT_rep1 |
CP010890.1 |
491 |
+ |
1 |
1 |
WT_rep1 |
CP010890.1 |
493 |
+ |
9 |
3 |
WT_rep1 |
CP010890.1 |
513 |
+ |
1 |
1 |
The above count table will allow us to perform some downstream analysis.
Note: here, the count column corresponds to the number of different UMIs mapped at this exact position. This is why it is lower or equal to the reads values (due to PCR duplicates).
Downstream analysis
Ribosome profilling
Note: the following functions can take BAM files as input instead of count table. In that case, additional parameters must be provided that correspond to the quantify_from_bam_files
function parameters.
Annotation table preparation
In order to retrieve the number of reads mapped on the CDSs, we need to have access to the position (start and stop) of these CDSs.
We simply retrieved the annotation file of the reference strain SA564 from the NCBI that we filter to get only the feature corresponding to “CDS”.
Note: we also switch the positions for CDS on the reverse strand start and stop as their genomic positions are inverted.
annot <- read_tsv("../extdata/RNaseY_Khemici/CP010890.gff3",
show_col_types = FALSE,
col_names = c("chr",
"type",
"feature",
"begin",
"end",
"V1",
"strand",
"V2",
"annotation")) %>%
filter(feature=="CDS") %>%
mutate(start = ifelse(strand=="-" & begin<end, end, begin),
end = ifelse(strand=="-" & begin<end, begin, end)) %>%
select(chr, feature, strand, start, end, annotation)
annot %>%
head(15) %>%
knitr::kable()
chr |
feature |
strand |
start |
end |
annotation |
---|---|---|---|---|---|
CP010890.1 |
CDS |
+ |
517 |
1878 |
ID=cds-ALD79596.1;Parent=gene-RT87_00005;Dbxref=NCBI_GP:ALD79596.1;Name=ALD79596.1;gbkey=CDS;inference=EXISTENCE: similar to AA sequence:RefSeq:WP_001290432.1;locus_tag=RT87_00005;product=chromosomal replication initiation protein;protein_id=ALD79596.1;transl_table=11 |
CP010890.1 |
CDS |
+ |
2156 |
3289 |
ID=cds-ALD79597.1;Parent=gene-RT87_00010;Dbxref=NCBI_GP:ALD79597.1;Name=ALD79597.1;gbkey=CDS;inference=EXISTENCE: similar to AA sequence:RefSeq:WP_000969809.1;locus_tag=RT87_00010;product=DNA polymerase III subunit beta;protein_id=ALD79597.1;transl_table=11 |
CP010890.1 |
CDS |
+ |
3679 |
3915 |
ID=cds-ALD79598.1;Parent=gene-RT87_00015;Dbxref=NCBI_GP:ALD79598.1;Name=ALD79598.1;gbkey=CDS;inference=EXISTENCE: similar to AA sequence:RefSeq:WP_000249646.1;locus_tag=RT87_00015;product=RNA-binding protein;protein_id=ALD79598.1;transl_table=11 |
CP010890.1 |
CDS |
+ |
3912 |
5024 |
ID=cds-ALD79599.1;Parent=gene-RT87_00020;Dbxref=NCBI_GP:ALD79599.1;Name=ALD79599.1;gbkey=CDS;inference=EXISTENCE: similar to AA sequence:RefSeq:WP_000774109.1;locus_tag=RT87_00020;product=recombinase RecF;protein_id=ALD79599.1;transl_table=11 |
CP010890.1 |
CDS |
+ |
5034 |
6968 |
ID=cds-ALD79600.1;Parent=gene-RT87_00025;Dbxref=NCBI_GP:ALD79600.1;Name=ALD79600.1;Note=negatively supercoils closed circular double-stranded DNA;gbkey=CDS;gene=gyrB;inference=EXISTENCE: similar to AA sequence:SwissProt:Q6GD85.3;locus_tag=RT87_00025;product=DNA gyrase subunit B;protein_id=ALD79600.1;transl_table=11 |
CP010890.1 |
CDS |
+ |
7005 |
9674 |
ID=cds-ALD79601.1;Parent=gene-RT87_00030;Dbxref=NCBI_GP:ALD79601.1;Name=ALD79601.1;gbkey=CDS;inference=EXISTENCE: similar to AA sequence:RefSeq:WP_001561428.1;locus_tag=RT87_00030;product=DNA gyrase subunit A;protein_id=ALD79601.1;transl_table=11 |
CP010890.1 |
CDS |
- |
10574 |
9762 |
ID=cds-ALD79602.1;Parent=gene-RT87_00035;Dbxref=NCBI_GP:ALD79602.1;Name=ALD79602.1;gbkey=CDS;inference=EXISTENCE: similar to AA sequence:SwissProt:Q2G2P8.2;locus_tag=RT87_00035;product=carbohydrate kinase;protein_id=ALD79602.1;transl_table=11 |
CP010890.1 |
CDS |
+ |
10900 |
12414 |
ID=cds-ALD79603.1;Parent=gene-RT87_00040;Dbxref=NCBI_GP:ALD79603.1;Name=ALD79603.1;gbkey=CDS;inference=EXISTENCE: similar to AA sequence:SwissProt:Q6GKT7.1;locus_tag=RT87_00040;product=histidine ammonia-lyase;protein_id=ALD79603.1;transl_table=11 |
CP010890.1 |
CDS |
+ |
12793 |
14079 |
ID=cds-ALD79604.1;Parent=gene-RT87_00045;Dbxref=NCBI_GP:ALD79604.1;Name=ALD79604.1;gbkey=CDS;inference=EXISTENCE: similar to AA sequence:SwissProt:Q2YUR2.1;locus_tag=RT87_00045;product=seryl-tRNA synthetase;protein_id=ALD79604.1;transl_table=11 |
CP010890.1 |
CDS |
+ |
14730 |
15425 |
ID=cds-ALD79605.1;Parent=gene-RT87_00050;Dbxref=NCBI_GP:ALD79605.1;Name=ALD79605.1;gbkey=CDS;inference=EXISTENCE: similar to AA sequence:RefSeq:WP_000190725.1;locus_tag=RT87_00050;product=azaleucine resistance protein AzlC;protein_id=ALD79605.1;transl_table=11 |
CP010890.1 |
CDS |
+ |
15422 |
15751 |
ID=cds-ALD79606.1;Parent=gene-RT87_00055;Dbxref=NCBI_GP:ALD79606.1;Name=ALD79606.1;gbkey=CDS;inference=EXISTENCE: similar to AA sequence:RefSeq:WP_000628045.1;locus_tag=RT87_00055;product=membrane protein;protein_id=ALD79606.1;transl_table=11 |
CP010890.1 |
CDS |
+ |
16114 |
17082 |
ID=cds-ALD79607.1;Parent=gene-RT87_00060;Dbxref=NCBI_GP:ALD79607.1;Name=ALD79607.1;gbkey=CDS;inference=EXISTENCE: similar to AA sequence:RefSeq:WP_001796480.1;locus_tag=RT87_00060;product=homoserine acetyltransferase;protein_id=ALD79607.1;transl_table=11 |
CP010890.1 |
CDS |
+ |
17390 |
18313 |
ID=cds-ALD79608.1;Parent=gene-RT87_00065;Dbxref=NCBI_GP:ALD79608.1;Name=ALD79608.1;gbkey=CDS;inference=EXISTENCE: similar to AA sequence:RefSeq:WP_000491736.1;locus_tag=RT87_00065;product=membrane protein;protein_id=ALD79608.1;transl_table=11 |
CP010890.1 |
CDS |
+ |
18328 |
20295 |
ID=cds-ALD79609.1;Parent=gene-RT87_00070;Dbxref=NCBI_GP:ALD79609.1;Name=ALD79609.1;gbkey=CDS;inference=EXISTENCE: similar to AA sequence:RefSeq:WP_001081633.1;locus_tag=RT87_00070;product=hypothetical protein;protein_id=ALD79609.1;transl_table=11 |
CP010890.1 |
CDS |
+ |
20292 |
20738 |
ID=cds-ALD79610.1;Parent=gene-RT87_00075;Dbxref=NCBI_GP:ALD79610.1;Name=ALD79610.1;gbkey=CDS;inference=EXISTENCE: similar to AA sequence:SwissProt:Q2G2T3.1;locus_tag=RT87_00075;product=50S ribosomal protein L9;protein_id=ALD79610.1;transl_table=11 |
Metagene counts
Now that we have the count table and the annotation table, we can compute the metacount profiles around the first nucleotide of the start or stop codon using the metacounts_profile_feature
function.
First, we extract the regions spanning around the feature (START or STOP codon) with metacounts_profile_feature
:
start_tb <- annot %>%
select(rname=chr, strand, position=start)
start_metacounts <- metacounts_feature(count_table, start_tb, width=100)
start_metacounts %>%
head(15) %>%
knitr::kable()
x |
count |
---|---|
-100 |
0.0059492 |
-99 |
0.0057643 |
-98 |
0.0052539 |
-97 |
0.0053888 |
-96 |
0.0060536 |
-95 |
0.0055115 |
-94 |
0.0050401 |
-93 |
0.0056643 |
-92 |
0.0050730 |
-91 |
0.0044526 |
-90 |
0.0050401 |
-89 |
0.0057813 |
-88 |
0.0059047 |
-87 |
0.0050401 |
-86 |
0.0049141 |
Then, we can visualize the metacounts profile:
p_start_metacounts <- start_metacounts %>%
filter(between(x, -42, 75)) %>%
mutate(`Translation frame` = as.factor(x%%3+1)) %>%
ggplot(aes(x = x, y = count, fill = `Translation frame`)) +
geom_bar(stat = "identity", alpha=0.8) +
labs(x = " Distance to the first nucleotide of start codon",
y = "Metacounts") +
scale_x_continuous(breaks = seq(-40, 75, by=10)) +
scale_fill_manual(values = c("#ff5050","#3366ff", "#ffcc00")) +
theme_bw() +
theme(legend.position = "top")
p_start_metacounts
We perform the same for the STOP codon (adjusting for the position of the 1st nucleotide):
stop_tb <- annot %>%
select(rname=chr, strand, position=end) %>%
mutate(position = ifelse(strand=='+', position-2, position+2))
stop_metacounts <- metacounts_feature(count_table, stop_tb, width=100)
stop_metacounts %>%
head(15) %>%
knitr::kable()
x |
count |
---|---|
-100 |
0.0055727 |
-99 |
0.0044619 |
-98 |
0.0068743 |
-97 |
0.0061145 |
-96 |
0.0050028 |
-95 |
0.0060474 |
-94 |
0.0051904 |
-93 |
0.0035142 |
-92 |
0.0063497 |
-91 |
0.0061145 |
-90 |
0.0048556 |
-89 |
0.0064173 |
-88 |
0.0056928 |
-87 |
0.0048775 |
-86 |
0.0066952 |
Metacount profile obtained:
stop_metacounts %>%
filter(between(x, -81, 20)) %>%
mutate(`Translation frame` = as.factor(x%%3+1)) %>%
ggplot(aes(x = x, y = count, fill = `Translation frame`)) +
geom_bar(stat = "identity", alpha=0.8) +
labs(x = " Distance to the first nucleotide of stop codon",
y = "Metacounts") +
scale_fill_manual(values = c("#ff5050","#3366ff", "#ffcc00")) +
scale_x_continuous(breaks = seq(-80,20, by=10)) +
theme_bw() +
theme(legend.position = "top")
Periodicity
From the genome annotation, we can extract regions corresponding to protein coding genes, compute the metacounts profile for all these regions and use FFT to analyse periodicity in the counts.
cds_tb <- annot %>%
select(rname = chr, strand, begin=start, end)
fft_tb <- FFT(count_table = count_table,
region_tb = cds_tb)
fft_tb %>%
head(15) %>%
knitr::kable()
signal |
freq |
period |
---|---|---|
0.1097155 |
137 |
3.007299 |
0.0087828 |
136 |
3.029412 |
0.0079080 |
201 |
2.049751 |
0.0060486 |
197 |
2.091371 |
0.0059655 |
138 |
2.985507 |
0.0056546 |
35 |
11.771429 |
0.0055253 |
57 |
7.228070 |
0.0054040 |
135 |
3.051852 |
0.0053650 |
187 |
2.203209 |
0.0053174 |
53 |
7.773585 |
0.0052531 |
155 |
2.658065 |
0.0052046 |
50 |
8.240000 |
0.0051031 |
32 |
12.875000 |
0.0050365 |
173 |
2.381503 |
0.0049383 |
81 |
5.086420 |
Visualization with ggplot2
p_fft <- fft_tb %>%
filter(period < 7) %>%
ggplot(aes(x = period, y = signal)) +
geom_line(aes()) +
labs(x = "Period", y = "Signal") +
scale_color_brewer(palette="Set1") +
theme_bw()
p_fft
Frame preference
The third analysis provided in this downstream analysis is the count of reads falling into each translation frame (1, 2 or 3). The frame_preference
function takes the same inputs as the 2 previous functions.
frame_pref_tb <- frame_preference(count_table = count_table,
region_tb = cds_tb)
frame_pref_tb %>%
knitr::kable()
frame |
pc |
---|---|
1 |
24.54694 |
2 |
41.39385 |
3 |
34.05922 |
Visualization with ggplot2
p_frame_pref <- frame_pref_tb %>%
ggplot(aes(x = frame, y = pc, fill = as.factor(frame))) +
geom_col(position = "dodge") +
labs(x = " Translation frame", y = "Percentage") +
theme_bw() +
scale_fill_manual(name = "Translation frame", values = c("#ff5050", "#3366ff", "#ffcc00")) +
theme(legend.position = "top")
p_frame_pref
Final plot for the paper:
library(ggpubr)
p1 <- p_start_metacounts + theme(legend.position = "none")
p3 <- p_frame_pref + theme(legend.position = "none")
ggarrange(p1,
ggarrange(p_fft, p3, nrow=2, labels = c("B","C")),
widths = c(3,1),
labels = c("A","")
)