3’-end post-transcriptional modifications exploration and differential proportion analysis

Complete source of this page is available in the notebook Post-transcriptional_modifications.Rmd.

Based on the publication from Xu X, Usher B, Gutierrez C, Barriot R, Arrowsmith TJ, Han X, Redder P, Neyrolles O, Blower TR, Genevaux P (2023) MenT nucleotidyltransferase toxins extend tRNA acceptor stems and can be inhibited by asymmetrical antitoxin binding. Nature Communications, 14, 4644. https://doi.org/10.1038/s41467-023-40264-3.

Download dataset

Data accompanying the article are publicly available on European Nucleotide Archive (ENA) at EBI https://www.ebi.ac.uk/ena/browser/view/PRJEB62085

For this notebook, we will use some RNA-Seq samples multiplexed into one raw FASTQ file to reproduce some results on the action of toxin MenT1 which adds ribonucleotides at the 3’-end of tRNAs as illustrated in figure 3E of the original paper.

The dataset will be directly downloaded in the ../extdata/MenT1 directory of the project (that should be readily available if the GitLab project is cloned).

The initial ../extdata/MenT1 directory contains also the reference sequences of the tRNAs (multifasta file) to which we will align the 3’-end sequencing reads: Msm.reference.tRNAs.with.CCA.fasta.

library(rnaends)

library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr     1.1.4     ✔ readr     2.1.5
#> ✔ forcats   1.0.0     ✔ stringr   1.5.1
#> ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
#> ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
#> ✔ purrr     1.0.2     
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

options(width = 250)

data_dir <- "../extdata/MenT1"

raw_fastq <- paste0(data_dir, "/Msm.total.RNA.MenT1.Library.fq.gz")

if (!file.exists(raw_fastq))
    download.file("ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR114/ERR11439143/Msm.total.RNA.MenT1.Library.fq.gz",
                  destfile = raw_fastq)

Pre-processing of reads prior mapping

In this section, we will parse the raw reads to:

validate their content
split the FASTQ file into 4 different FASTQ files, each one of them corresponding to a sample: control without toxin or tRNAs with toxin and replicate 1 or 2

For more information on the other pre-processing features available, see the vignette on pre-processing features.

The reads are 100 nt long and their structure can be schematized as follows:

read position: 1   2                      6                      25    40                 46                                   
               N · multiplexing barcode · recognition sequence · UMI · control sequence · 3'-5' tRNA-end reverse complement    
               │   │                      │                       │       │               └ 55nt long                          
               │   │                      │                       │       └ must be AGATAC                                     
               │   │                      │                       └ 15 random nucleotides (A, T, C or G) serving as            
               │   │                      │                         Unique Molecule Identifier (UMI) for PCR duplicates removal
               │   │                      └ must be CGGCACCAACCGAGGCTCA                                                        
               │   └ must be one of ACAC, ATTG, GACG or GGTA                                                                   
               │                                                                                                               
               └ one random nucleotide at the beginning                                                                        

Here, we define the authorized barcodes and their associated experimental condition (in same order) so that we can reuse them later.

barcodes   <- c("ACAC",    "ATTG",    "GACG",     "GGTA")
conditions <- c("ctrl.r1", "ctrl.r2", "MenT1.r1", "MenT1.r2")

The above schematized structure is translated to a read_features table that will be used for reads parsing and validation.

For its initialization, the user must specify which region on the reads contains the sequence destined to be aligned on the reference sequences.

read_features <- init_read_features(start = 46, width = 55) # region to map
read_features %>%
  knitr::kable()

name	start	width	pattern_type	pattern	max_mismatch	readid_prepend
readseq	46	55	nucleotides	ACTG	0	FALSE

Then, the table is updated with each structure element of the reads.

The barcode from position 2 to 5 is a feature of type “constant” which is a set of authorized constant string at a certain location:

read_features <- add_read_feature(read_features,
                                  name = "barcode",
                                  start = 2,
                                  width = 4,
                                  pattern = barcodes,
                                  pattern_type = "constant")

The UMI from position 25 to 39 are composed of A, T, C or G, thus is a feature of type “nucleotides” which is a string composed of a certain alphabet. This feature will be extracted and stored in the read identifier for future use in the quantification step (to remove PCR duplicates), thus the readid_prepend = TRUE argument).

read_features <- add_read_feature(read_features,
                                  name = "UMI",
                                  start = 25,
                                  width = 15,
                                  pattern = "ACGT",
                                  pattern_type =  "nucleotides",
                                  readid_prepend = TRUE)

Then, the recognition sequence and the control sequence correspond to constant character strings and will be used to validate the reads. These are features of type “constant”.

read_features <- add_read_feature(read_features,
                                  name = "recognition_seq",
                                  start = 6,
                                  width = 19,
                                  pattern = "CGGCACCAACCGAGGCTCA",
                                  pattern_type = "constant",
                                  max_mismatch = 1,
                                  readid_prepend = FALSE)
read_features <- add_read_feature(read_features,
                                  name = "control_seq",
                                  start = 40,
                                  width = 6,
                                  pattern = "AGATAC",
                                  pattern_type = "constant",
                                  readid_prepend = FALSE)
read_features %>%
  knitr::kable()

name	start	width	pattern_type	pattern	max_mismatch	readid_prepend
readseq	46	55	nucleotides	ACTG	0	FALSE
barcode	2	4	constant	ACAC, AT….	0	FALSE
UMI	25	15	nucleotides	ACGT	0	TRUE
recognition_seq	6	19	constant	CGGCACCA….	1	FALSE
control_seq	40	6	constant	AGATAC	0	FALSE

Once the read_featurestable is defined, parsing of read can be performed. As we need also to demultiplex the samples here, we will directly calldemultiplexwhich will call theparse_read` internally to parse and validate the reads.

Demultiplex original raw FASTQ file(s) into separate FASTQ files

Reads are validated against the read features and only valid reads are found in the demultiplexed FASTQ files.

demux_report <- demultiplex(fastq_file = raw_fastq,
                            yieldsize = 1e6, # number of reads parsed at each iteration
                            features = read_features,
                            force = FALSE, # demultiplex will not do anything if there already exists a report
                            keep_invalid_reads = FALSE) # FALSE: we do not want invalid reads to be returned in a separate FASTQ file
#> Demux report already exists set force=T to overwrite: ../extdata/MenT1/Msm.total.RNA.MenT1.Library_demux/Msm.total.RNA.MenT1.Library_parse_report.csv
demux_report %>%
  knitr::kable()

barcode_group	is_valid_readseq	is_valid_barcode	is_valid_UMI	is_valid_recognition_seq	is_valid_control_seq	count	demux_filename
ACAC	FALSE	TRUE	TRUE	TRUE	TRUE	39	NA
ACAC	TRUE	TRUE	TRUE	TRUE	TRUE	996140	../data/MenT1/Msm.total.RNA.MenT1.Library_demux/Msm.total.RNA.MenT1.Library_ACAC_valid.fastq.gz
ATTG	FALSE	TRUE	TRUE	TRUE	TRUE	35	NA
ATTG	TRUE	TRUE	TRUE	TRUE	TRUE	761196	../data/MenT1/Msm.total.RNA.MenT1.Library_demux/Msm.total.RNA.MenT1.Library_ATTG_valid.fastq.gz
GACG	FALSE	TRUE	TRUE	TRUE	TRUE	25	NA
GACG	TRUE	TRUE	TRUE	TRUE	TRUE	614477	../data/MenT1/Msm.total.RNA.MenT1.Library_demux/Msm.total.RNA.MenT1.Library_GACG_valid.fastq.gz
GGTA	FALSE	TRUE	TRUE	TRUE	TRUE	39	NA
GGTA	TRUE	TRUE	TRUE	TRUE	TRUE	1059530	../data/MenT1/Msm.total.RNA.MenT1.Library_demux/Msm.total.RNA.MenT1.Library_GGTA_valid.fastq.gz

The report provides statistics on reads for each sample together with the fastq output file.

Mapping of reads and quantification of 3’-ends abundances in a count table

As, reads should align partially, we will use subread.

# input
demux_dir <- paste0(sub("(.fastq|.fq)(.gz)?$", "", raw_fastq), "_demux")
fastq_files <- demux_report$demux_filename %>% 
  na.omit()

# params for mapping
reference_genome <- "Msm.reference.tRNAs.with.CCA.fasta"
reference_fasta <- paste0(data_dir, "/", reference_genome)
reference_index <- paste0(data_dir, "/", reference_genome, ".subread")
aligner <- "subread"
max_mismatch <- 0
bam_dir <- paste0(demux_dir,
                  "/mm",
                  max_mismatch,
                  "_",
                  aligner,
                  "_",
                  reference_genome)

# perform mapping
bam_files <- lapply(
  fastq_files,
  map_fastq,
  fasta_name = reference_fasta,
  index_name = reference_index,
  max_map_mismatch = max_mismatch,
  aligner = aligner,
  force = FALSE,
  bam_dir = bam_dir,
  threads = 4
)
#> bam_file '../extdata/MenT1/Msm.total.RNA.MenT1.Library_demux/mm0_subread_Msm.reference.tRNAs.with.CCA.fasta/Msm.total.RNA.MenT1.Library_ACAC_valid.fastq.gz.bam' already exists, set force=TRUE to recompute
#> bam_file '../extdata/MenT1/Msm.total.RNA.MenT1.Library_demux/mm0_subread_Msm.reference.tRNAs.with.CCA.fasta/Msm.total.RNA.MenT1.Library_ATTG_valid.fastq.gz.bam' already exists, set force=TRUE to recompute
#> bam_file '../extdata/MenT1/Msm.total.RNA.MenT1.Library_demux/mm0_subread_Msm.reference.tRNAs.with.CCA.fasta/Msm.total.RNA.MenT1.Library_GACG_valid.fastq.gz.bam' already exists, set force=TRUE to recompute
#> bam_file '../extdata/MenT1/Msm.total.RNA.MenT1.Library_demux/mm0_subread_Msm.reference.tRNAs.with.CCA.fasta/Msm.total.RNA.MenT1.Library_GGTA_valid.fastq.gz.bam' already exists, set force=TRUE to recompute

Convert bam files to count table with `quantify_3prime`

The quantify_3prime is called on each BAM file. The features table is passed to take into account the UMIs that were present in the reads to remove duplicates.

count_table <- quantify_from_bam_files(bam_files,
                                       conditions,
                                       quantify_3prime,
                                       features = read_features,
                                       min_mapped_length = 20,
                                       rev_comp_reads = TRUE,
                                       keep_UMI = FALSE)
#> Quantification 3' output file exists, set force=T to overwrite: ../extdata/MenT1/Msm.total.RNA.MenT1.Library_demux/mm0_subread_Msm.reference.tRNAs.with.CCA.fasta/Msm.total.RNA.MenT1.Library_ACAC_valid.fastq.gz.bam.csv
#> Quantification 3' output file exists, set force=T to overwrite: ../extdata/MenT1/Msm.total.RNA.MenT1.Library_demux/mm0_subread_Msm.reference.tRNAs.with.CCA.fasta/Msm.total.RNA.MenT1.Library_ATTG_valid.fastq.gz.bam.csv
#> Quantification 3' output file exists, set force=T to overwrite: ../extdata/MenT1/Msm.total.RNA.MenT1.Library_demux/mm0_subread_Msm.reference.tRNAs.with.CCA.fasta/Msm.total.RNA.MenT1.Library_GACG_valid.fastq.gz.bam.csv
#> Quantification 3' output file exists, set force=T to overwrite: ../extdata/MenT1/Msm.total.RNA.MenT1.Library_demux/mm0_subread_Msm.reference.tRNAs.with.CCA.fasta/Msm.total.RNA.MenT1.Library_GGTA_valid.fastq.gz.bam.csv
count_table %>%
  head(15) %>%
  knitr::kable()

sample_id	rname	position	strand	downstream_seq	reads	count
ctrl.r1	Msm.Ala-CGC.trna33	25	+	ACCACACAG	1	1
ctrl.r1	Msm.Ala-CGC.trna33	25	+	ACCACACAGGCAGTGTGGGGGCCAGG	1	1
ctrl.r1	Msm.Ala-CGC.trna33	25	+	ACCACACAGGCAGTGTGGGGGT	1	1
ctrl.r1	Msm.Ala-CGC.trna33	25	+	ACCACACGGGCAGT	1	1
ctrl.r1	Msm.Ala-CGC.trna33	25	+	ACCACACGGGCAGTGTGGGGGCCAGGGGTTA	1	1
ctrl.r1	Msm.Ala-CGC.trna33	25	+	ACCACACGGGCAGTGTGGGGGTCAGG	1	1
ctrl.r1	Msm.Ala-CGC.trna33	25	+	ACCACACGGGCAGTGTGGGGGTTAGG	1	1
ctrl.r1	Msm.Ala-CGC.trna33	25	+	ACCACACTGGGAGT	1	1
ctrl.r1	Msm.Ala-CGC.trna33	25	+	ACCACACTGGTAG	1	1
ctrl.r1	Msm.Ala-CGC.trna33	25	+	ACCACACTGGTAGT	1	1
ctrl.r1	Msm.Ala-CGC.trna33	25	+	ACCACACTGGTAGTGTGGGGGTCAGGG	1	1
ctrl.r1	Msm.Ala-CGC.trna33	25	+	ACCACACTGTCTCCTAGGCTCCACCA	1	1
ctrl.r1	Msm.Ala-CGC.trna33	25	+	ACCACACTTGCAGTGTGTGGGTCAGGG	1	1
ctrl.r1	Msm.Ala-CGC.trna33	25	+	ACCACAGTGCTGCACGGCGGTCTGATC	1	1
ctrl.r1	Msm.Ala-CGC.trna33	25	+	ACCACAGTGGCAGT	1	1

Downstream analysis: 3’-end post-transciptional modifications exploration

Pivot count_table to have counts for each sample for the same 3’-end. For noise reduction, we will only keep rows for which there are at least 10 reads in one sample.

count_table_w <- count_table %>%
  select(-reads) %>%
  pivot_wider(names_from = "sample_id", values_from = "count", values_fill = 0) %>%
  filter(if_any(all_of(conditions), ~ . >= 10))
count_table_w %>%
  head(15) %>%
  knitr::kable()

rname	position	strand	ctrl.r1	ctrl.r2	MenT1.r1	MenT1.r2
Msm.Ala-CGC.trna33	26	+	137	2	6	145
Msm.Ala-CGC.trna33	27	+	130	12	15	151
Msm.Ala-CGC.trna33	28	+	79	5	12	86
Msm.Ala-CGC.trna33	29	+	75	12	15	93
Msm.Ala-CGC.trna33	30	+	58	27	25	78
Msm.Ala-CGC.trna33	31	+	62	5	8	78
Msm.Ala-CGC.trna33	32	+	36	5	5	41
Msm.Ala-CGC.trna33	33	+	72	31	29	91
Msm.Ala-CGC.trna33	34	+	72	99	106	99
Msm.Ala-CGC.trna33	35	+	142	341	382	166
Msm.Ala-CGC.trna33	36	+	167	44	55	150
Msm.Ala-CGC.trna33	37	+	158	107	95	115
Msm.Ala-CGC.trna33	38	+	66	13	8	59
Msm.Ala-CGC.trna33	39	+	113	18	20	106
Msm.Ala-CGC.trna33	40	+	146	8	6	132

Next, we are going to focus on full length tRNAs. Thus, we will need their length.

tRNAs with expected length from reference sequences FASTA file:

library(Biostrings)
tRNAs_DNASS <- readDNAStringSet(paste0(data_dir, "/", reference_genome))
tRNAs <- tibble(rlength = width(tRNAs_DNASS),
                seq = as.character(tRNAs_DNASS),
                name = names(tRNAs_DNASS)) %>%
  mutate(id = str_extract(name, "trna\\d+.\\S+"),
         rname = as.factor(str_extract(name, "^\\S+"))) %>%
  select(rname, rlength, seq)
tRNAs %>%
  head(15) %>%
  knitr::kable()

rname	rlength	seq
Msm.Ala-CGC.trna33	76	GGGGCTATGGCGCAGTTGGTAGCGCGACTCGTTCGCATCGAGTAGGTCAGGGGTTCGATTCCCCTTAGCTCCACCA
Msm.Ala-GGC.trna9	76	GGGGCTATGGCGCAGTTGGTAGCGCACCACACTGGCAGTGTGGGGGTCAGGGGTTCGAATCCCCTTAGCTCCACCA
Msm.Ala-TGC.trna2	76	GGGGCCTTAGCTCAGTTGGTAGAGCGCTGCCTTTGCAAGGCAGATGTCAGGAGTTCGAATCTCCTAGGCTCCACCA
Msm.Arg-ACG.trna27	76	GCGCCCGTAGCTCAACGGATAGAGCATCTGACTACGGATCAGAAGGTTAGGGGTTCGAATCCCTTCGGGCGCACCA
Msm.Arg-TCT.trna15	76	GCCTCCGTAGCTCAATGGATAGAGCATCGGCCTTCTAATCCGACGGTTGCAGGTTCGAGTCCTGCCGGGGGCGCCA
Msm.Arg-CCG.trna18	76	GCCCCCGTAGCTCAGGGGATAGAGCGTCTGCCTCCGGAGCAGAAGGCCGCAGGTTCGAATCCTGCCGGGGGCACCA
Msm.Arg-CCT.trna20	76	GCCCTCGTAGCTCAGGGGATAGAGCACGGCTCTCCTAAAGCCGGTGTCGCAGGTTCGAATCCTGCCGGGGGCACCA
Msm.Asn-GTT.trna37	76	TCCCCTGTAGCTCAATTGGCAGAGCATTCGGCTGTTAACCGAAGGGTTGCTGGTTCGAGTCCAGCCGGGGGAGCCA
Msm.Asp-GTC.trna31	77	GGCCCTGTGGCGCAGTTGGTTAGCGCGCCGCCCTGTCACGGCGGAGGTCGCGGGTTCAAGTCCCGTCAGGGTCGCCA
Msm.Cys-GCA.trna40	74	GCGGGTGTGGCTGAGTGGCTAGGCAGCGGCCTGCAAAGCCGTCTACACGGGTTCGAGTCCCGTCACTCGCTCCA
Msm.Cys-GCA.trna44	74	GGTGGAGTGGCCGAGTGGTGAGGCAACGGCCTGCAAAGCCGTGCACACGGGTTCGATTCCCGTCTCCACCTCCA
Msm.Gln-CTG.trna10	75	TGGGGTATGGTGTAATTGGCAACACAGCTGATTCTGGTTCAGCCATTCTAGGTTCGAGTCCTGGTACCCCAGCCA
Msm.Gln-TTG.trna19	75	TCCGTCGTGGTGTAATCGGCAGCACCTCTGATTTTGGTTCAGATAGTTCAGGTTCGAGTCCTGGCGACGGAGCCA
Msm.Glu-CTC.trna11	76	GCCCCCGTCGTCTAGCGGCCTAGGACGCCGCCCTCTCACGGCGGTAGCGTGGGTTCGAATCCCATCGGGGGTACCA
Msm.Glu-TTC.trna30	76	GCCCCCATCGTCTAGTGGCCTAGGACGCCGCCCTTTCACGGCGGTAGCACGGGTTCGAATCCCGTTGGGGGTACCA

Filter only full length tRNAs with or without Cs added post-transcriptionaly

count_table_w <- tRNAs %>%
  select(-seq) %>%
  inner_join(count_table_w, by = join_by(rname)) %>%
  filter(position == rlength, str_detect(downstream_seq, pattern = "^C*$"))
count_table_w %>%
  knitr::kable()

rname	rlength	position	strand	downstream_seq	ctrl.r1	ctrl.r2	MenT1.r1	MenT1.r2
Msm.Ala-CGC.trna33	76	76	+		74878	35268	14950	49270
Msm.Ala-CGC.trna33	76	76	+	C	48	6	5062	23377
Msm.Ala-CGC.trna33	76	76	+	CC	251	50	774	9423
Msm.Ala-CGC.trna33	76	76	+	CCC	132	22	49	655
Msm.Ala-CGC.trna33	76	76	+	CCCC	12	1	6	50
Msm.Ala-GGC.trna9	76	76	+		117766	46916	23538	91997
Msm.Ala-GGC.trna9	76	76	+	C	60	9	8894	38132
Msm.Ala-GGC.trna9	76	76	+	CC	273	49	992	10477
Msm.Ala-GGC.trna9	76	76	+	CCC	135	23	75	722
Msm.Ala-GGC.trna9	76	76	+	CCCC	25	2	4	60
Msm.Ala-TGC.trna2	76	76	+		3696	1322	660	2249
Msm.Ala-TGC.trna2	76	76	+	C	3	0	180	529
Msm.Ala-TGC.trna2	76	76	+	CC	4	0	11	133
Msm.Arg-CCG.trna18	76	76	+		19830	39402	24566	20289
Msm.Arg-CCG.trna18	76	76	+	C	37	3	10869	10466
Msm.Arg-CCG.trna18	76	76	+	CC	27	8	313	185
Msm.Arg-CCG.trna18	76	76	+	CCC	10	2	21	15
Msm.Arg-CCT.trna20	76	76	+		2006	1427	871	1834
Msm.Arg-CCT.trna20	76	76	+	CC	4	0	25	79
Msm.Arg-CCT.trna20	76	76	+	C	0	0	322	593
Msm.Asn-GTT.trna37	76	76	+		5045	1582	1101	3803
Msm.Asn-GTT.trna37	76	76	+	CC	6	1	22	157
Msm.Asn-GTT.trna37	76	76	+	C	0	0	188	925
Msm.Asp-GTC.trna31	77	77	+		1187	430	161	490
Msm.Asp-GTC.trna31	77	77	+	CC	1	0	4	13
Msm.Asp-GTC.trna31	77	77	+	C	0	0	51	143
Msm.Cys-GCA.trna44	74	74	+		15287	2177	1418	12162
Msm.Cys-GCA.trna44	74	74	+	C	14	5	742	7331
Msm.Cys-GCA.trna44	74	74	+	CC	19	5	52	736
Msm.Cys-GCA.trna44	74	74	+	CCC	19	2	0	35
Msm.Gln-CTG.trna10	75	75	+		2025	1696	1279	1536
Msm.Gln-CTG.trna10	75	75	+	C	1	1	208	289
Msm.Gln-CTG.trna10	75	75	+	CC	7	1	17	51
Msm.Glu-CTC.trna11	76	76	+		369	454	154	190
Msm.Glu-CTC.trna11	76	76	+	C	0	0	48	20
Msm.Glu-TTC.trna30	76	76	+		20182	48205	20106	10764
Msm.Glu-TTC.trna30	76	76	+	C	17	3	5826	1919
Msm.Glu-TTC.trna30	76	76	+	CC	19	4	671	334
Msm.Glu-TTC.trna30	76	76	+	CCC	2	0	57	12
Msm.Gly-GCC.trna43	76	76	+		323	68	27	139
Msm.Gly-GCC.trna43	76	76	+	C	0	0	5	37
Msm.Gly-TCC.trna14	74	74	+		14279	7436	3028	13502
Msm.Gly-TCC.trna14	74	74	+	C	25	2	2041	8300
Msm.Gly-TCC.trna14	74	74	+	CC	59	10	165	821
Msm.Gly-TCC.trna14	74	74	+	CCC	50	15	5	76
Msm.Gly-TCC.trna14	74	74	+	CCCC	33	2	5	34
Msm.Gly-TCC.trna14	74	74	+	CCCCC	4	1	1	12
Msm.His-GTG.trna16	76	76	+		227	35	16	84
Msm.His-GTG.trna16	76	76	+	C	0	0	2	32
Msm.Ile-GAT.trna1	77	77	+		22255	11100	7542	20170
Msm.Ile-GAT.trna1	77	77	+	C	11	1	1168	4340
Msm.Ile-GAT.trna1	77	77	+	CC	16	6	118	645
Msm.Ile-GAT.trna1	77	77	+	CCC	7	0	5	34
Msm.Leu-CAA.trna42	77	77	+		13281	3553	1655	7642
Msm.Leu-CAA.trna42	77	77	+	C	7	2	578	2357
Msm.Leu-CAA.trna42	77	77	+	CC	10	0	42	223
Msm.Leu-CAA.trna42	77	77	+	CCC	7	2	5	17
Msm.Leu-CAG.trna3	86	86	+		103054	76258	42932	70265
Msm.Leu-CAG.trna3	86	86	+	C	50	12	16217	20681
Msm.Leu-CAG.trna3	86	86	+	CC	93	22	1178	2112
Msm.Leu-CAG.trna3	86	86	+	CCC	55	7	59	209
Msm.Leu-CAG.trna3	86	86	+	CCCC	6	8	3	20
Msm.Leu-GAG.trna13	89	89	+		12366	17017	11314	12197
Msm.Leu-GAG.trna13	89	89	+	C	16	2	3619	2905
Msm.Leu-GAG.trna13	89	89	+	CC	15	0	476	559
Msm.Leu-GAG.trna13	89	89	+	CCC	29	5	32	67
Msm.Leu-GAG.trna13	89	89	+	CCCC	5	2	5	12
Msm.Lys-CTT.trna17	76	76	+		16554	8047	5428	10114
Msm.Lys-CTT.trna17	76	76	+	C	11	0	1049	1716
Msm.Lys-CTT.trna17	76	76	+	CC	21	2	131	517
Msm.Lys-CTT.trna17	76	76	+	CCC	9	1	11	48
Msm.Lys-TTT.trna21	76	76	+		1532	668	382	1137
Msm.Lys-TTT.trna21	76	76	+	CC	3	0	6	37
Msm.Lys-TTT.trna21	76	76	+	C	0	0	81	234
Msm.Met-CAT.trna8	77	77	+		38	8	4	21
Msm.Met-CAT.trna6	77	77	+		20	4	2	4
Msm.Pro-CGG.trna23	77	77	+		3601	2261	1796	3718
Msm.Pro-CGG.trna23	77	77	+	C	2	1	337	758
Msm.Pro-CGG.trna23	77	77	+	CC	2	0	13	57
Msm.Pro-GGG.trna41	77	77	+		73943	65589	36733	57991
Msm.Pro-GGG.trna41	77	77	+	C	50	6	8934	21092
Msm.Pro-GGG.trna41	77	77	+	CC	141	26	687	3624
Msm.Pro-GGG.trna41	77	77	+	CCC	103	16	44	220
Msm.Pro-GGG.trna41	77	77	+	CCCC	9	5	5	25
Msm.Pro-TGG.trna36	77	77	+		37	37	7	18
Msm.SeC-TCA.trna46	95	95	+		199	322	166	187
Msm.SeC-TCA.trna46	95	95	+	C	1	0	46	46
Msm.SeC-TCA.trna46	95	95	+	CC	0	0	4	10
Msm.Ser-CGA.trna28	91	91	+		190	36	18	147
Msm.Ser-CGA.trna28	91	91	+	C	0	0	3	34
Msm.Ser-GCT.trna26	92	92	+		523	300	104	523
Msm.Ser-GCT.trna26	92	92	+	C	0	0	10	93
Msm.Ser-GGA.trna24	89	89	+		15842	6155	2893	11747
Msm.Ser-GGA.trna24	89	89	+	C	11	2	709	4548
Msm.Ser-GGA.trna24	89	89	+	CC	4	1	49	504
Msm.Ser-GGA.trna24	89	89	+	CCC	0	3	3	22
Msm.Ser-TGA.trna25	90	90	+		26	13	3	21
Msm.Thr-CGT.trna29	76	76	+		319	103	34	157
Msm.Thr-CGT.trna29	76	76	+	C	1	0	7	41
Msm.Thr-GGT.trna5	76	76	+		11120	10371	5789	5106
Msm.Thr-GGT.trna5	76	76	+	C	11	0	1327	1660
Msm.Thr-GGT.trna5	76	76	+	CC	8	1	75	170
Msm.Thr-TGT	75	75	+		76	16	2	47
Msm.Thr-TGT	75	75	+	C	0	0	0	37
Msm.Trp-CCA.trna7	76	76	+		104	13	8	32
Msm.Tyr-GTA.trna4	86	86	+		411	127	83	273
Msm.Tyr-GTA.trna4	86	86	+	C	0	0	28	97
Msm.Tyr-GTA.trna4	86	86	+	CC	0	0	3	27
Msm.Val-CAC.trna12	75	75	+		952	540	327	916
Msm.Val-CAC.trna12	75	75	+	C	2	0	45	133
Msm.Val-CAC.trna12	75	75	+	CC	1	0	5	40
Msm.Val-GAC.trna45	75	75	+		54835	25926	17708	51804
Msm.Val-GAC.trna45	75	75	+	C	13	0	2205	7261
Msm.Val-GAC.trna45	75	75	+	CC	61	9	212	1511
Msm.Val-GAC.trna45	75	75	+	CCC	17	5	15	68

Visualize proportions of post-transcriptional modifications with bar plots

Pivot to longer format to compute proportions per tRNA per condition

count_table_l <- count_table_w %>%
  pivot_longer(cols = all_of(conditions),
               names_to = "condition",
               values_to = "count")
count_per_tRNA_per_sample <- count_table_l %>%
  group_by(condition, rname) %>%
  summarise(total=sum(count), .groups = "keep") %>%
  select(condition, rname, total)
# totals and percentages
count_table_l <- count_table_l  %>%
  inner_join(count_per_tRNA_per_sample, by = c("condition", "rname")) %>%
  mutate(pc = round(count / total * 100, 1)) %>%
  relocate(pc, .after = count)
count_table_l %>%
  head(15) %>%
  knitr::kable()

rname	rlength	position	strand	downstream_seq	condition	count	pc	total
Msm.Ala-CGC.trna33	76	76	+		ctrl.r1	74878	99.4	75321
Msm.Ala-CGC.trna33	76	76	+		ctrl.r2	35268	99.8	35347
Msm.Ala-CGC.trna33	76	76	+		MenT1.r1	14950	71.7	20841
Msm.Ala-CGC.trna33	76	76	+		MenT1.r2	49270	59.5	82775
Msm.Ala-CGC.trna33	76	76	+	C	ctrl.r1	48	0.1	75321
Msm.Ala-CGC.trna33	76	76	+	C	ctrl.r2	6	0.0	35347
Msm.Ala-CGC.trna33	76	76	+	C	MenT1.r1	5062	24.3	20841
Msm.Ala-CGC.trna33	76	76	+	C	MenT1.r2	23377	28.2	82775
Msm.Ala-CGC.trna33	76	76	+	CC	ctrl.r1	251	0.3	75321
Msm.Ala-CGC.trna33	76	76	+	CC	ctrl.r2	50	0.1	35347
Msm.Ala-CGC.trna33	76	76	+	CC	MenT1.r1	774	3.7	20841
Msm.Ala-CGC.trna33	76	76	+	CC	MenT1.r2	9423	11.4	82775
Msm.Ala-CGC.trna33	76	76	+	CCC	ctrl.r1	132	0.2	75321
Msm.Ala-CGC.trna33	76	76	+	CCC	ctrl.r2	22	0.1	35347
Msm.Ala-CGC.trna33	76	76	+	CCC	MenT1.r1	49	0.2	20841

palettefig <- c(
  `3'-`     = "#6696D0",
  `3'-C`    = "#E46D32",
  `3'-CC`   = "#B82123",
  `3'-CCC`  = "#F3B81B",
  `3'-CCCC` = "#D6D65F"
)

tRNA_labs = count_table_l$rname
labs = tRNA_labs %>% as.character %>% unique %>% str_sub(5,11)
names(labs) = tRNA_labs %>% as.character %>% unique 

count_table_l %>%
  mutate(rname = fct_relevel(rname,
                             sort(unique(as.character(rname)),
                                  decreasing = FALSE)),
         condition = fct_relevel(condition,
                                 sort(unique(as.character(condition)),
                                      decreasing = TRUE)),
         downstream_seq = paste0("3'-", downstream_seq)) %>%
  ggplot(aes(fill = downstream_seq, y = pc, x = condition)) +
    geom_bar(position = "stack", stat = "identity") +
    coord_flip() +
    scale_fill_manual(values = palettefig) +
    geom_text(aes(condition, 50, label = total, fill = NULL), size = 3) +
    theme_classic() +
    xlab(NULL) +
    ylab(NULL) +
    facet_wrap(facets = . ~ rname, ncol = 6, labeller = labeller(rname = labs)) +
    labs(fill = "3'-end modification")

Differential proportion analysis: which tRNAs are significantly modified by toxin?

Are the observed differences in proportions of 3’-end modified transcripts statistically significant?

For this, we will compare the proportion of modified tRNAs in the conrtol vs. the proportion of modified tRNAs in the samples treated with MenT1.

In the following, we regroup the modified tRNAs as the “fraction” and we sum up unmodified and modified tRNAs in the “total” as expected by EMOTE_differential_proportion function.

proportion_table <- count_table_l %>%
  select(-rlength) %>%
  mutate(count_of = ifelse(downstream_seq == "", "total", "fraction")) %>%
  group_by(rname, strand, position, condition, count_of) %>%
  summarise(count = sum(count), .groups = "keep") %>%
  separate(condition, into = c("group", "replicate")) %>%
  pivot_wider(names_from = "count_of", values_from = "count", values_fill=0) %>%
  mutate(total = total + fraction) %>% # for dss which expects total and fraction
  pivot_longer(cols = c("fraction", "total"),
               names_to = "count_of",
               values_to = "count") %>%
  mutate(sample_id = paste0(group, ".", replicate, ".", count_of)) %>%
  select(-group, -replicate, -count_of)
proportion_table %>%
  head(15) %>%
  knitr::kable()

rname	strand	position	count	sample_id
Msm.Ala-CGC.trna33	+	76	5891	MenT1.r1.fraction
Msm.Ala-CGC.trna33	+	76	20841	MenT1.r1.total
Msm.Ala-CGC.trna33	+	76	33505	MenT1.r2.fraction
Msm.Ala-CGC.trna33	+	76	82775	MenT1.r2.total
Msm.Ala-CGC.trna33	+	76	443	ctrl.r1.fraction
Msm.Ala-CGC.trna33	+	76	75321	ctrl.r1.total
Msm.Ala-CGC.trna33	+	76	79	ctrl.r2.fraction
Msm.Ala-CGC.trna33	+	76	35347	ctrl.r2.total
Msm.Ala-GGC.trna9	+	76	9965	MenT1.r1.fraction
Msm.Ala-GGC.trna9	+	76	33503	MenT1.r1.total
Msm.Ala-GGC.trna9	+	76	49391	MenT1.r2.fraction
Msm.Ala-GGC.trna9	+	76	141388	MenT1.r2.total
Msm.Ala-GGC.trna9	+	76	493	ctrl.r1.fraction
Msm.Ala-GGC.trna9	+	76	118259	ctrl.r1.total
Msm.Ala-GGC.trna9	+	76	83	ctrl.r2.fraction

Then, we need a design_matrix to specify the group and replicate for each sample_id and also if the counts in the count_table corresponds to the fraction or the total.

design_matrix <- tibble(sample_id = unique(sort(proportion_table$sample_id))) %>%
  separate(sample_id,
           sep = "\\.",
           into = c("group", "replicate", "count_of"),
           remove = FALSE)
design_matrix %>%
  knitr::kable()

sample_id	group	replicate	count_of
ctrl.r1.fraction	ctrl	r1	fraction
ctrl.r1.total	ctrl	r1	total
ctrl.r2.fraction	ctrl	r2	fraction
ctrl.r2.total	ctrl	r2	total
MenT1.r1.fraction	MenT1	r1	fraction
MenT1.r1.total	MenT1	r1	total
MenT1.r2.fraction	MenT1	r2	fraction
MenT1.r2.total	MenT1	r2	total

diff_prop <- differential_proportion(proportion_table,
                                     design_matrix,
                                     group1 = "ctrl",
                                     group2 = "MenT1")
#> Estimating dispersion for each CpG site, this will take a while ...
diff_prop %>%
  knitr::kable()

rname	position	strand	ctrl	MenT1	pval	fdr
Msm.Gly-TCC.trna14	74	+	0.0079261	0.4145317	0.0000000	0.0000000
Msm.Cys-GCA.trna44	74	+	0.0044360	0.3793868	0.0000000	0.0000000
Msm.Arg-CCG.trna18	76	+	0.0020238	0.3288844	0.0000000	0.0000000
Msm.Ala-GGC.trna9	76	+	0.0029674	0.3233828	0.0000000	0.0000000
Msm.Arg-CCT.trna20	76	+	0.0009950	0.2765248	0.0000000	0.0000000
Msm.Leu-CAA.trna42	77	+	0.0014642	0.2638804	0.0000000	0.0000001
Msm.Leu-CAG.trna3	86	+	0.0013089	0.2679313	0.0000000	0.0000002
Msm.Tyr-GTA.trna4	86	+	0.0000000	0.2921362	0.0000001	0.0000002
Msm.Asp-GTC.trna31	77	+	0.0004209	0.2480578	0.0000002	0.0000009
Msm.Leu-GAG.trna13	89	+	0.0028787	0.2463040	0.0000002	0.0000009
Msm.Ala-TGC.trna2	76	+	0.0009452	0.2259275	0.0000003	0.0000010
Msm.Ala-CGC.trna33	76	+	0.0040582	0.3437180	0.0000005	0.0000016
Msm.SeC-TCA.trna46	95	+	0.0025000	0.2309671	0.0000018	0.0000052
Msm.Thr-GGT.trna5	76	+	0.0009011	0.2294034	0.0000039	0.0000103
Msm.Lys-CTT.trna17	76	+	0.0014216	0.1819812	0.0000053	0.0000131
Msm.Pro-GGG.trna41	77	+	0.0024442	0.2546503	0.0000060	0.0000138
Msm.Lys-TTT.trna21	76	+	0.0009772	0.1889863	0.0000068	0.0000149
Msm.Ser-GGA.trna24	89	+	0.0009599	0.2549558	0.0000135	0.0000277
Msm.Pro-CGG.trna23	77	+	0.0007758	0.1714434	0.0000150	0.0000293
Msm.Thr-TGT	75	+	0.0000000	0.2202381	0.0000180	0.0000333
Msm.Glu-TTC.trna30	76	+	0.0010123	0.2098397	0.0000256	0.0000450
Msm.Gln-CTG.trna10	75	+	0.0025565	0.1654189	0.0000555	0.0000934
Msm.Asn-GTT.trna37	76	+	0.0009098	0.1908387	0.0000634	0.0001021
Msm.Ile-GAT.trna1	77	+	0.0010778	0.1727051	0.0001006	0.0001551
Msm.Thr-CGT.trna29	76	+	0.0015625	0.1889012	0.0001962	0.0002904
Msm.Val-CAC.trna12	75	+	0.0015707	0.1457437	0.0002308	0.0003284
Msm.Val-GAC.trna45	75	+	0.0010982	0.1332617	0.0002644	0.0003624
Msm.Gly-GCC.trna43	76	+	0.0000000	0.1832386	0.0003323	0.0004391
Msm.His-GTG.trna16	76	+	0.0000000	0.1934866	0.0009273	0.0011831
Msm.Ser-CGA.trna28	91	+	0.0000000	0.1653512	0.0011126	0.0013722
Msm.Glu-CTC.trna11	76	+	0.0000000	0.1664309	0.0042341	0.0050001
Msm.Ser-GCT.trna26	92	+	0.0000000	0.1193467	0.0043244	0.0050001
Msm.Met-CAT.trna6	77	+	0.0000000	0.0000000	0.7403609	0.8218854
Msm.Met-CAT.trna8	77	+	0.0000000	0.0000000	0.8104833	0.8218854
Msm.Pro-TGG.trna36	77	+	0.0000000	0.0000000	0.8057931	0.8218854
Msm.Ser-TGA.trna25	90	+	0.0000000	0.0000000	0.8115274	0.8218854
Msm.Trp-CCA.trna7	76	+	0.0000000	0.0000000	0.8218854	0.8218854