Pre-processing functions (parse_reads or demultiplex) of 5’/3’-ends RNA-Seq data

Complete source of this page is available in the notebook preprocessing.Rmd.

library(tidyverse)
library(rnaends)

The 3 datasets required for this notebook are provided in the extdata directory of the GitLab project.

Extraction of the mappable part of reads

To have a look on raw reads we want to parse, we first need to load the ShortRead package:

library(ShortRead)

Then, we inspect the reads:

fq_streamer <- FastqStreamer("../extdata/pre-processing_examples/Example_1.fastq.gz")
sr <- yield(fq_streamer)
close(fq_streamer)
sr@sread
#> DNAStringSet object of length 10000:
#>         width seq
#>     [1]    33 AGGAGAAGAGCGGTTCAGCAGGAATGCCGAGAC
#>     [2]    33 AGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
#>     [3]    33 AGGCCAGCGACGCGAAGTAGAATCAGTAATTTG
#>     [4]    33 AGGCAAGGAACGCCATGCGAGAGCGGTATTATC
#>     [5]    33 AGGGGTTAAGTTATTAAGGGCGCACGGTGGATG
#>     ...   ... ...
#>  [9996]    33 AGGGCAAAAACGCGCACAAAAAATGACCAAGAA
#>  [9997]    33 AGGAGGGACAGCACCGCTCTTCCGATCTTAAGC
#>  [9998]    33 AGGGTCCCCCGCTTATTGATATGCAAGATGAAG
#>  [9999]    33 AGGCGAGGCACGCTCAGTCAAGCTGATTTAAAT
#> [10000]    33 AGGGGGCGCGCGCTAAAAGCTACGCACGTTTTT

For this example we have 10k reads that are 33 nucleotides long that should be built that way:

            
        1 3         4      10     11 13      14                  33
        AGG         - VVVVVVV -    CGC      - XXXXXXXXXXXXXXXXXXXXX
recognition.seq     -   UMI   - control.seq -  RNA 5'-end (readseq)
  • The 3 first nucleotides correspond to a recognition.sequence which should always be AGG

  • The 7 next nucleotides are a random sequence constituting a Unique Molecule Identifier (UMI) where the only constraint is that it must not contain any T

  • Then it should be an exact sequence again (control.seq) that corresponds to CGC

  • Finally, we have the 5’-end of the transcript

In order to extract only the part that corresponds to the 5’-end of the transcript, we have to build a read_features table using the init_read_features function:

rf <- init_read_features(start = 14, width = 19)

By default, only A, C, T or G are allowed in the nucleotidic sequence. It is possible to change the number of mismatch allowed as well as the allowed nucleotides/characters of the readseq sequence by modifying some parameters as follows:

init_read_features(start = 14, width = 19, pattern = "ACG", max_mismatch = 1)
#> # A tibble: 1 × 7
#> # Rowwise: 
#>   name    start width pattern_type pattern max_mismatch readid_prepend
#>   <chr>   <dbl> <dbl> <chr>        <chr>          <dbl> <lgl>         
#> 1 readseq    14    19 nucleotides  ACG                1 FALSE
parse_reads(fastq_file  = "../extdata/pre-processing_examples/Example_1.fastq.gz",
            features = rf,
            force = TRUE)
#> # A tibble: 2 × 2
#>   is_valid_readseq count
#>   <lgl>            <int>
#> 1 FALSE               17
#> 2 TRUE              9983

We observe that 17 reads do not pass the ACGT (default) check for the allowed nucleotides in the readseq pattern.

fq_streamer = FastqStreamer("../extdata/pre-processing_examples/Example_1_demux/Example_1_valid.fastq.gz")
sr <- yield(fq_streamer)
close(fq_streamer)
sr@sread
#> DNAStringSet object of length 9983:
#>        width seq
#>    [1]    19 TTCAGCAGGAATGCCGAGA
#>    [2]    19 CAGCAGGAATGCCGAGACC
#>    [3]    19 GAAGTAGAATCAGTAATTT
#>    [4]    19 CATGCGAGAGCGGTATTAT
#>    [5]    19 TTAAGGGCGCACGGTGGAT
#>    ...   ... ...
#> [9979]    19 GCACAAAAAATGACCAAGA
#> [9980]    19 CCGCTCTTCCGATCTTAAG
#> [9981]    19 TATTGATATGCAAGATGAA
#> [9982]    19 TCAGTCAAGCTGATTTAAA
#> [9983]    19 TAAAAGCTACGCACGTTTT

We see that only the sequences from position 14 to 33 are present in the FASTQ output file.

Extraction of the mappable part of valid reads

In addition to what we have done before, we can extract the 5’-ends of the transcripts only for the reads which are valid by testing the validity of each expected elements described earlier.

To do that, we add some features to the read_features table with the add_read_feature function:

rf <- add_read_feature(rf,
                       name = "recognition.seq",
                       start = 1,
                       width = 3,
                       pattern = "AGG",
                       pattern_type = "constant")
rf <- add_read_feature(rf,
                       name = "control.seq",
                       start = 11,
                       width = 3,
                       pattern = "CGC",
                       pattern_type = "constant")
rf %>% 
  print
#> # A tibble: 3 × 7
#> # Rowwise: 
#>   name            start width pattern_type pattern max_mismatch readid_prepend
#>   <chr>           <dbl> <dbl> <chr>        <chr>          <dbl> <lgl>         
#> 1 readseq            14    19 nucleotides  ACTG               0 FALSE         
#> 2 recognition.seq     1     3 constant     AGG                0 FALSE         
#> 3 control.seq        11     3 constant     CGC                0 FALSE

Following the read structure described, we define the start, the width and the expected sequence of the 2 features recognition.seq and control.seq. As these features should have an exact sequence at the given positions, we set the pattern_type parameters = constant

In fact the pattern_type parameters can have 3 values:

  • nucleotides if the pattern is a string of allowed characters

  • constant if the pattern is an exact sequence (or a vector of exact sequence)

  • regexp if the pattern is a regular expression which, once spotted, is trimmed with whatever is behind it.

Then we will add the last feature we want to check which is the UMI:

rf <- add_read_feature(rf,
                       name = "UMI",
                       start = 4,
                       width = 7,
                       pattern = "ACG",
                       pattern_type = "nucleotides",
                       readid_prepend = TRUE)
rf %>% 
  print
#> # A tibble: 4 × 7
#> # Rowwise: 
#>   name            start width pattern_type pattern max_mismatch readid_prepend
#>   <chr>           <dbl> <dbl> <chr>        <chr>          <dbl> <lgl>         
#> 1 readseq            14    19 nucleotides  ACTG               0 FALSE         
#> 2 recognition.seq     1     3 constant     AGG                0 FALSE         
#> 3 control.seq        11     3 constant     CGC                0 FALSE         
#> 4 UMI                 4     7 nucleotides  ACG                0 TRUE

As for other features we set the name, start and width of the feature.
As this feature is a random sequence with a list of allowed characters (A, C or G but not T) we set the pattern_type to nucleotides and the pattern to “ACG”.

We also set readid_prepend to TRUE in order to put the identified UMI in the read identifier (we will check this later).

Now that we have defined all the features, we expect to find along the reads, we can call (as in the previous example) the parse_reads function to extract the readseq subsequence of reads that have all the features valid.

report <- parse_reads(fastq_file  = "../extdata/pre-processing_examples/Example_1.fastq.gz",
                      features = rf,
                      force = TRUE)
report
#> # A tibble: 10 × 5
#>    is_valid_readseq is_valid_recognition.seq is_valid_control.seq is_valid_UMI count
#>    <lgl>            <lgl>                    <lgl>                <lgl>        <int>
#>  1 FALSE            TRUE                     FALSE                FALSE            4
#>  2 FALSE            TRUE                     FALSE                TRUE            10
#>  3 FALSE            TRUE                     TRUE                 TRUE             3
#>  4 TRUE             FALSE                    FALSE                FALSE            7
#>  5 TRUE             FALSE                    FALSE                TRUE            45
#>  6 TRUE             FALSE                    TRUE                 TRUE             1
#>  7 TRUE             TRUE                     FALSE                FALSE          725
#>  8 TRUE             TRUE                     FALSE                TRUE          4296
#>  9 TRUE             TRUE                     TRUE                 FALSE           89
#> 10 TRUE             TRUE                     TRUE                 TRUE          4820

The parse_reads function returns a parse report that provides some statistics about the validity of the features checked.

For example here, we see that among the 10k input reads, 4820 were fully valid while 5163 were invalid due to an invalid feature.

We also see that most of the invalid reads have a valid recognition.seq and a valid UMI but very few reads have a valid control.seq indicating that the main reason for the invalid status of reads is due to the invalidity of the control.seq.

We can summarise these with an upset plot:

report %>%
  reads_parse_report_ggupset()

In the generated FASTQ, if we see that reads only have the part of the read that corresponds to the readseq sequence:

fq_streamer <- FastqStreamer("../extdata/pre-processing_examples/Example_1_demux/Example_1_valid.fastq.gz")
sr <- yield(fq_streamer)
close(fq_streamer)
sr@sread
#> DNAStringSet object of length 4820:
#>        width seq
#>    [1]    19 GAAGTAGAATCAGTAATTT
#>    [2]    19 CATGCGAGAGCGGTATTAT
#>    [3]    19 TTCTTCCTACTAAGAACAC
#>    [4]    19 TCCTCCGCTTATTGATATG
#>    [5]    19 AAACGTTCAGCATTAAGTA
#>    ...   ... ...
#> [4816]    19 ATGTAAGTTATTTAAATAA
#> [4817]    19 CAAAACGTTAAGCGAATAA
#> [4818]    19 GCACAAAAAATGACCAAGA
#> [4819]    19 TCAGTCAAGCTGATTTAAA
#> [4820]    19 TAAAAGCTACGCACGTTTT

In the generated FASTQ, the UMI sequence was added at the beginning of each read id for later use to remove PCR duplicates:

sr@id
#> BStringSet object of length 4820:
#>        width seq
#>    [1]    73 CCAGCGA:SRR2017586.3.1 HWI-ST865:389:HAUHUADXX:1:1101:1718:2167 length=50
#>    [2]    73 CAAGGAA:SRR2017586.4.1 HWI-ST865:389:HAUHUADXX:1:1101:1598:2170 length=50
#>    [3]    73 AGCGAGA:SRR2017586.7.1 HWI-ST865:389:HAUHUADXX:1:1101:1926:2194 length=50
#>    [4]    73 AGGCGAA:SRR2017586.8.1 HWI-ST865:389:HAUHUADXX:1:1101:2348:2120 length=50
#>    [5]    73 ACAGGAG:SRR2017586.9.1 HWI-ST865:389:HAUHUADXX:1:1101:2374:2223 length=50
#>    ...   ... ...
#> [4816]    77 AACAGCA:SRR2017586.9991.1 HWI-ST865:389:HAUHUADXX:1:1101:4623:12299 length=50
#> [4817]    77 GAAGAAA:SRR2017586.9992.1 HWI-ST865:389:HAUHUADXX:1:1101:4633:12362 length=50
#> [4818]    77 GCAAAAA:SRR2017586.9996.1 HWI-ST865:389:HAUHUADXX:1:1101:4779:12322 length=50
#> [4819]    77 CGAGGCA:SRR2017586.9999.1 HWI-ST865:389:HAUHUADXX:1:1101:5003:12298 length=50
#> [4820]    78 GGGCGCG:SRR2017586.10000.1 HWI-ST865:389:HAUHUADXX:1:1101:5227:12298 length=50

Extraction of the mappable part of reads according to a barcode sequence

In order to multiplex multiple experimental conditions into one sequencing run, it is possible to ligate a barcode sequence to the ends of the transcript.

For this example we have 10000 reads that are 24 nucleotides long that should be built that way:

            
        1  4        5                 24
        CAAG/TCGG - XXXXXXXXXXXXXXXXXXXX
         barcode  -   RNA 5'-end (readseq)
  • The 4 first nucleotides correspond to a barcode sequence which should always be either “CAAG” or “TCGG”

  • Then we have the 5’-end of the transcript

fq_streamer <- FastqStreamer("../extdata/pre-processing_examples/Example_2.fastq.gz")
sr <- yield(fq_streamer)
close(fq_streamer)
sr@sread
#> DNAStringSet object of length 10000:
#>         width seq
#>     [1]    24 CAAGTTCAGCAGGAATGCCGAGAC
#>     [2]    24 CAAGCAGCAGGAATGCCGAGACCG
#>     [3]    24 CAAGGAAGTAGAATCAGTAATTTG
#>     [4]    24 CAAGCATGCGAGAGCGGTATTATC
#>     [5]    24 CAAGTTAAGGGCGCACGGTGGATG
#>     ...   ... ...
#>  [9996]    24 TCGGTTAAGTTATTAAGGGCGCAC
#>  [9997]    24 TCGGTAGGATGTTGGCTTAGAAGC
#>  [9998]    24 TCGGTTAAGTTATTAAGGGCGCAC
#>  [9999]    24 TCGGCAGCAGGAATGCCGAGACCG
#> [10000]    24 TCGGTTCAGCAGGAATGCCGAGAC

Now, we build the read_features table in order to extract the mappable part of reads and regroup them into separate output FASTQ files according to their barcode sequence.

rf <- init_read_features(start = 5, width = 20)
rf <- add_read_feature(rf,
                       name = "barcode",
                       start = 1,
                       width = 4,
                       pattern = c("TCGG","CAAG"),
                       pattern_type = "constant",
                       readid_prepend = FALSE)
rf %>% 
  print
#> # A tibble: 2 × 7
#> # Rowwise: 
#>   name    start width pattern_type pattern    max_mismatch readid_prepend
#>   <chr>   <dbl> <dbl> <chr>        <chr>             <dbl> <lgl>         
#> 1 readseq     5    20 nucleotides  ACTG                  0 FALSE         
#> 2 barcode     1     4 constant     TCGG, CAAG            0 FALSE

Note that it is mandatory to name the feature you want to use for demultiplexing: barcode

-the start position and the width are set according to the read structure we defined,

  • the pattern parameter is a vector containing the 2 allowed barcode sequences “TCGG” and “CAAG”,

  • The pattern_type is set to “constant” because the pattern correspond to exact sequences.

Now that we have the read_feature table, we can extract the mappable part of reads that have a valid barcode sequence and split them into different output FASTQ files according to the feature named barcode using the EMOTE_demultiplex function:

report <- demultiplex(
  fastq_file  = "../extdata/pre-processing_examples/Example_2.fastq.gz",
  features = rf,
  force = TRUE
)
report
#> # A tibble: 5 × 5
#>   barcode_group is_valid_readseq is_valid_barcode count demux_filename                                                                  
#>   <chr>         <lgl>            <lgl>            <int> <chr>                                                                           
#> 1 CAAG          FALSE            TRUE                17 <NA>                                                                            
#> 2 CAAG          TRUE             TRUE              4960 ../extdata/pre-processing_examples/Example_2_demux/Example_2_CAAG_valid.fastq.gz
#> 3 INVALID       TRUE             FALSE               27 <NA>                                                                            
#> 4 TCGG          FALSE            TRUE                12 <NA>                                                                            
#> 5 TCGG          TRUE             TRUE              4984 ../extdata/pre-processing_examples/Example_2_demux/Example_2_TCGG_valid.fastq.gz

demultiplex also returns a report with statistics about the validity of features. These statistics are grouped by barcode.

Once again, if we take a look at one of the two output files, we see that we only kept the part we wanted to extract:

fq_streamer = FastqStreamer("../extdata/pre-processing_examples/Example_2_demux/Example_2_TCGG_valid.fastq.gz")
sr <- yield(fq_streamer)
close(fq_streamer)
sr@sread
#> DNAStringSet object of length 4984:
#>        width seq
#>    [1]    20 AGAAAAGCCAGATTTAATTA
#>    [2]    20 TTTAAAGAATTAGATCAAAA
#>    [3]    20 TGAATGACAATATGTCAACG
#>    [4]    20 TTAAGTTATTAAGGGCGCAC
#>    [5]    20 TAAGTTATTAAGGGCGCACG
#>    ...   ... ...
#> [4980]    20 TTAAGTTATTAAGGGCGCAC
#> [4981]    20 TAGGATGTTGGCTTAGAAGC
#> [4982]    20 TTAAGTTATTAAGGGCGCAC
#> [4983]    20 CAGCAGGAATGCCGAGACCG
#> [4984]    20 TTCAGCAGGAATGCCGAGAC

Removal of a pattern from the mappable part of reads

For this example, we have 10000 reads that have no specific structure but due to a step during the experimental protocol, reads can have a poly A sequence that should be removed in order to map on the reference genome.

fq_streamer = FastqStreamer("../extdata/pre-processing_examples/Example_3.fastq.gz")
sr <- yield(fq_streamer)
close(fq_streamer)
sr@sread
#> DNAStringSet object of length 10000:
#>         width seq
#>     [1]    50 TCAGCCAGTGCACGCAAAAAAAAAAAAAAAAAAAAAAAAAAAATCGGAAG
#>     [2]    50 TGGGGGTAGCGAGCTTGTGGGAAGCGGTCTTTGGCGATGTGGGGGTTAAA
#>     [3]    50 TTCGAATCCTACATCTGGAGCCAAAAAAAAAAAAAAAAAAAAAAAAAAAA
#>     [4]    50 ATTATCCAGTCCATGTCTTTGATTATTGCCATGATGTATATTGGGGCTAA
#>     [5]    50 AATTCTTGGTGTAGGGGTAAAATCCGTAGAGATCAAGAGGAAAAAAAAAA
#>     ...   ... ...
#>  [9996]    50 CACCGCCCGTCACACCATGGGAGTTGTGTTTGCCTTAAGTCAGGATGCTA
#>  [9997]    50 CATTAAAGGCGAAGGAGTCTTATACAAAAAAAAAAAAAAAAAAAAAAAAA
#>  [9998]    50 ATTGGAGTGAAGGCAAATCCACCTCTGTATTTGAAAAAAAAAAAAAAAAA
#>  [9999]    50 AGCACCGCTATAGGAACTCAACCTATGGTTCACCTTTGCATCAGCATTGA
#> [10000]    50 CACACCATGGGAGTTGTGTTTGCCCCAAAAAAAAAAAAAAAAAAAAAAAA

We see that several reads have a succession of As that should not map on the genome.

We will thus parse them in order to remove these poly A sequences using parse_read. For this, we first have to build a read_features table:

rf <- init_read_features(start = 1, width = 50)

For this case, the positions of the sequence to extract is variable since the locations of poly A are also variable.

We set the start position to 1 and the width to 50 (the entire read sequence), and the extraction of the mappable part will be done after the removal of the poly A. The pattern is specified as follows:

rf <- add_read_feature(rf,
                       name = "polyA",
                       start = 1,
                       width = 50,
                       pattern = "AAAAAA.*",
                       pattern_type = "regexp",
                       readid_prepend = FALSE)
rf %>% 
  print
#> # A tibble: 2 × 7
#> # Rowwise: 
#>   name    start width pattern_type pattern  max_mismatch readid_prepend
#>   <chr>   <dbl> <dbl> <chr>        <chr>           <dbl> <lgl>         
#> 1 readseq     1    50 nucleotides  ACTG                0 FALSE         
#> 2 polyA       1    50 regexp       AAAAAA.*            0 FALSE

As the poly A sequence can be found anywhere on the reads we set the start position to 1 and the width to 50.

The pattern corresponds to a regular expression: here, we specify a regular expression of minimum 6 A followed anything.

The pattern being a regular expression to identity and remove with whatever is behind it, we set the pattern_type to “regexp”.

With this read_features table we can perform an parse_read:

report = parse_reads(
            fastq_file  = "../extdata/pre-processing_examples/Example_3.fastq.gz",
            features = rf,
            force = TRUE)
report
#> # A tibble: 2 × 2
#>   is_valid_readseq count
#>   <lgl>            <int>
#> 1 FALSE             1788
#> 2 TRUE              8212

The report indicates that among the 10000 reads, only 8212 were valid due to the validity of readseq.

Note that the invalidity of readseq may also be due to too short read after the removal of the poly A.

fq_streamer = FastqStreamer("../extdata/pre-processing_examples/Example_3_demux/Example_3_valid.fastq.gz")
sr <- yield(fq_streamer)
close(fq_streamer)
sr@sread
#> DNAStringSet object of length 8212:
#>        width seq
#>    [1]    50 TGGGGGTAGCGAGCTTGTGGGAAGCGGTCTTTGGCGATGTGGGGGTTAAA
#>    [2]    22 TTCGAATCCTACATCTGGAGCC
#>    [3]    50 ATTATCCAGTCCATGTCTTTGATTATTGCCATGATGTATATTGGGGCTAA
#>    [4]    40 AATTCTTGGTGTAGGGGTAAAATCCGTAGAGATCAAGAGG
#>    [5]    50 AATGGGGTGCACAAAGAGAAGCAATACTGCGAAGTGGAGCCAATCTTCAA
#>    ...   ... ...
#> [8208]    50 CACCGCCCGTCACACCATGGGAGTTGTGTTTGCCTTAAGTCAGGATGCTA
#> [8209]    25 CATTAAAGGCGAAGGAGTCTTATAC
#> [8210]    33 ATTGGAGTGAAGGCAAATCCACCTCTGTATTTG
#> [8211]    50 AGCACCGCTATAGGAACTCAACCTATGGTTCACCTTTGCATCAGCATTGA
#> [8212]    26 CACACCATGGGAGTTGTGTTTGCCCC

As we can see, poly A are removed from the reads, generating reads of different lengths.

It is possible to set a minimal length allowed in the parse_reads function (default is 18).