UMI Barcoded Illumina MiSeq 325+275 paired-end 5’RACE BCR mRNA
Overview of Experimental Data
This example uses publicly available data from:
Dysregulation of B Cell Repertoire Formation in Myasthenia Gravis Patients Revealed through Deep Sequencing.Vander Heiden JA, et al.J Immunol. 2017 198(4):1460-1473. doi:10.4049/jimmunol.1601415.
which may be downloaded from the NCBI Sequence Read Archive under BioProject accession ID: PRJNA248475. For this example, we will use the first 25,000 sequences of sample HD09_N_AB8KB (accession: SRR4026043), which may downloaded using fastq-dump from the SRA Toolkit:
fastq-dump --split-files -X 25000 SRR4026043
Primer sequences are available online from the protocols/AbSeq directory in the Immcantation repository.
Read Configuration
Schematic of the Illumina MiSeq 325+275 paired-end reads with UMI barcodes. Each read was sequenced from one end of the target cDNA so that the two reads together cover the entire variable region of the Ig heavy chain. Sequencing was performed using an uneven number of cycles for read 1 and read 2 using a 2x300 kit. The V(D)J reading frame proceeds from the start of read 2 to the start of read 1. Read 1 is in the opposite orientation (reverse complement), contains a partial C-region, and is 325 nucleotides in length. Read 2 contains the 5’RACE template switch site with a 17 nucleotide UMI barcode preceding it, and is 275 nucleotides in length.
Example Data
We have hosted a small subset of the data (Accession: SRR1383456) on the pRESTO website in FASTQ format with accompanying primer files. The sample data set and workflow script may be downloaded from here:
Overview of the Workflow
In the following sections, we demonstrate each step of the workflow to move from raw sequence reads to a fully annotated repertoire of complete V(D)J sequences. The workflow is divided into four high-level tasks:
A graphical representation of the workflow along with the corresponding sequence of pRESTO commands is shown below.
Flowchart
Flowchart of processing steps. Each pRESTO tool is shown as a colored box. The workflow is divided into four primary tasks: (red) quality control, UMI annotation and primer masking; (orange) generation of UMI consensus sequences; (green) paired-end assembly of UMI consensus sequences; and (blue) deduplication and filtering to obtain the high-fidelity repertoire. Grey boxes indicate the initial and final data files. The intermediate files output by each tool are not shown for the sake of brevity.
Commands
1#!/usr/bin/env bash
2FilterSeq.py quality -s SRR4026043_1.fastq -q 20 --outname HD09N-R1 --log FS1.log
3FilterSeq.py quality -s SRR4026043_2.fastq -q 20 --outname HD09N-R2 --log FS2.log
4MaskPrimers.py score -s HD09N-R1_quality-pass.fastq -p AbSeq_R1_Human_IG_Primers.fasta \
5 --start 0 --mode cut --outname HD09N-R1 --log MP1.log
6MaskPrimers.py score -s HD09N-R2_quality-pass.fastq -p AbSeq_R2_TS.fasta \
7 --start 17 --barcode --mode cut --maxerror 0.5 --outname HD09N-R2 --log MP2.log
8PairSeq.py -1 HD09N-R1_primers-pass.fastq -2 HD09N-R2_primers-pass.fastq \
9 --2f BARCODE --coord sra
10BuildConsensus.py -s HD09N-R1_primers-pass_pair-pass.fastq --bf BARCODE --pf PRIMER \
11 --prcons 0.6 --maxerror 0.1 --maxgap 0.5 --outname HD09N-R1 --log BC1.log
12BuildConsensus.py -s HD09N-R2_primers-pass_pair-pass.fastq --bf BARCODE \
13 --maxerror 0.1 --maxgap 0.5 --outname HD09N-R2 --log BC2.log
14PairSeq.py -1 HD09N-R1_consensus-pass.fastq -2 HD09N-R2_consensus-pass.fastq \
15 --coord presto
16AssemblePairs.py sequential -1 HD09N-R2_consensus-pass_pair-pass.fastq \
17 -2 HD09N-R1_consensus-pass_pair-pass.fastq -r IMGT_Human_IG_V.fasta \
18 --coord presto --rc tail --scanrev --1f CONSCOUNT --2f CONSCOUNT PRCONS \
19 --aligner blastn --outname HD09N-C --log AP.log
20MaskPrimers.py align -s HD09N-C_assemble-pass.fastq \
21 -p AbSeq_Human_IG_InternalCRegion.fasta --maxlen 100 --maxerror 0.3 \
22 --mode tag --revpr --skiprc --pf CREGION --outname HD09N-C --log MP3.log
23ParseHeaders.py collapse -s HD09N-C_primers-pass.fastq -f CONSCOUNT --act min
24CollapseSeq.py -s HD09N-C_primers-pass_reheader.fastq -n 20 --inner \
25 --uf CREGION --cf CONSCOUNT --act sum --outname HD09N-C
26SplitSeq.py group -s HD09N-C_collapse-unique.fastq \
27 -f CONSCOUNT --num 2 --outname HD09N-C
28ParseHeaders.py table -s HD09N-C_atleast-2.fastq -f ID CREGION CONSCOUNT DUPCOUNT
29ParseLog.py -l FS1.log FS2.log -f ID QUALITY
30ParseLog.py -l MP1.log MP2.log MP3.log -f ID PRIMER BARCODE ERROR
31ParseLog.py -l BC1.log BC2.log -f BARCODE SEQCOUNT CONSCOUNT PRCONS PRFREQ ERROR
32ParseLog.py -l AP.log -f ID REFID LENGTH OVERLAP GAP ERROR IDENTITY
Quality control, UMI annotation and primer masking
Removal of low quality reads
Quality control begins with the identification and removal of
low-quality reads using the quality subcommand of the
FilterSeq.py tool. In this example, reads with mean Phred quality scores
less than 20 (-q 20
) are removed:
2FilterSeq.py quality -s SRR4026043_1.fastq -q 20 --outname HD09N-R1 --log FS1.log
3FilterSeq.py quality -s SRR4026043_2.fastq -q 20 --outname HD09N-R2 --log FS2.log
The ParseLog.py tool is then used to extract results from the FilterSeq.py logs into tab-delimited files:
29ParseLog.py -l FS1.log FS2.log -f ID QUALITY
Extracting the following information from the log:
Field |
Description |
---|---|
ID |
Sequence name |
QUALITY |
Quality score |
UMI annotation and removal of primer regions
Next, the score subcommand of MaskPrimers.py is
used to identify and remove the PCR primers for both reads. When
dealing with Ig sequences, it is important to cut or mask the primers,
as B cell receptors are subject to somatic hypermutation (the
accumulation of point mutations in the DNA) and degenerate primer
matches can look like mutations in downstream applications. The
MaskPrimers tool is also used to annotate each read 2 sequence
with the 17 nucleotide UMI that precedes the template switch site.
(MaskPrimers score --barcode
):
4MaskPrimers.py score -s HD09N-R1_quality-pass.fastq -p AbSeq_R1_Human_IG_Primers.fasta \
5 --start 0 --mode cut --outname HD09N-R1 --log MP1.log
6MaskPrimers.py score -s HD09N-R2_quality-pass.fastq -p AbSeq_R2_TS.fasta \
7 --start 17 --barcode --mode cut --maxerror 0.5 --outname HD09N-R2 --log MP2.log
To summarize these steps, the ParseLog.py tool is used to build a tab-delimited file from the MaskPrimers.py log:
30ParseLog.py -l MP1.log MP2.log MP3.log -f ID PRIMER BARCODE ERROR
Containing the following information:
Field |
Description |
---|---|
ID |
Sequence name |
PRIMER |
Primer name |
BARCODE |
UMI sequence |
ERROR |
Primer match error rate |
Note
For this data, we are using the 5’RACE template switch sequences
as proxy primers. We’ve set the match error rate extremely high
(--maxerror 0.5
)
because an accurate match isn’t important. Mostly, we are just concerned
with extracting the UMI barcode that precedes the template switch site.
Generation of UMI consensus sequences
Copying the UMI annotation across paired-end files
In this task, a single consensus sequence is constructed for each set of
reads annotated with the same UMI barcode. As the UMI barcode is part of
read 2, the BARCODE
annotation identified by MaskPrimers.py must
first be copied to the read 1 mate-pair of each read 2
sequence. Propogation of annotations between mate pairs is performed
using PairSeq.py which also removes
unpaired reads and ensures that paired reads are sorted in the same
order across files:
8PairSeq.py -1 HD09N-R1_primers-pass.fastq -2 HD09N-R2_primers-pass.fastq \
9 --2f BARCODE --coord sra
Note
For both the PairSeq.py and AssemblePairs.py commands using the
correct --coord
argument is critical
for matching mate-pairs. If this was raw data from Illumina, rather than
data downloaded from SRA, then the appropriate argument would be
--coord illumina
.
Multiple alignment of UMI read groups
Before generating a consensus for a set of reads sharing a UMI barcode, the sequences must be properly aligned. Sequences may not be aligned if more than one PCR primer is identified in a UMI read group - leading to variations in the the start positions of the reads. Ideally, each set of reads originating from a single mRNA molecule should be amplified with the same primer. However, different primers in the multiplex pool may be incorporated into the same UMI read group during amplification if the primers are sufficiently similar. This type of primer misalignment can be corrected using the AlignSets.py tool.
This step is not required for this data, but if necessary for your data see the section on fixing UMI problems.
Generating UMI consensus reads
After alignment, a single consensus sequence is generated for each UMI barcode using BuildConsensus.py:
10BuildConsensus.py -s HD09N-R1_primers-pass_pair-pass.fastq --bf BARCODE --pf PRIMER \
11 --prcons 0.6 --maxerror 0.1 --maxgap 0.5 --outname HD09N-R1 --log BC1.log
12BuildConsensus.py -s HD09N-R2_primers-pass_pair-pass.fastq --bf BARCODE \
13 --maxerror 0.1 --maxgap 0.5 --outname HD09N-R2 --log BC2.log
To correct for UMI chemistry and sequencing errors, UMI read groups having
high error statistics (mismatch rate from consensus) are removed by
specifiying the --maxerror 0.1
threshold. Additional filtering of read 1 is carried out
during this step by specifying the --prcons 0.6
threshold which: (a) removes individual sequences that do not share a common primer annotation with
the majority of the set, (b) removes entire read groups which have
ambiguous primer assignments, and (c) constructs a consensus primer
assignment for each UMI.
Note
The --maxgap 0.5
argument tells
BuildConsensus.py to use a majority rule to delete any gap positions
which occur in more than 50% of the reads. The --maxgap
argument is not really necessary for this example data set as we did not perform
a multiple alignment of the UMI read groups. However, if you have performed an
alignment, then use of --maxgap
during consensus
generation is highly recommended.
The ParseLog.py tool is then used to build a tab-delimited file contain the consensus results:
31ParseLog.py -l BC1.log BC2.log -f BARCODE SEQCOUNT CONSCOUNT PRCONS PRFREQ ERROR
With the following annotations:
Field |
Description |
---|---|
BARCODE |
UMI sequence |
SEQCOUNT |
Number of total reads in the UMI group |
CONSCOUNT |
Number of reads used for the UMI consensus |
PRCONS |
Consensus primer name |
PRFREQ |
Frequency of primers in the UMI group |
ERROR |
Average mismatch rate from consensus |
Paired-end assembly of UMI consensus sequences
Syncronizing paired-end files
Following UMI consensus generation, the read 1 and read 2 files may again be out of sync due to differences in UMI read group filtering by BuildConsensus.py. To synchronize the reads another instance of PairSeq.py must be run, but without any annotation manipulation:
14PairSeq.py -1 HD09N-R1_consensus-pass.fastq -2 HD09N-R2_consensus-pass.fastq \
15 --coord presto
Assembling UMI consensus mate-pairs
Once the files have been synchronized, each paired-end UMI consensus
sequence is assembled into a full length Ig sequence using the
sequential subcommand of AssemblePairs.py.
5’RACE creates long amplicons which are not guaranteed to overlap when sequenced
using a 2x300 kit. The sequential subcommand will first attempt de
novo paired-end assembly. If that approach fails, it will attempt assembly guided
by the V segment reference sequences specified by the
-r
argument. Mate-pairs that fail to
overlap can thus be assembled, with the full length sequence containing an
appropriate number of intervening between the two reads.
16AssemblePairs.py sequential -1 HD09N-R2_consensus-pass_pair-pass.fastq \
17 -2 HD09N-R1_consensus-pass_pair-pass.fastq -r IMGT_Human_IG_V.fasta \
18 --coord presto --rc tail --scanrev --1f CONSCOUNT --2f CONSCOUNT PRCONS \
19 --aligner blastn --outname HD09N-C --log AP.log
During assembly, the consensus primer annotation (PRCONS
) from read 1
and the number of reads used to define the consensus sequence (CONSCOUNT
)
for both reads are propagated into the annotations of the full length Ig sequence
(--1f CONSCOUNT --2f CONSCOUNT PRCONS
.
ParseLog.py is then uses to extract the results from the AssemblePairs.py log into a tab-delimited file:
32ParseLog.py -l AP.log -f ID REFID LENGTH OVERLAP GAP ERROR IDENTITY
Containing the following information:
ID REFID LENGTH OVERLAP GAP ERROR IDENTITY
Field |
Description |
---|---|
ID |
Sequence name (UMI) |
REFID |
The reference sequence name for reference guided assembly |
LENGTH |
Length of the assembled sequence |
OVERLAP |
Length of the overlap between mate-pairs |
GAP |
Length of the gap between mate-pairs |
ERROR |
Mismatch rate of the overlapping region from de novo assembly |
IDENTITY |
Identity score for the aligned region in reference guided assembly |
Deduplication and filtering
Annotating sequences with the constant region
In the final stage of the workflow, the high-fidelity Ig repertoire is obtained by a series of filtering steps. First, an additional annotation defining the constant region is added to each sequence by aligning against the constant region sequence internal to the primer. This step is not strictly necessary, as the constant region primer is usually sufficient for isotype calls. However, primer calls may be inaccurate with some primer sets, so this additional confirmatory step is beneficial.
Internal constant regions are assigned using the align subcommand
of MaskPrimers.py with a set of manually constructed sequences as the
“primers” (-p AbSeq_Human_IG_InternalCRegion.fasta
):
20MaskPrimers.py align -s HD09N-C_assemble-pass.fastq \
21 -p AbSeq_Human_IG_InternalCRegion.fasta --maxlen 100 --maxerror 0.3 \
22 --mode tag --revpr --skiprc --pf CREGION --outname HD09N-C --log MP3.log
Here we use a high error rate for the match
(--maxerror 0.3
), restrict the match
to 100 nucleotides at the end of the sequence
(--maxlen 100 --revpr --skiprc
), and
use mode the argument --mode tag
to specify
that the input sequence should not be modified. The aligned C-region name
will be added to each sequence headers in the CREGION
field
(--pf CREGION
):
Combining UMI read group size annotations
The annotation specifying the number of raw reads used to build each sequence
(-f CONSCOUNT
) is updated to be the
minimum (--act min
) of the
forward and reverse reads using the
collapse subcommand of ParseHeaders.py:
23ParseHeaders.py collapse -s HD09N-C_primers-pass.fastq -f CONSCOUNT --act min
Removal of duplicate sequences
Second, duplicate nucleotide sequences are removed using the CollapseSeq.py
tool with the requirement that duplicate sequences share the same
constant region annotation (--uf CREGION
). The duplicate removal
step also removes sequences with a high number of interior N-valued nucleotides
(-n 20
and --inner
)
and combines the read counts for each UMI read group
(--cf CONSCOUNT
and --act sum
).
24CollapseSeq.py -s HD09N-C_primers-pass_reheader.fastq -n 20 --inner \
25 --uf CREGION --cf CONSCOUNT --act sum --outname HD09N-C
Filtering to sequences with at least two representative reads
Finally, unique sequences are filtered to those with at least 2
contributing sequences using the group subcommand of SplitSeq.py,
by splitting the file on the CONSCOUNT
annotation with a numeric threshold
(-f CONSCOUNT
and --num 2
):
26SplitSeq.py group -s HD09N-C_collapse-unique.fastq \
27 -f CONSCOUNT --num 2 --outname HD09N-C
Creating an annotation table
For further analysis, the annotations of the final repertoire are then converted to into a table using the table subcommand of ParseHeaders.py:
28ParseHeaders.py table -s HD09N-C_atleast-2.fastq -f ID CREGION CONSCOUNT DUPCOUNT
Output files
The final set of sequences, which serve as input to a V(D)J reference aligner (Eg, IMGT/HighV-QUEST or IgBLAST), and tables that can be plotted for quality control are:
File |
Description |
---|---|
HD09N-C_collapse-unique.fastq |
Total unique sequences |
HD09N-C_atleast-2.fastq |
Unique sequences represented by at least 2 reads |
HD09N-C_atleast-2_headers.tab |
Annotation table of the atleast-2 file |
FS1_table.tab |
Table of the read 1 FilterSeq log |
FS2_table.tab |
Table of the read 2 FilterSeq log |
MP1_table.tab |
Table of the C-region primer MaskPrimers log |
MP2_table.tab |
Table of the TS site MaskPrimers log |
MP3_table.tab |
Table of the internal C-region MaskPrimers log |
BC1_table.tab |
Table of the read 1 BuildConsensus log |
BC2_table.tab |
Table of the read 2 BuildConsensus log |
AP_table.tab |
Table of the AssemblePairs log |
A number of other intermediate and log files are generated during the workflow, which allows easy tracking/reversion of processing steps. These files are not listed in the table above.