UMI Barcoded Illumina MiSeq 2x250 BCR mRNA¶
Overview of Experimental Data¶
This example uses publicly available data from:
B cells populating the multiple sclerosis brain mature in the draining cervical lymph nodes.Stern JNH, Yaari G, and Vander Heiden JA, et al.Sci Transl Med. 2014. 6(248):248ra107. doi:10.1126/scitranslmed.3008879.
Which may be downloaded from the NCBI Sequence Read Archive under BioProject accession ID: PRJNA248475. For this example, we will use the first 25,000 sequences of sample M12 (accession: SRR1383456), which may downloaded downloaded using fastq-dump from the SRA Toolkit:
fastq-dump --split-files -X 25000 SRR1383456
Primers sequences are available online at the supplemental website for the publication.
Read Configuration¶
Example Data¶
We have hosted a small subset of the data (Accession: SRR1383456) on the pRESTO website in FASTQ format with accompanying primer files. The sample data set and workflow script may be downloaded from here:
Overview of the Workflow¶
In the following sections, we demonstrate each step of the workflow to move from raw sequence reads to a fully annotated repertoire of complete V(D)J sequences. The workflow is divided into four high-level tasks:
A graphical representation of the workflow along with the corresponding sequence of pRESTO commands is shown below.
Flowchart¶
Commands¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 | #!/usr/bin/env bash
FilterSeq.py quality -s SRR1383456_1.fastq -q 20 --outname MS12_R1 --log FS1.log
FilterSeq.py quality -s SRR1383456_2.fastq -q 20 --outname MS12_R2 --log FS2.log
MaskPrimers.py score -s MS12_R1_quality-pass.fastq -p Stern2014_CPrimers.fasta \
--start 15 --mode cut --barcode --outname MS12_R1 --log MP1.log
MaskPrimers.py score -s MS12_R2_quality-pass.fastq -p Stern2014_VPrimers.fasta \
--start 0 --mode mask --outname MS12_R2 --log MP2.log
PairSeq.py -1 MS12_R1_primers-pass.fastq -2 MS12_R2_primers-pass.fastq \
--1f BARCODE --coord sra
BuildConsensus.py -s MS12_R1_primers-pass_pair-pass.fastq --bf BARCODE --pf PRIMER \
--prcons 0.6 --maxerror 0.1 --maxgap 0.5 --outname MS12_R1 --log BC1.log
BuildConsensus.py -s MS12_R2_primers-pass_pair-pass.fastq --bf BARCODE --pf PRIMER \
--maxerror 0.1 --maxgap 0.5 --outname MS12_R2 --log BC2.log
PairSeq.py -1 MS12_R1_consensus-pass.fastq -2 MS12_R2_consensus-pass.fastq \
--coord presto
AssemblePairs.py align -1 MS12_R2_consensus-pass_pair-pass.fastq \
-2 MS12_R1_consensus-pass_pair-pass.fastq --coord presto --rc tail \
--1f CONSCOUNT --2f CONSCOUNT PRCONS --outname MS12 --log AP.log
ParseHeaders.py collapse -s MS12_assemble-pass.fastq -f CONSCOUNT --act min
CollapseSeq.py -s MS12*reheader.fastq -n 20 --inner --uf PRCONS \
--cf CONSCOUNT --act sum --outname MS12
SplitSeq.py group -s MS12_collapse-unique.fastq -f CONSCOUNT --num 2 --outname MS12
ParseHeaders.py table -s MS12_atleast-2.fastq -f ID PRCONS CONSCOUNT DUPCOUNT
ParseLog.py -l FS1.log FS2.log -f ID QUALITY
ParseLog.py -l MP1.log MP2.log -f ID PRIMER BARCODE ERROR
ParseLog.py -l BC1.log BC2.log -f BARCODE SEQCOUNT CONSCOUNT PRIMER PRCONS PRCOUNT \
PRFREQ ERROR
ParseLog.py -l AP.log -f ID LENGTH OVERLAP ERROR PVALUE FIELDS1 FIELDS2
|
Quality control, UMI annotation and primer masking¶
Removal of low quality reads¶
Quality control begins with the identification and removal of
low-quality reads using the quality subcommand of the
FilterSeq tool. In this example, reads with mean Phred quality scores
less than 20 (-q 20
) are removed:
2 3 | FilterSeq.py quality -s SRR1383456_1.fastq -q 20 --outname MS12_R1 --log FS1.log
FilterSeq.py quality -s SRR1383456_2.fastq -q 20 --outname MS12_R2 --log FS2.log
|
The ParseLog tool is then used to extract results from the FilterSeq logs into tab-delimited files:
24 | ParseLog.py -l FS1.log FS2.log -f ID QUALITY
|
Extracting the following information from the log:
Field |
Description |
---|---|
ID |
Sequence name |
QUALITY |
Quality score |
UMI annotation and masking of primer regions¶
Next, the score subcommand of MaskPrimers is
used to identify and remove the PCR primers for both reads. When
dealing with Ig sequences, it is important to cut or mask the primers,
as B cell receptors are subject to somatic hypermutation (the
accumulation of point mutations in the DNA) and degenerate primer
matches can look like mutations in downstream applications. The
MaskPrimers tool is also used to annotate each read 1 sequence
with the 15 nucleotide UMI that precedes the C-region primer
(MaskPrimers score --barcode
):
4 5 6 7 | MaskPrimers.py score -s MS12_R1_quality-pass.fastq -p Stern2014_CPrimers.fasta \
--start 15 --mode cut --barcode --outname MS12_R1 --log MP1.log
MaskPrimers.py score -s MS12_R2_quality-pass.fastq -p Stern2014_VPrimers.fasta \
--start 0 --mode mask --outname MS12_R2 --log MP2.log
|
To summarize these steps, the ParseLog tool is used to build a tab-delimited file from the MaskPrimers log:
25 | ParseLog.py -l MP1.log MP2.log -f ID PRIMER BARCODE ERROR
|
Containing the following information:
Field |
Description |
---|---|
ID |
Sequence name |
PRIMER |
Primer name |
BARCODE |
UMI sequence |
ERROR |
Primer match error rate |
Note
For this data set the UMI is immediately upstream of the C-region primer.
Another common approach for UMI barcoding involves placing the UMI
immediately upstream of a 5’RACE template switch site. Modifying the
workflow is simple for this case. You just need to replace the V-segment
primers with a fasta file containing the TS sequences and move the
--barcode
argument to the
appropriate read:
MaskPrimers.py score -s R1_quality-pass.fastq -p CPrimers.fasta \
--start 0 --mode cut --outname R1 --log MP1.log
MaskPrimers.py score -s R2_quality-pass.fastq -p TSSites.fasta \
--start 17 --barcode --mode cut --maxerror 0.5 \
--outname R2 --log MP2.log
In the above we have moved the UMI annotation to read 2, increased
the allowable error rate for matching the TS site
(--maxerror 0.5
),
cut the TS site (--mode cut
),
and increased the size of the UMI from 15 to 17 nucleotides
(--start 17
).
Generation of UMI consensus sequences¶
Copying the UMI annotation across paired-end files¶
In this task, a single consensus sequence is constructed for each set of
reads annotated with the same UMI barcode. As the UMI barcode is part of
read 1, the BARCODE
annotation identified by MaskPrimers must
first be copied to the read 2 mate-pair of each read 1
sequence. Propogation of annotations between mate pairs is performed
using PairSeq which also removes
unpaired reads and ensures that paired reads are sorted in the same
order across files:
8 9 | PairSeq.py -1 MS12_R1_primers-pass.fastq -2 MS12_R2_primers-pass.fastq \
--1f BARCODE --coord sra
|
Note
For both the PairSeq and AssemblePairs commands using the
correct --coord
argument is critical
for matching mate-pairs. If this was raw data from Illumina, rather than
data downloaded from SRA/ENA, then the appropriate argument would be
--coord illumina
.
Note
If you have followed the 5’RACE modification above, then you must also
modify the first PairSeq step to copy the UMI from read 2 to read 1,
instead of vice versa (--2f BARCODE
):
PairSeq.py -1 R1_primers-pass.fastq -2 R2_primers-pass.fastq \
--2f BARCODE --coord sra
Multiple alignment of UMI read groups¶
Before generating a consensus for a set of reads sharing a UMI barcode,
the sequences must be properly aligned. Sequences may not be aligned if
more than one PCR primer is identified in a UMI read group - leading to
variations in the the start positions of the reads. Ideally, each set of
reads originating from a single mRNA molecule should be amplified with
the same primer. However, different primers in the multiplex pool may be
incorporated into the same UMI read group during amplification if the
primers are sufficiently similar. This type of primer misalignment can
be corrected using the AlignSets tool. In the example data used here,
this step was not necessary due to the aligned primer design for the 45
V-segment primers, though this does require that the V-segment primers be
masked, rather than cut, during the MaskPrimers step
(--mode mask
).
See also
If your data requires alignment, then you can create multiple aligned UMI read groups as follows:
AlignSets.py muscle -s R1_primers-pass_pair-pass.fastq --bf BARCODE \
--exec ~/bin/muscle --outname R1 --log AS1.log
AlignSets.py muscle -s R2_primers-pass_pair-pass.fastq --bf BARCODE \
--exec ~/bin/muscle --outname R2 --log AS2.log
Where the --bf BARCODE
defines the field
containing the UMI and --exec ~/bin/muscle
is the location of the MUSCLE executable.
For additional details see the section on fixing UMI alignments.
Generating UMI consensus reads¶
After alignment, a single consensus sequence is generated for each UMI barcode using BuildConsensus:
10 11 12 13 | BuildConsensus.py -s MS12_R1_primers-pass_pair-pass.fastq --bf BARCODE --pf PRIMER \
--prcons 0.6 --maxerror 0.1 --maxgap 0.5 --outname MS12_R1 --log BC1.log
BuildConsensus.py -s MS12_R2_primers-pass_pair-pass.fastq --bf BARCODE --pf PRIMER \
--maxerror 0.1 --maxgap 0.5 --outname MS12_R2 --log BC2.log
|
To correct for UMI chemistry and sequencing errors, UMI read groups having
high error statistics (mismatch rate from consensus) are removed by
specifiying the --maxerror 0.1
threshold. As the accuracy of the primer assignment in read 1 is critical
for correct isotype identification, additional filtering of read 1 is carried out
during this step. Specifying the --prcons 0.6
threshold: (a) removes individual sequences that do not share a common primer annotation with
the majority of the set, (b) removes entire read groups which have
ambiguous primer assignments, and (c) constructs a consensus primer
assignment for each UMI.
Note
The --maxgap 0.5
argument tells
BuildConsensus to use a majority rule to delete any gap positions
which occur in more than 50% of the reads. The --maxgap
argument is not really necessary for this example data set as we did not perform
a multiple alignment of the UMI read groups. However, if you have performed an
alignment, then use of --maxgap
during consensus
generation is highly recommended.
The ParseLog tool is then used to build a tab-delimited file contain the consensus results:
26 | ParseLog.py -l BC1.log BC2.log -f BARCODE SEQCOUNT CONSCOUNT PRIMER PRCONS PRCOUNT \
|
With the following annotations:
Field |
Description |
---|---|
BARCODE |
UMI sequence |
SEQCOUNT |
Number of total reads in the UMI group |
CONSCOUNT |
Number of reads used for the UMI consensus |
PRIMER |
Set of primer names in the UMI group |
PRCONS |
Consensus primer name |
PRCOUNT |
Count of primers in the UMI group |
PRFREQ |
Frequency of primers in the UMI group |
ERROR |
Average mismatch rate from consensus |
Paired-end assembly of UMI consensus sequences¶
Syncronizing paired-end files¶
Following UMI consensus generation, the read 1 and read 2 files may again be out of sync due to differences in UMI read group filtering by BuildConsensus. To synchronize the reads another instance of PairSeq must be run, but without any annotation manipulation:
14 15 | PairSeq.py -1 MS12_R1_consensus-pass.fastq -2 MS12_R2_consensus-pass.fastq \
--coord presto
|
Assembling UMI consensus mate-pairs¶
Once the files have been synchronized, each paired-end UMI consensus sequence is assembled into a full length Ig sequence using the align subcommand of AssemblePairs:
16 17 18 | AssemblePairs.py align -1 MS12_R2_consensus-pass_pair-pass.fastq \
-2 MS12_R1_consensus-pass_pair-pass.fastq --coord presto --rc tail \
--1f CONSCOUNT --2f CONSCOUNT PRCONS --outname MS12 --log AP.log
|
During assembly, the consensus isotype annotation (PRCONS
) from read 1
and the number of reads used to define the consensus sequence (CONSCOUNT
)
for both reads are propagated into the annotations of the full length Ig sequence
(--1f CONSCOUNT --2f CONSCOUNT PRCONS
.
ParseLog is then uses to extract the results from the AssemblePairs log into a tab-delimited file:
27 | PRFREQ ERROR
|
Containing the following information:
Field |
Description |
---|---|
ID |
Sequence name (UMI) |
LENGTH |
Length of the assembled sequence |
OVERLAP |
Length of the overlap between mate-pairs |
ERROR |
Mismatch rate of the overlapping region |
PVALUE |
P-value for the assembly |
FIELDS1 |
Annotations copied from read 2 into the assembled sequence |
FIELDS2 |
Annotations copied from read 1 into the assembled sequence |
See also
Depending on the amplicon length in your data, not all mate-pairs may overlap. For the sake of simplicity, we have excluded a demonstration of assembly in such cases. pRESTO provides a couple approaches to deal with such reads. The reference subcommand of AssemblePairs can use the ungapped V-segment reference sequences to properly space non-overlapping reads. Or, if all else fails, the join subcommand can be used to simply stick mate-pairs together end-to-end with some intervening gap.
Deduplication and filtering¶
Combining UMI read group size annotations¶
In the final stage of the workflow, the high-fidelity Ig repertoire is
obtained by a series of filtering steps. First, the annotation
specifying the number of raw reads used to build each sequence
(-f CONSCOUNT
) is updated to be the
minimum (--act min
) of the
forward and reverse reads using the
collapse subcommand of ParseHeaders:
19 | ParseHeaders.py collapse -s MS12_assemble-pass.fastq -f CONSCOUNT --act min
|
Removal of duplicate sequences¶
Second, duplicate nucleotide sequences are removed using the CollapseSeq
tool with the requirement that duplicate sequences share the same
isotype primer (--uf PRCONS
). The duplicate removal
step also removes sequences with a high number of interior N-valued nucleotides
(-n 20
and --inner
)
and combines the read counts for each UMI read group
(--cf CONSCOUNT
and --act sum
).
20 21 | CollapseSeq.py -s MS12*reheader.fastq -n 20 --inner --uf PRCONS \
--cf CONSCOUNT --act sum --outname MS12
|
Filtering to sequences with at least two representative reads¶
Finally, unique sequences are filtered to those with at least 2
contributing sequences using the group subcommand of SplitSeq,
by splitting the file on the CONSCOUNT
annotation with a numeric threshold
(-f CONSCOUNT
and --num 2
):
22 | SplitSeq.py group -s MS12_collapse-unique.fastq -f CONSCOUNT --num 2 --outname MS12
|
Creating an annotation table¶
For further analysis, the annotations of the final repertoire are then converted to into a table using the table subcommand of ParseHeaders:
23 | ParseHeaders.py table -s MS12_atleast-2.fastq -f ID PRCONS CONSCOUNT DUPCOUNT
|
Output files¶
The final set of sequences, which serve as input to a V(D)J reference aligner (Eg, IMGT/HighV-QUEST or IgBLAST), and tables that can be plotted for quality control are:
File |
Description |
---|---|
M12_collapse-unique.fastq |
Total unique sequences |
M12_atleast-2.fastq |
Unique sequences represented by at least 2 reads |
M12_atleast-2_headers.tab |
Annotation table of the atleast-2 file |
FS1_table.tab |
Table of the read 1 FilterSeq log |
FS2_table.tab |
Table of the read 2 FilterSeq log |
MP1_table.tab |
Table of the C-region MaskPrimers log |
MP2_table.tab |
Table of the V-segment MaskPrimers log |
BC1_table.tab |
Table of the read 1 BuildConsensus log |
BC2_table.tab |
Table of the read 2 BuildConsensus log |
AP_table.tab |
Table of the AssemblePairs log |
A number of other intermediate and log files are generated during the workflow, which allows easy tracking/reversion of processing steps. These files are not listed in the table above.
Performance¶
Example performance statistics for a comparable, but larger, MiSeq workflow are presented below. Performance was measured on a 64-core system with 2.3GHz AMD Opteron(TM) 6276 processors and 512GB of RAM, with memory usage measured at peak utilization. The data set contained 1,723,558 x 2 raw reads, and required matching of 1 C-region primer, 45 V-segment primers, and averaged 24.3 reads per UMI.
Line |
Tool |
Reads |
Cores |
MB |
Minutes |
---|---|---|---|---|---|
01 |
R1: FilterSeq.py quality |
1,723,558 |
10 |
1,219 |
13.0 |
02 |
R2: FilterSeq.py quality |
1,723,558 |
10 |
1,219 |
12.8 |
03 |
R1: MaskPrimers.py score |
1,722,116 |
10 |
1,221 |
18.9 |
04 |
R2: MaskPrimers.py score |
1,684,050 |
10 |
1,221 |
46.7 |
06 |
PairSeq.py |
1,665,584 |
1 |
4,734 |
25.5 |
07 |
R1: BuildConsensus.py |
1,565,017 |
10 |
1,228 |
50.1 |
08 |
R2: BuildConsensus.py |
1,565,017 |
10 |
1,229 |
58.6 |
11 |
AssemblePairs.py align |
66,285 |
10 |
358 |
3.0 |
13 |
ParseHeaders.py collapse |
56,104 |
1 |
88 |
0.4 |
14 |
CollapseSeq.py |
55,480 |
1 |
822 |
0.7 |
15 |
SplitSeq.py group |
51,047 |
1 |
88 |
0.2 |