Illumina MiSeq 2x250 BCR mRNA¶
Overview of Experimental Data¶
This example uses publicly available data from:
Quantitative assessment of the robustness of next-generation sequencing of antibody variable gene repertoires from immunized mice.Greiff, V. et al.BMC Immunol. 2014. 15(1):40. doi:10.1186/s12865-014-0040-5.
Which may be downloaded from the EBI European Nucleotide Archive under accession ID: ERP003950. For this example, we will use the first 25,000 sequences of sample Replicate-1-M1 (accession: ERR346600), which may downloaded using fastq-dump from the SRA Toolkit:
fastq-dump --split-files -X 25000 ERR346600
Primers sequences are available in additional file 1 of the publication.
Read Configuration¶
Schematic of Illumina MiSeq 2x250 stranded paired-end reads without UMI barcodes. Each 250 base-pair read was sequenced from one end of the target cDNA, so that the two reads together cover the entire variable region of the Ig heavy chain. The V(D)J reading frame proceeds from the start of read 2 to the start of read 1. Read 1 is in the opposite orientation (reverse complement), and contains the C-region primer sequence. Both reads begin with a random sequences of 4 nucleotides.¶
Example Data¶
We have hosted a small subset of the data (Accession: ERR346600) on the pRESTO website in FASTQ format, with accompanying primer files and an example workflow script. The sample data set and workflow script may be downloaded from here:
Overview of the Workflow¶
In the following sections, we demonstrate each step of the workflow to move from raw sequence reads to a fully annotated repertoire of complete V(D)J sequences. The workflow is divided into four high-level tasks:
A graphical representation of the workflow along with the corresponding sequence of pRESTO commands is shown below.
Flowchart¶
Flowchart of processing steps. Each pRESTO tool is shown as a colored box. The workflow is divided into three primary tasks: (orange) paired-end assembly, (green) quality control and primer annotation, and deduplication and filtering (blue). The intermediate files output by each tool are not shown for the sake of brevity.¶
Commands¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | #!/usr/bin/env bash AssemblePairs.py align -1 ERR346600_2.fastq -2 ERR346600_1.fastq \ --coord sra --rc tail --outname M1 --log AP.log FilterSeq.py quality -s M1_assemble-pass.fastq -q 20 --outname M1 --log FS.log MaskPrimers.py score -s M1_quality-pass.fastq -p Greiff2014_VPrimers.fasta \ --start 4 --mode mask --pf VPRIMER --outname M1-FWD --log MPV.log MaskPrimers.py score -s M1-FWD_primers-pass.fastq -p Greiff2014_CPrimers.fasta \ --start 4 --mode cut --revpr --pf CPRIMER --outname M1-REV --log MPC.log CollapseSeq.py -s M1-REV_primers-pass.fastq -n 20 --inner --uf CPRIMER \ --cf VPRIMER --act set --outname M1 SplitSeq.py group -s M1_collapse-unique.fastq -f DUPCOUNT --num 2 --outname M1 ParseHeaders.py table -s M1_atleast-2.fastq -f ID DUPCOUNT CPRIMER VPRIMER ParseLog.py -l AP.log -f ID LENGTH OVERLAP ERROR PVALUE ParseLog.py -l FS.log -f ID QUALITY ParseLog.py -l MPV.log MPC.log -f ID PRIMER ERROR |
Paired-end assembly¶
Each set of paired-ends mate-pairs is first assembled into a full length Ig sequence using the align subcommand of the AssemblePairs.py tool:
2 3 | AssemblePairs.py align -1 ERR346600_2.fastq -2 ERR346600_1.fastq \ --coord sra --rc tail --outname M1 --log AP.log |
During assembly we have defined read 2 (V-region) as the head of the sequence
(-1
) and read 1 as the tail of the sequence
(-2
). The
--coord
argument defines the format of
the sequence header so that AssemblePairs.py can properly identify mate-pairs;
in this case, we use --coord sra
as our headers are in the SRA/ENA format.
Note
For both the AssemblePairs.py and PairSeq.py commands using the
correct --coord
argument is critical
for matching mate-pairs. If this was raw data from Illumina, rather than
data downloaded from SRA/ENA, then the appropriate argument would be
--coord illumina
.
The ParseLog.py tool is then used to build a tab-delimited file of results from the AssemblePairs.py log file:
13
ParseLog.py -l AP.log -f ID LENGTH OVERLAP ERROR PVALUE
Which will containing the following columns:
Field |
Description |
---|---|
ID |
Sequence name |
LENGTH |
Length of the assembled sequence |
OVERLAP |
Length of the overlap between mate-pairs |
ERROR |
Mismatch rate of the overlapping region |
PVALUE |
P-value for the assembly |
See also
Depending on the amplicon length in your data, not all mate-pairs may overlap. For the sake of simplicity, we have excluded a demonstration of assembly in such cases. pRESTO provides a couple approaches to deal with such reads. The reference subcommand of AssemblePairs.py can use the ungapped V-segment reference sequences to properly space non-overlapping reads. Or, if all else fails, the join subcommand can be used to simply stick mate-pairs together end-to-end with some intervening gap.
Quality control and primer annotation¶
Removal of low quality reads¶
Quality control begins with the identification and removal of
low-quality reads using the quality subcommand of the FilterSeq.py tool.
In this example, reads with mean Phred quality scores less than
20 (-q 20
) are removed:
4 | FilterSeq.py quality -s M1_assemble-pass.fastq -q 20 --outname M1 --log FS.log |
The ParseLog.py tool is then used to build tab-delimited file from the FilterSeq.py log:
14 | ParseLog.py -l FS.log -f ID QUALITY |
Capturing the following annotations:
Field |
Description |
---|---|
ID |
Sequence name |
QUALITY |
Quality score |
Read annotation and masking of primer regions¶
When dealing with Ig sequences, it is important to cut or mask the primers, as B cell receptors are subject to somatic hypermutation (the accumulation of point mutations in the DNA) and degenerate primer matches can look like mutations in downstream applications. The score subcommand of MaskPrimers.py is used to identify and remove the V-segment and C-region PCR primers for both reads:
5 6 7 8 | MaskPrimers.py score -s M1_quality-pass.fastq -p Greiff2014_VPrimers.fasta \ --start 4 --mode mask --pf VPRIMER --outname M1-FWD --log MPV.log MaskPrimers.py score -s M1-FWD_primers-pass.fastq -p Greiff2014_CPrimers.fasta \ --start 4 --mode cut --revpr --pf CPRIMER --outname M1-REV --log MPC.log |
In this data set the authors have added a random sequence of 4 bp to the
start of each read before the primer sequence to increase sequence diversity
and the reliability of cluster calling on the Illumina platform.
As such, both primers begin at position 4 (--start 4
),
but the C-region primer begins 4 bases from the end of the assembled read.
The addition of the --revpr
argument to the
second MaskPrimers.py step instructs the tool to reverse complement the primer
sequences and check the tail of the read. The two primer regions have also been treated
differently. The V-segment primer has been masked (replaced by Ns) using the
--mode mask
argument
to preserve the V(D)J length, while the C-region primer has been removed
from the sequence using the --mode cut
argument.
During each iteration of the MaskPrimers.py tool an
annotation field was added with the name of the primer, with the name of the field
specified by the --pf
argument, such that after both
iterations each sequence contains an annotation of the form:
VPRIMER=V-segment primer|CPRIMER=C-region primer
To summarize these steps, the ParseLog.py tool is used to build tab-delimited files from the two MaskPrimers.py logs:
15 | ParseLog.py -l MPV.log MPC.log -f ID PRIMER ERROR |
Capturing the following annotations:
Field |
Description |
---|---|
ID |
Sequence name |
PRIMER |
Primer name |
ERROR |
Primer match error rate |
Note
This library was prepared in a stranded manner. Meaning, the read orientation is constant for all reads; read 1 is always the C-region end of the amplicon and read 2 is always the V-segment end. If your data is unstranded (50% of the reads are forward, 50% are reversed), then you must modify the first MaskPrimers.py step to account for this by using the align subcommand instead:
MaskPrimers.py align -s M1*quality-pass.fastq -p Greiff2014_VPrimers.fasta \
--maxlen 30 --mode mask --pf VPRIMER --log MP1.log
This will perform a slower process of locally aligning the primers, checking the reverse compliment of each read for matches, and correcting the the output sequences to the forward orientation (V to J).
Deduplication and filtering¶
Removal of duplicate sequences¶
The last stage of the workflow involves two filtering steps to yield
the final repertoire. First, the set of unique sequences is identified using the
CollapseSeq.py tool, allowing for up to 20 interior N-valued positions
(-n 20
and --inner
),
and requiring that all reads considered duplicates share the same C-region primer
annotation (--uf CPRIMER
). Additionally, the
V-segment primer annotations of the set of duplicate reads are propagated into the
annotation of each retained unique sequence (--cf VPRIMER
and --act set
):
9 10 | CollapseSeq.py -s M1-REV_primers-pass.fastq -n 20 --inner --uf CPRIMER \ --cf VPRIMER --act set --outname M1 |
Filtering to repeated sequences¶
CollapseSeq stores the count of duplicate reads for each sequence in the
DUPCOUNT
annotation. Following duplicate removal, the data is subset
to only those unique sequence with at least two representative reads by
using the group subcommand of SplitSeq.py on the count field
(-f DUPCOUNT
) and specifying a numeric threshold
(--num 2
):
11 | SplitSeq.py group -s M1_collapse-unique.fastq -f DUPCOUNT --num 2 --outname M1 |
Creating an annotation table¶
Finally, the annotations, including duplicate read count (DUPCOUNT
), isotype
(CPRIMER
) and V-segment primer (VPRIMER
), of the final repertoire are then
extracted from the SplitSeq.py output into a tab-delimited file using the
table subcommand of ParseHeaders.py:
12 | ParseHeaders.py table -s M1_atleast-2.fastq -f ID DUPCOUNT CPRIMER VPRIMER |
Output files¶
The final set of sequences, which serve as input to a V(D)J reference aligner (Eg, IMGT/HighV-QUEST or IgBLAST), and tables that can be plotted for quality control are:
File |
Description |
---|---|
M1_collapse-unique.fastq |
Total unique sequences |
M1_atleast-2.fastq |
Unique sequences represented by at least 2 reads |
M1_atleast-2_headers.tab |
Annotation table of the atleast-2 file |
AP_table.tab |
Table of the AssemblePairs log |
FS_table.tab |
Table of the FilterSeq log |
MPV_table.tab |
Table of the V-segment MaskPrimers log |
MPC_table.tab |
Table of the C-region MaskPrimers log |
A number of other intermediate and log files are generated during the workflow, which allows easy tracking/reversion of processing steps. These files are not listed in the table above.