Illumina MiSeq 2x250 BCR mRNA
Overview of Experimental Data
This example uses publicly available data from:
Quantitative assessment of the robustness of next-generation sequencing of antibody variable gene repertoires from immunized mice.Greiff, V. et al.BMC Immunol. 2014. 15(1):40. doi:10.1186/s12865-014-0040-5.
Which may be downloaded from the EBI European Nucleotide Archive under accession ID: ERP003950. For this example, we will use the first 25,000 sequences of sample Replicate-1-M1 (accession: ERR346600), which may downloaded using fastq-dump from the SRA Toolkit:
fastq-dump --split-files -X 25000 ERR346600
Primers sequences are available in additional file 1 of the publication.
Read Configuration
Example Data
We have hosted a small subset of the data (Accession: ERR346600) on the pRESTO website in FASTQ format, with accompanying primer files and an example workflow script. The sample data set and workflow script may be downloaded from here:
Overview of the Workflow
In the following sections, we demonstrate each step of the workflow to move from raw sequence reads to a fully annotated repertoire of complete V(D)J sequences. The workflow is divided into four high-level tasks:
A graphical representation of the workflow along with the corresponding sequence of pRESTO commands is shown below.
Flowchart
Commands
1#!/usr/bin/env bash
2AssemblePairs.py align -1 ERR346600_2.fastq -2 ERR346600_1.fastq \
3 --coord sra --rc tail --outname M1 --log AP.log
4FilterSeq.py quality -s M1_assemble-pass.fastq -q 20 --outname M1 --log FS.log
5MaskPrimers.py score -s M1_quality-pass.fastq -p Greiff2014_VPrimers.fasta \
6 --start 4 --mode mask --pf VPRIMER --outname M1-FWD --log MPV.log
7MaskPrimers.py score -s M1-FWD_primers-pass.fastq -p Greiff2014_CPrimers.fasta \
8 --start 4 --mode cut --revpr --pf CPRIMER --outname M1-REV --log MPC.log
9CollapseSeq.py -s M1-REV_primers-pass.fastq -n 20 --inner --uf CPRIMER \
10 --cf VPRIMER --act set --outname M1
11SplitSeq.py group -s M1_collapse-unique.fastq -f DUPCOUNT --num 2 --outname M1
12ParseHeaders.py table -s M1_atleast-2.fastq -f ID DUPCOUNT CPRIMER VPRIMER
13ParseLog.py -l AP.log -f ID LENGTH OVERLAP ERROR PVALUE
14ParseLog.py -l FS.log -f ID QUALITY
15ParseLog.py -l MPV.log MPC.log -f ID PRIMER ERROR
Paired-end assembly
Each set of paired-ends mate-pairs is first assembled into a full length Ig sequence using the align subcommand of the AssemblePairs.py tool:
2AssemblePairs.py align -1 ERR346600_2.fastq -2 ERR346600_1.fastq \
3 --coord sra --rc tail --outname M1 --log AP.log
During assembly we have defined read 2 (V-region) as the head of the sequence
(-1
) and read 1 as the tail of the sequence
(-2
). The
--coord
argument defines the format of
the sequence header so that AssemblePairs.py can properly identify mate-pairs;
in this case, we use --coord sra
as our headers are in the SRA/ENA format.
Note
For both the AssemblePairs.py and PairSeq.py commands using the
correct --coord
argument is critical
for matching mate-pairs. If this was raw data from Illumina, rather than
data downloaded from SRA/ENA, then the appropriate argument would be
--coord illumina
.
The ParseLog.py tool is then used to build a tab-delimited file of results from the AssemblePairs.py log file:
13ParseLog.py -l AP.log -f ID LENGTH OVERLAP ERROR PVALUE
Which will containing the following columns:
Field |
Description |
---|---|
ID |
Sequence name |
LENGTH |
Length of the assembled sequence |
OVERLAP |
Length of the overlap between mate-pairs |
ERROR |
Mismatch rate of the overlapping region |
PVALUE |
P-value for the assembly |
See also
Depending on the amplicon length in your data, not all mate-pairs may overlap. For the sake of simplicity, we have excluded a demonstration of assembly in such cases. pRESTO provides a couple approaches to deal with such reads. The reference subcommand of AssemblePairs.py can use the ungapped V-segment reference sequences to properly space non-overlapping reads. Or, if all else fails, the join subcommand can be used to simply stick mate-pairs together end-to-end with some intervening gap.
Quality control and primer annotation
Removal of low quality reads
Quality control begins with the identification and removal of
low-quality reads using the quality subcommand of the FilterSeq.py tool.
In this example, reads with mean Phred quality scores less than
20 (-q 20
) are removed:
4FilterSeq.py quality -s M1_assemble-pass.fastq -q 20 --outname M1 --log FS.log
The ParseLog.py tool is then used to build tab-delimited file from the FilterSeq.py log:
14ParseLog.py -l FS.log -f ID QUALITY
Capturing the following annotations:
Field |
Description |
---|---|
ID |
Sequence name |
QUALITY |
Quality score |
Read annotation and masking of primer regions
When dealing with Ig sequences, it is important to cut or mask the primers, as B cell receptors are subject to somatic hypermutation (the accumulation of point mutations in the DNA) and degenerate primer matches can look like mutations in downstream applications. The score subcommand of MaskPrimers.py is used to identify and remove the V-segment and C-region PCR primers for both reads:
5MaskPrimers.py score -s M1_quality-pass.fastq -p Greiff2014_VPrimers.fasta \
6 --start 4 --mode mask --pf VPRIMER --outname M1-FWD --log MPV.log
7MaskPrimers.py score -s M1-FWD_primers-pass.fastq -p Greiff2014_CPrimers.fasta \
8 --start 4 --mode cut --revpr --pf CPRIMER --outname M1-REV --log MPC.log
In this data set the authors have added a random sequence of 4 bp to the
start of each read before the primer sequence to increase sequence diversity
and the reliability of cluster calling on the Illumina platform.
As such, both primers begin at position 4 (--start 4
),
but the C-region primer begins 4 bases from the end of the assembled read.
The addition of the --revpr
argument to the
second MaskPrimers.py step instructs the tool to reverse complement the primer
sequences and check the tail of the read. The two primer regions have also been treated
differently. The V-segment primer has been masked (replaced by Ns) using the
--mode mask
argument
to preserve the V(D)J length, while the C-region primer has been removed
from the sequence using the --mode cut
argument.
During each iteration of the MaskPrimers.py tool an
annotation field was added with the name of the primer, with the name of the field
specified by the --pf
argument, such that after both
iterations each sequence contains an annotation of the form:
VPRIMER=V-segment primer|CPRIMER=C-region primer
To summarize these steps, the ParseLog.py tool is used to build tab-delimited files from the two MaskPrimers.py logs:
15ParseLog.py -l MPV.log MPC.log -f ID PRIMER ERROR
Capturing the following annotations:
Field |
Description |
---|---|
ID |
Sequence name |
PRIMER |
Primer name |
ERROR |
Primer match error rate |
Note
This library was prepared in a stranded manner. Meaning, the read orientation is constant for all reads; read 1 is always the C-region end of the amplicon and read 2 is always the V-segment end. If your data is unstranded (50% of the reads are forward, 50% are reversed), then you must modify the first MaskPrimers.py step to account for this by using the align subcommand instead:
MaskPrimers.py align -s M1*quality-pass.fastq -p Greiff2014_VPrimers.fasta \
--maxlen 30 --mode mask --pf VPRIMER --log MP1.log
This will perform a slower process of locally aligning the primers, checking the reverse compliment of each read for matches, and correcting the the output sequences to the forward orientation (V to J).
Deduplication and filtering
Removal of duplicate sequences
The last stage of the workflow involves two filtering steps to yield
the final repertoire. First, the set of unique sequences is identified using the
CollapseSeq.py tool, allowing for up to 20 interior N-valued positions
(-n 20
and --inner
),
and requiring that all reads considered duplicates share the same C-region primer
annotation (--uf CPRIMER
). Additionally, the
V-segment primer annotations of the set of duplicate reads are propagated into the
annotation of each retained unique sequence (--cf VPRIMER
and --act set
):
9CollapseSeq.py -s M1-REV_primers-pass.fastq -n 20 --inner --uf CPRIMER \
10 --cf VPRIMER --act set --outname M1
Filtering to repeated sequences
CollapseSeq stores the count of duplicate reads for each sequence in the
DUPCOUNT
annotation. Following duplicate removal, the data is subset
to only those unique sequence with at least two representative reads by
using the group subcommand of SplitSeq.py on the count field
(-f DUPCOUNT
) and specifying a numeric threshold
(--num 2
):
11SplitSeq.py group -s M1_collapse-unique.fastq -f DUPCOUNT --num 2 --outname M1
Creating an annotation table
Finally, the annotations, including duplicate read count (DUPCOUNT
), isotype
(CPRIMER
) and V-segment primer (VPRIMER
), of the final repertoire are then
extracted from the SplitSeq.py output into a tab-delimited file using the
table subcommand of ParseHeaders.py:
12ParseHeaders.py table -s M1_atleast-2.fastq -f ID DUPCOUNT CPRIMER VPRIMER
Output files
The final set of sequences, which serve as input to a V(D)J reference aligner (Eg, IMGT/HighV-QUEST or IgBLAST), and tables that can be plotted for quality control are:
File |
Description |
---|---|
M1_collapse-unique.fastq |
Total unique sequences |
M1_atleast-2.fastq |
Unique sequences represented by at least 2 reads |
M1_atleast-2_headers.tab |
Annotation table of the atleast-2 file |
AP_table.tab |
Table of the AssemblePairs log |
FS_table.tab |
Table of the FilterSeq log |
MPV_table.tab |
Table of the V-segment MaskPrimers log |
MPC_table.tab |
Table of the C-region MaskPrimers log |
A number of other intermediate and log files are generated during the workflow, which allows easy tracking/reversion of processing steps. These files are not listed in the table above.