Illumina MiSeq 2x250 BCR mRNA

Overview of Experimental Data

This example uses publicly available data from:

Quantitative assessment of the robustness of next-generation sequencing of antibody variable gene repertoires from immunized mice.
Greiff, V. et al.
BMC Immunol. 2014. 15(1):40. doi:10.1186/s12865-014-0040-5.

Which may be downloaded from the EBI European Nucleotide Archive under accession ID: ERP003950. For this example, we will use the first 25,000 sequences of sample Replicate-1-M1 (accession: ERR346600), which may downloaded using fastq-dump from the SRA Toolkit:

fastq-dump --split-files -X 25000 ERR346600

Primers sequences are available in additional file 1 of the publication.

Read Configuration


Schematic of Illumina MiSeq 2x250 stranded paired-end reads without UMI barcodes. Each 250 base-pair read was sequenced from one end of the target cDNA, so that the two reads together cover the entire variable region of the Ig heavy chain. The V(D)J reading frame proceeds from the start of read 2 to the start of read 1. Read 1 is in the opposite orientation (reverse complement), and contains the C-region primer sequence. Both reads begin with a random sequences of 4 nucleotides.

Example Data

We have hosted a small subset of the data (Accession: ERR346600) on the pRESTO website in FASTQ format, with accompanying primer files and an example workflow script. The sample data set and workflow script may be downloaded from here:

Greiff et al, 2014 Example Files

Overview of the Workflow

In the following sections, we demonstrate each step of the workflow to move from raw sequence reads to a fully annotated repertoire of complete V(D)J sequences. The workflow is divided into four high-level tasks:

A graphical representation of the workflow along with the corresponding sequence of pRESTO commands is shown below.



Flowchart of processing steps. Each pRESTO tool is shown as a colored box. The workflow is divided into three primary tasks: (orange) paired-end assembly, (green) quality control and primer annotation, and deduplication and filtering (blue). The intermediate files output by each tool are not shown for the sake of brevity.


#!/usr/bin/env bash align -1 ERR346600_2.fastq -2 ERR346600_1.fastq \
    --coord sra --rc tail --outname M1 --log AP.log quality -s M1_assemble-pass.fastq -q 20 --outname M1 --log FS.log score -s M1_quality-pass.fastq -p Greiff2014_VPrimers.fasta \
    --start 4 --mode mask --pf VPRIMER --outname M1-FWD --log MPV.log score -s M1-FWD_primers-pass.fastq -p Greiff2014_CPrimers.fasta \
    --start 4 --mode cut --revpr --pf CPRIMER --outname M1-REV --log MPC.log -s M1-REV_primers-pass.fastq -n 20 --inner --uf CPRIMER \
    --cf VPRIMER --act set --outname M1 group -s M1_collapse-unique.fastq -f DUPCOUNT --num 2 --outname M1 table -s M1_atleast-2.fastq -f ID DUPCOUNT CPRIMER VPRIMER -l AP.log -f ID LENGTH OVERLAP ERROR PVALUE -l FS.log -f ID QUALITY -l MPV.log MPC.log -f ID PRIMER ERROR

Download Commands

Paired-end assembly

Each set of paired-ends mate-pairs is first assembled into a full length Ig sequence using the align subcommand of the AssemblePairs tool:

3 align -1 ERR346600_2.fastq -2 ERR346600_1.fastq \
    --coord sra --rc tail --outname M1 --log AP.log

During assembly we have defined read 2 (V-region) as the head of the sequence (-1) and read 1 as the tail of the sequence (-2). The --coord argument defines the format of the sequence header so that AssemblePairs can properly identify mate-pairs; in this case, we use --coord sra as our headers are in the SRA/ENA format.


For both the AssemblePairs and PairSeq commands using the correct --coord argument is critical for matching mate-pairs. If this was raw data from Illumina, rather than data downloaded from SRA/ENA, then the appropriate argument would be --coord illumina.

The ParseLog tool is then used to build a tab-delimited file of results from the AssemblePairs log file:


Which will containing the following columns:

Field Description
ID Sequence name
LENGTH Length of the assembled sequence
OVERLAP Length of the overlap between mate-pairs
ERROR Mismatch rate of the overlapping region
PVALUE P-value for the assembly

See also

Depending on the amplicon length in your data, not all mate-pairs may overlap. For the sake of simplicity, we have excluded a demonstration of assembly in such cases. pRESTO provides a couple approaches to deal with such reads. The reference subcommand of AssemblePairs can use the ungapped V-segment reference sequences to properly space non-overlapping reads. Or, if all else fails, the join subcommand can be used to simply stick mate-pairs together end-to-end with some intervening gap.

Quality control and primer annotation

Removal of low quality reads

Quality control begins with the identification and removal of low-quality reads using the quality subcommand of the FilterSeq tool. In this example, reads with mean Phred quality scores less than 20 (-q 20) are removed:

4 quality -s M1_assemble-pass.fastq -q 20 --outname M1 --log FS.log

The ParseLog tool is then used to build tab-delimited file from the FilterSeq log:

14 -l FS.log -f ID QUALITY

Capturing the following annotations:

Field Description
ID Sequence name
QUALITY Quality score

Read annotation and masking of primer regions

When dealing with Ig sequences, it is important to cut or mask the primers, as B cell receptors are subject to somatic hypermutation (the accumulation of point mutations in the DNA) and degenerate primer matches can look like mutations in downstream applications. The score subcommand of MaskPrimers is used to identify and remove the V-segment and C-region PCR primers for both reads:

8 score -s M1_quality-pass.fastq -p Greiff2014_VPrimers.fasta \
    --start 4 --mode mask --pf VPRIMER --outname M1-FWD --log MPV.log score -s M1-FWD_primers-pass.fastq -p Greiff2014_CPrimers.fasta \
    --start 4 --mode cut --revpr --pf CPRIMER --outname M1-REV --log MPC.log

In this data set the authors have added a random sequence of 4 bp to the start of each read before the primer sequence to increase sequence diversity and the reliability of cluster calling on the Illumina platform. As such, both primers begin at position 4 (--start 4), but the C-region primer begins 4 bases from the end of the assembled read. The addition of the --revpr argument to the second MaskPrimers step instructs the tool to reverse complement the primer sequences and check the tail of the read. The two primer regions have also been treated differently. The V-segment primer has been masked (replaced by Ns) using the --mode mask argument to preserve the V(D)J length, while the C-region primer has been removed from the sequence using the --mode cut argument.

During each iteration of the MaskPrimers tool an annotation field was added with the name of the primer, with the name of the field specified by the --pf argument, such that after both iterations each sequence contains an annotation of the form:

VPRIMER=V-segment primer|CPRIMER=C-region primer

To summarize these steps, the ParseLog tool is used to build tab-delimited files from the two MaskPrimers logs:

15 -l MPV.log MPC.log -f ID PRIMER ERROR

Capturing the following annotations:

Field Description
ID Sequence name
PRIMER Primer name
ERROR Primer match error rate


This library was prepared in a stranded manner. Meaning, the read orientation is constant for all reads; read 1 is always the C-region end of the amplicon and read 2 is always the V-segment end. If your data is unstranded (50% of the reads are forward, 50% are reversed), then you must modify the first MaskPrimers step to account for this by using the align subcommand instead: align -s M1*quality-pass.fastq -p Greiff2014_VPrimers.fasta \
    --maxlen 30 --mode mask --pf VPRIMER --log MP1.log

This will perform a slower process of locally aligning the primers, checking the reverse compliment of each read for matches, and correcting the the output sequences to the forward orientation (V to J).

Deduplication and filtering

Removal of duplicate sequences

The last stage of the workflow involves two filtering steps to yield the final repertoire. First, the set of unique sequences is identified using the CollapseSeq tool, allowing for up to 20 interior N-valued positions (-n 20 and --inner), and requiring that all reads considered duplicates share the same C-region primer annotation (--uf CPRIMER). Additionally, the V-segment primer annotations of the set of duplicate reads are propagated into the annotation of each retained unique sequence (--cf VPRIMER and --act set):

10 -s M1-REV_primers-pass.fastq -n 20 --inner --uf CPRIMER \
    --cf VPRIMER --act set --outname M1

Filtering to repeated sequences

CollapseSeq stores the count of duplicate reads for each sequence in the DUPCOUNT annotation. Following duplicate removal, the data is subset to only those unique sequence with at least two representative reads by using the group subcommand of SplitSeq on the count field (-f DUPCOUNT) and specifying a numeric threshold (--num 2):

11 group -s M1_collapse-unique.fastq -f DUPCOUNT --num 2 --outname M1

Creating an annotation table

Finally, the annotations, including duplicate read count (DUPCOUNT), isotype (CPRIMER) and V-segment primer (VPRIMER), of the final repertoire are then extracted from the SplitSeq output into a tab-delimited file using the table subcommand of ParseHeaders:

12 table -s M1_atleast-2.fastq -f ID DUPCOUNT CPRIMER VPRIMER

Output files

The final set of sequences, which serve as input to a V(D)J reference aligner (Eg, IMGT/HighV-QUEST or IgBLAST), and tables that can be plotted for quality control are:

File Description
M1_collapse-unique.fastq Total unique sequences
M1_atleast-2.fastq Unique sequences represented by at least 2 reads Annotation table of the atleast-2 file Table of the AssemblePairs log Table of the FilterSeq log Table of the V-segment MaskPrimers log Table of the C-region MaskPrimers log

A number of other intermediate and log files are generated during the workflow, which allows easy tracking/reversion of processing steps. These files are not listed in the table above.