Illumina MiSeq 2x250 BCR mRNA¶

Overview of Experimental Data¶

This example uses publicly available data from:

Quantitative assessment of the robustness of next-generation sequencing of antibody variable gene repertoires from immunized mice.

Greiff, V. et al.

BMC Immunol. 2014. 15(1):40. doi:10.1186/s12865-014-0040-5.

Which may be downloaded from the EBI European Nucleotide Archive under accession ID: ERP003950. For this example, we will use the first 25,000 sequences of sample Replicate-1-M1 (accession: ERR346600), which may downloaded using fastq-dump from the SRA Toolkit:

fastq-dump --split-files -X 25000 ERR346600

Primers sequences are available in additional file 1 of the publication.

Read Configuration¶

Schematic of Illumina MiSeq 2x250 stranded paired-end reads without UMI barcodes. Each 250 base-pair read was sequenced from one end of the target cDNA, so that the two reads together cover the entire variable region of the Ig heavy chain. The V(D)J reading frame proceeds from the start of read 2 to the start of read 1. Read 1 is in the opposite orientation (reverse complement), and contains the C-region primer sequence. Both reads begin with a random sequences of 4 nucleotides.

Example Data¶

We have hosted a small subset of the data (Accession: ERR346600) on the pRESTO website in FASTQ format, with accompanying primer files and an example workflow script. The sample data set and workflow script may be downloaded from here:

Greiff et al, 2014 Example Files

Overview of the Workflow¶

In the following sections, we demonstrate each step of the workflow to move from raw sequence reads to a fully annotated repertoire of complete V(D)J sequences. The workflow is divided into four high-level tasks:

Paired-end assembly

Quality control and primer annotation

Deduplication and filtering

A graphical representation of the workflow along with the corresponding sequence of pRESTO commands is shown below.

Flowchart¶

Flowchart of processing steps. Each pRESTO tool is shown as a colored box. The workflow is divided into three primary tasks: (orange) paired-end assembly, (green) quality control and primer annotation, and deduplication and filtering (blue). The intermediate files output by each tool are not shown for the sake of brevity.

Commands¶

#!/usr/bin/env bash
AssemblePairs.py align -1 ERR346600_2.fastq -2 ERR346600_1.fastq \
    --coord sra --rc tail --outname M1 --log AP.log
FilterSeq.py quality -s M1_assemble-pass.fastq -q 20 --outname M1 --log FS.log
MaskPrimers.py score -s M1_quality-pass.fastq -p Greiff2014_VPrimers.fasta \
    --start 4 --mode mask --pf VPRIMER --outname M1-FWD --log MPV.log
MaskPrimers.py score -s M1-FWD_primers-pass.fastq -p Greiff2014_CPrimers.fasta \
    --start 4 --mode cut --revpr --pf CPRIMER --outname M1-REV --log MPC.log
CollapseSeq.py -s M1-REV_primers-pass.fastq -n 20 --inner --uf CPRIMER \
    --cf VPRIMER --act set --outname M1
SplitSeq.py group -s M1_collapse-unique.fastq -f DUPCOUNT --num 2 --outname M1
ParseHeaders.py table -s M1_atleast-2.fastq -f ID DUPCOUNT CPRIMER VPRIMER
ParseLog.py -l AP.log -f ID LENGTH OVERLAP ERROR PVALUE
ParseLog.py -l FS.log -f ID QUALITY
ParseLog.py -l MPV.log MPC.log -f ID PRIMER ERROR

Download Commands

Paired-end assembly¶

Each set of paired-ends mate-pairs is first assembled into a full length Ig sequence using the align subcommand of the AssemblePairs tool:

AssemblePairs.py align -1 ERR346600_2.fastq -2 ERR346600_1.fastq \
    --coord sra --rc tail --outname M1 --log AP.log

During assembly we have defined read 2 (V-region) as the head of the sequence (-1) and read 1 as the tail of the sequence (-2). The --coord argument defines the format of the sequence header so that AssemblePairs can properly identify mate-pairs; in this case, we use --coord sra as our headers are in the SRA/ENA format.

Note

For both the AssemblePairs and PairSeq commands using the correct --coord argument is critical for matching mate-pairs. If this was raw data from Illumina, rather than data downloaded from SRA/ENA, then the appropriate argument would be --coord illumina.

The ParseLog tool is then used to build a tab-delimited file of results from the AssemblePairs log file:

13	ParseLog.py -l AP.log -f ID LENGTH OVERLAP ERROR PVALUE

Which will containing the following columns:

Field	Description
ID	Sequence name
LENGTH	Length of the assembled sequence
OVERLAP	Length of the overlap between mate-pairs
ERROR	Mismatch rate of the overlapping region
PVALUE	P-value for the assembly

Quality control and primer annotation¶

Removal of low quality reads¶

Quality control begins with the identification and removal of low-quality reads using the quality subcommand of the FilterSeq tool. In this example, reads with mean Phred quality scores less than 20 (-q 20) are removed:

FilterSeq.py quality -s M1_assemble-pass.fastq -q 20 --outname M1 --log FS.log

The ParseLog tool is then used to build tab-delimited file from the FilterSeq log:

14	ParseLog.py -l FS.log -f ID QUALITY

Capturing the following annotations:

Field	Description
ID	Sequence name
QUALITY	Quality score

Read annotation and masking of primer regions¶

When dealing with Ig sequences, it is important to cut or mask the primers, as B cell receptors are subject to somatic hypermutation (the accumulation of point mutations in the DNA) and degenerate primer matches can look like mutations in downstream applications. The score subcommand of MaskPrimers is used to identify and remove the V-segment and C-region PCR primers for both reads:

MaskPrimers.py score -s M1_quality-pass.fastq -p Greiff2014_VPrimers.fasta \
    --start 4 --mode mask --pf VPRIMER --outname M1-FWD --log MPV.log
MaskPrimers.py score -s M1-FWD_primers-pass.fastq -p Greiff2014_CPrimers.fasta \
    --start 4 --mode cut --revpr --pf CPRIMER --outname M1-REV --log MPC.log

In this data set the authors have added a random sequence of 4 bp to the start of each read before the primer sequence to increase sequence diversity and the reliability of cluster calling on the Illumina platform. As such, both primers begin at position 4 (--start 4), but the C-region primer begins 4 bases from the end of the assembled read. The addition of the --revpr argument to the second MaskPrimers step instructs the tool to reverse complement the primer sequences and check the tail of the read. The two primer regions have also been treated differently. The V-segment primer has been masked (replaced by Ns) using the --mode mask argument to preserve the V(D)J length, while the C-region primer has been removed from the sequence using the --mode cut argument.

During each iteration of the MaskPrimers tool an annotation field was added with the name of the primer, with the name of the field specified by the --pf argument, such that after both iterations each sequence contains an annotation of the form:

VPRIMER=V-segment primer|CPRIMER=C-region primer

To summarize these steps, the ParseLog tool is used to build tab-delimited files from the two MaskPrimers logs:

15	ParseLog.py -l MPV.log MPC.log -f ID PRIMER ERROR

Capturing the following annotations:

Field	Description
ID	Sequence name
PRIMER	Primer name
ERROR	Primer match error rate

Note

This library was prepared in a stranded manner. Meaning, the read orientation is constant for all reads; read 1 is always the C-region end of the amplicon and read 2 is always the V-segment end. If your data is unstranded (50% of the reads are forward, 50% are reversed), then you must modify the first MaskPrimers step to account for this by using the align subcommand instead:

MaskPrimers.py align -s M1*quality-pass.fastq -p Greiff2014_VPrimers.fasta \
    --maxlen 30 --mode mask --pf VPRIMER --log MP1.log

This will perform a slower process of locally aligning the primers, checking the reverse compliment of each read for matches, and correcting the the output sequences to the forward orientation (V to J).

Deduplication and filtering¶

Removal of duplicate sequences¶

The last stage of the workflow involves two filtering steps to yield the final repertoire. First, the set of unique sequences is identified using the CollapseSeq tool, allowing for up to 20 interior N-valued positions (-n 20 and --inner), and requiring that all reads considered duplicates share the same C-region primer annotation (--uf CPRIMER). Additionally, the V-segment primer annotations of the set of duplicate reads are propagated into the annotation of each retained unique sequence (--cf VPRIMER and --act set):

CollapseSeq.py -s M1-REV_primers-pass.fastq -n 20 --inner --uf CPRIMER \
    --cf VPRIMER --act set --outname M1

Filtering to repeated sequences¶

CollapseSeq stores the count of duplicate reads for each sequence in the DUPCOUNT annotation. Following duplicate removal, the data is subset to only those unique sequence with at least two representative reads by using the group subcommand of SplitSeq on the count field (-f DUPCOUNT) and specifying a numeric threshold (--num 2):

SplitSeq.py group -s M1_collapse-unique.fastq -f DUPCOUNT --num 2 --outname M1

Creating an annotation table¶

Finally, the annotations, including duplicate read count (DUPCOUNT), isotype (CPRIMER) and V-segment primer (VPRIMER), of the final repertoire are then extracted from the SplitSeq output into a tab-delimited file using the table subcommand of ParseHeaders:

12	ParseHeaders.py table -s M1_atleast-2.fastq -f ID DUPCOUNT CPRIMER VPRIMER

Output files¶

The final set of sequences, which serve as input to a V(D)J reference aligner (Eg, IMGT/HighV-QUEST or IgBLAST), and tables that can be plotted for quality control are:

File	Description
M1_collapse-unique.fastq	Total unique sequences
M1_atleast-2.fastq	Unique sequences represented by at least 2 reads
M1_atleast-2_headers.tab	Annotation table of the atleast-2 file
AP_table.tab	Table of the AssemblePairs log
FS_table.tab	Table of the FilterSeq log
MPV_table.tab	Table of the V-segment MaskPrimers log
MPC_table.tab	Table of the C-region MaskPrimers log

A number of other intermediate and log files are generated during the workflow, which allows easy tracking/reversion of processing steps. These files are not listed in the table above.