Roche 454 BCR mRNA with Multiplexed Samples

Overview of Experimental Data

This example uses publicly available data from:

Lineage structure of the human antibody repertoire in response to influenza vaccination.

Jiang N, He J, and Weinstein JA, et al.

Sci Transl Med. 2013. 5(171):171ra19. doi:10.1126/scitranslmed.3004794.

Which may be downloaded from the NCBI Sequence Read Archive under accession ID: SRX190717. For this example, we will use the first 50,000 sequences of sample 43 (accession: SRR765688), which may downloaded downloaded using fastq-dump from the SRA Toolkit:

fastq-dump -X 50000 SRR765688

Primer and sample barcode (referred to as MID in Jiang, He and Weinstein et al, 2013) sequences are available in the published manuscript. This example assumes that the sample barcodes, forward primers (V-region), and reverse primers (C-region) have been extracted from the manuscript and placed into three corresponding FASTA files.

Read Configuration

Example Data

We have hosted a small subset of the data (Accession: SRR765688) on the pRESTO website in FASTQ format, with accompanying primer and sample barcode (MID) files. The sample data set and workflow script may be downloaded from here:

Jiang, He and Weinstein et al, 2013 Example Files

Overview of the Workflow

The example that follows performs all processing steps to arrive at high-quality unique sequences using this example data set. The workflow is derived into three high level tasks:

Quality control

Sample barcode and primer identification

Deduplication and filtering

A graphical representation of the workflow along with the corresponding sequence of pRESTO commands is shown below.

Flowchart

Commands

#!/usr/bin/env bash
FilterSeq.py length -s SRR765688.fastq -n 300 --outname S43 --log FSL.log
FilterSeq.py quality -s S43_length-pass.fastq -q 20 --outname S43 --log FSQ.log
MaskPrimers.py score -s S43_quality-pass.fastq -p SRR765688_MIDs.fasta \
    --start 0 --maxerror 0.1 --mode cut --pf MID \
    --outname S43-MID --log MPM.log
MaskPrimers.py align -s S43-MID_primers-pass.fastq -p SRX190717_VPrimers.fasta \
    --maxlen 50 --maxerror 0.3 --mode mask --pf VPRIMER \
    --outname S43-FWD --log MPV.log
MaskPrimers.py align -s S43-FWD_primers-pass.fastq -p SRX190717_CPrimers.fasta \
    --maxlen 50 --maxerror 0.3 --mode cut --pf CPRIMER --revpr --skiprc \
    --outname S43-REV --log MPC.log
CollapseSeq.py -s S43-REV_primers-pass.fastq -n 20 --inner --uf MID CPRIMER \
    --cf VPRIMER --act set --outname S43
SplitSeq.py group -s S43_collapse-unique.fastq -f DUPCOUNT --num 2 --outname S43
ParseHeaders.py table -s S43_atleast-2.fastq -f ID DUPCOUNT MID CPRIMER VPRIMER
ParseLog.py -l FSL.log -f ID LENGTH
ParseLog.py -l FSQ.log -f ID QUALITY
ParseLog.py -l MPM.log MPV.log MPC.log -f ID PRSTART PRIMER ERROR

Download Commands

Quality control

The initial stage of the workflow involves two executions of the FilterSeq.py tool. First, the length subcommand is used to filter reads which are too short to yield full V(D)J sequences using a liberal minimum length requirement of 300 bp (-n 300):

2FilterSeq.py length -s SRR765688.fastq -n 300 --outname S43 --log FSL.log

Next, the quality subcommand removes sequences having a mean Phred quality score below 20 (-q 20).

FilterSeq.py quality -s S43_length-pass.fastq -q 20 --outname S43 --log FSQ.log

The ParseLog.py tool is then used to extract the results from the two FilterSeq.py log files:

17ParseLog.py -l FSL.log -f ID LENGTH

To create two tab-delimited files containing the following results for each read:

Field	Description
ID	Sequence name
LENGTH	Sequence length
QUALITY	Quality score

Sample barcode and primer identification

Annotation of sample barcodes

Following the initial filtering steps, additional filtering is performed with three iterations of the MaskPrimers.py tool based upon the presence of recognized sample barcode (MID), forward primer, and reverse primer sequences. As the orientation and position of the sample barcode is known, the first pass through MaskPrimers uses the faster score subcommand which requires a fixed start position (--start 0) and a low allowable error rate (--maxerror 0.1). The name of the MID will be annotated in the sequence header as the MID field (--pf MID):

MaskPrimers.py score -s S43_quality-pass.fastq -p SRR765688_MIDs.fasta \
    --start 0 --maxerror 0.1 --mode cut --pf MID \
    --outname S43-MID --log MPM.log

Primer masking and annotation

The next MaskPrimers.py task uses the align subcommand to identify both the start position of the V-segment primer and correct the orientation of the sequence such that all reads are now oriented in the direction of the V(D)J reading frame), as determined by the orientation of the V-segment primer match. The name of the V-segment primer will be annotated in the sequence header as the VPRIMER field (--pf VPRIMER):

MaskPrimers.py align -s S43-MID_primers-pass.fastq -p SRX190717_VPrimers.fasta \
    --maxlen 50 --maxerror 0.3 --mode mask --pf VPRIMER \
    --outname S43-FWD --log MPV.log

The final MaskPrimers task locates the C-region primer, which is used for isotype assignment of each read. As all sequences are assumed to have been properly oriented by the second MaskPrimers task, the additional arguments MaskPrimers align --revpr and MaskPrimers align --skiprc are added to the third execution. The MaskPrimers align --revpr argument informs the tool that primers sequences should be reverse complemented prior to alignment, and that a match should be searched for (and cut from) the tail end of the sequence. The MaskPrimers align --skiprc argument tells the tool to align against only the forward sequence; meaning, it will not check primer matches against the reverse complement sequence and it will not reorient sequences. The name of the C-region primer will be annotated in the sequence header as the CPRIMER field (--pf CPRIMER):

MaskPrimers.py align -s S43-FWD_primers-pass.fastq -p SRX190717_CPrimers.fasta \
    --maxlen 50 --maxerror 0.3 --mode cut --pf CPRIMER --revpr --skiprc \
    --outname S43-REV --log MPC.log

At this stage, a table of primers and alignment error rates may be generated by executing ParseLog on the log file of each MaskPrimers.py tasks:

19ParseLog.py -l MPM.log MPV.log MPC.log -f ID PRSTART PRIMER ERROR

Which will contain the following information for each log file:

Field	Description
ID	Sequence name
PRIMER	Primer or sample barcode name
ERROR	Primer match error rate

Deduplication and filtering

Removal of duplicate sequences

The final stage of the workflow involves two filtering steps to yield unique sequences for each sample barcode. First, the set of unique sequences is identified using the CollapseSeq.py tool, allowing for up to 20 interior N-valued positions (-n 20 and --inner), and requiring that all reads considered duplicated share the same isotype and sample barcode tag (--uf MID CPRIMER). Additionally, the V-region primer annotations of the set of duplicate reads are propagated into the annotation of each retained unique sequence (--cf VPRIMER and --act set):

CollapseSeq.py -s S43-REV_primers-pass.fastq -n 20 --inner --uf MID CPRIMER \
    --cf VPRIMER --act set --outname S43

Filtering to repeated sequences

CollapseSeq stores the count of duplicate reads for each sequence in the DUPCOUNT annotation. Following duplicate removal, the data is subset to only those unique sequence with at least two representative reads by using the group subcommand of SplitSeq.py on the count field (-f DUPCOUNT) and specifying a numeric threshold (--num 2):

SplitSeq.py group -s S43_collapse-unique.fastq -f DUPCOUNT --num 2 --outname S43

Creating an annotation table

Finally, the annotations, including the sample barcode (MID), duplicate read count (DUPCOUNT), isotype (CPRIMER) and V-region primer (VPRIMER), of the final repertoire are then extracted from the SplitSeq.py output into a tab-delimited file using the table subcommand of ParseHeaders.py:

ParseHeaders.py table -s S43_atleast-2.fastq -f ID DUPCOUNT MID CPRIMER VPRIMER

Note

Optionally, you may split each sample into separate files using the MID annotation and an alternate invocation of SplitSeq.py. The group subcommand may be used to split files on a categorical field, rather than a numerical field, by skipping the --num argument:

SplitSeq.py group -s M1_collapse-unique.fastq -f MID

Will split the unique sequence file into a set of separate files according the the valud in the MID field (-f MID), such that each file will contain sequences from only one sample.

Output files

The final set of sequences, which serve as input to a V(D)J reference aligner (Eg, IMGT/HighV-QUEST or IgBLAST), and tables that can be plotted for quality control are:

File	Description
S43_collapse-unique.fastq	Total unique sequences
S43_atleast-2.fastq	Unique sequences represented by at least 2 reads
S43_atleast-2_headers.tab	Annotation table of the atleast-2 file
FSL_table.tab	Table of the FilterSeq-length log
FSQ_table.tab	Table of the FilterSeq-quality log
MPM_table.tab	Table of the MID MaskPrimers log
MPV_table.tab	Table of the V-segment MaskPrimers log
MPC_table.tab	Table of the C-region MaskPrimers log

A number of other intermediate and log files are generated during the workflow, which allows easy tracking/reversion of processing steps. These files are not listed in the table above.

Performance

Example performance statistics for a comparable, but larger, 454 workflow are presented below. Performance was measured on a 64-core system with 2.3GHz AMD Opteron(TM) 6276 processors and 512GB of RAM, with memory usage measured at peak utilization. The data set contained 1,346,039 raw reads, and required matching of 11 sample barcodes, 11 V-segment primers, and 5 C-region primers.

Line	Tool	Reads	Cores	MB	Minutes
01	FilterSeq.py length	1,346,039	10	669	12.2
02	FilterSeq.py quality	1,184,905	10	619	11.5
03	MaskPrimers.py score	1,184,219	10	621	18.5
04	MaskPrimers.py align	1,177,245	10	618	263.1
05	MaskPrimers.py align	959,920	10	563	122.9
07	ParseHeaders.py expand	548,658	1	329	3.6
08	ParseHeaders.py rename	548,658	1	329	4.1
09	CollapseSeq.py	548,658	1	1,509	5.6