Roche 454 BCR mRNA with Multiplexed Samples

Overview of Experimental Data

This example uses publicly available data from:

Lineage structure of the human antibody repertoire in response to influenza vaccination.
Jiang N, He J, and Weinstein JA, et al.
Sci Transl Med. 2013. 5(171):171ra19. doi:10.1126/scitranslmed.3004794.

Which may be downloaded from the NCBI Sequence Read Archive under accession ID: SRX190717. For this example, we will use the first 50,000 sequences of sample 43 (accession: SRR765688), which may downloaded downloaded using fastq-dump from the SRA Toolkit:

fastq-dump -X 50000 SRR765688

Primer and sample barcode (referred to as MID in Jiang, He and Weinstein et al, 2013) sequences are available in the published manuscript. This example assumes that the sample barcodes, forward primers (V-region), and reverse primers (C-region) have been extracted from the manuscript and placed into three corresponding FASTA files.

Read Configuration

../_images/Jiang2013_ReadConfiguration.svg

Schematic of the Roche 454 read configuration. The start of each sequences is labeled with the sample barcode. The following sequence may either be in the forward (top) orientation, proceeding 5’ to 3’ in the direction of the V(D)J reading frame, or the reverse complement orientation (bottom), proceeding in the opposite direction.

Example Data

We have hosted a small subset of the data (Accession: SRR765688) on the pRESTO website in FASTQ format, with accompanying primer and sample barcode (MID) files. The sample data set and workflow script may be downloaded from here:

Jiang, He and Weinstein et al, 2013 Example Files

Overview of the Workflow

The example that follows performs all processing steps to arrive at high-quality unique sequences using this example data set. The workflow is derived into three high level tasks:

A graphical representation of the workflow along with the corresponding sequence of pRESTO commands is shown below.

Flowchart

../_images/Jiang2013_Flowchart.svg

Flowchart of processing steps. Each pRESTO tool is shown as a colored box. The workflow is divided into three primary tasks: (orange) quality control, (green) sample barcode and primer identification, (blue) deduplication and filtering of the repertoire. Grey boxes indicate the initial and final data files. The intermediate files output by each tool are not shown for the sake of brevity.

Commands

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
#!/usr/bin/env bash
FilterSeq.py length -s SRR765688.fastq -n 300 --outname S43 --log FSL.log
FilterSeq.py quality -s S43_length-pass.fastq -q 20 --outname S43 --log FSQ.log
MaskPrimers.py score -s S43_quality-pass.fastq -p SRR765688_MIDs.fasta \
    --start 0 --maxerror 0.1 --mode cut --outname S43-MID --log MPM.log
MaskPrimers.py align -s S43-MID_primers-pass.fastq -p SRX190717_VPrimers.fasta \
    --maxlen 50 --maxerror 0.3 --mode mask --outname S43-FWD --log MPV.log
MaskPrimers.py align -s S43-FWD_primers-pass.fastq -p SRX190717_CPrimers.fasta \
    --maxlen 50 --maxerror 0.3 --revpr --skiprc --mode cut \
    --outname S43-REV --log MPC.log
ParseHeaders.py expand -s S43-REV_primers-pass.fastq -f PRIMER
ParseHeaders.py rename -s S43-REV*reheader.fastq -f PRIMER1 PRIMER2 PRIMER3 \
    -k MID VPRIMER CPRIMER --outname S43
CollapseSeq.py -s S43_reheader.fastq -n 20 --inner --uf MID CPRIMER \
    --cf VPRIMER --act set --outname S43
SplitSeq.py group -s S43_collapse-unique.fastq -f DUPCOUNT --num 2 --outname S43
ParseHeaders.py table -s S43_atleast-2.fastq -f ID DUPCOUNT MID CPRIMER VPRIMER
ParseLog.py -l FSL.log -f ID LENGTH
ParseLog.py -l FSQ.log -f ID QUALITY
ParseLog.py -l MPM.log MPV.log MPC.log -f ID PRSTART PRIMER ERROR

Download Commands

Quality control

The initial stage of the workflow involves two executions of the FilterSeq tool. First, the length subcommand is used to filter reads which are too short to yield full V(D)J sequences using a liberal minimum length requirement of 300 bp (-n 300):

2
FilterSeq.py length -s SRR765688.fastq -n 300 --outname S43 --log FSL.log

Next, the quality subcommand removes sequences having a mean Phred quality score below 20 (-q 20).

3
FilterSeq.py quality -s S43_length-pass.fastq -q 20 --outname S43 --log FSQ.log

The ParseLog tool is then used to extract the results from the two FilterSeq log files:

18
19
ParseLog.py -l FSL.log -f ID LENGTH
ParseLog.py -l FSQ.log -f ID QUALITY

To create two tab-delimited files containing the following results for each read:

Field Description
ID Sequence name
LENGTH Sequence length
QUALITY Quality score

Sample barcode and primer identification

Annotation of sample barcodes

Following the initial filtering steps, additional filtering is performed with three iterations of the MaskPrimers tool based upon the presence of recognized sample barcode (MID), forward primer, and reverse primer sequences. As the orientation and position of the sample barcode is known, the first pass through MaskPrimers uses the faster score subcommand which requires a fixed start position (--start 0) and a low allowable error rate (--maxerror 0.1):

4
5
MaskPrimers.py score -s S43_quality-pass.fastq -p SRR765688_MIDs.fasta \
    --start 0 --maxerror 0.1 --mode cut --outname S43-MID --log MPM.log

Primer masking and annotation

The next MaskPrimers task uses the align subcommand to identify both the start position of the V-region primer and correct the orientation of the sequence such that all reads are now oriented in the direction of the V(D)J reading frame), as determined by the orientation of the V-region primer match:

6
7
MaskPrimers.py align -s S43-MID_primers-pass.fastq -p SRX190717_VPrimers.fasta \
    --maxlen 50 --maxerror 0.3 --mode mask --outname S43-FWD --log MPV.log

The final MaskPrimers task locates the C-region primer, which is used for isotype assignment of each read. As all sequences are assumed to have been properly oriented by the second MaskPrimers task, the additional arguments MaskPrimers align --revpr and MaskPrimers align --skiprc are added to the third execution. The MaskPrimers align --revpr argument informs the tool that primers sequences should be reverse complemented prior to alignment, and that a match should be searched for (and cut from) the tail end of the sequence. The MaskPrimers align --skiprc argument tells the tool to align against only the forward sequence; meaning, it will not check primer matches against the reverse complement sequence and it will not reorient sequences:

 8
 9
10
MaskPrimers.py align -s S43-FWD_primers-pass.fastq -p SRX190717_CPrimers.fasta \
    --maxlen 50 --maxerror 0.3 --revpr --skiprc --mode cut \
    --outname S43-REV --log MPC.log

At this stage, a table of primers and alignment error rates may be generated by executing ParseLog on the log file of each MaskPrimers tasks:

20
ParseLog.py -l MPM.log MPV.log MPC.log -f ID PRSTART PRIMER ERROR

Which will contain the following information for each log file:

Field Description
ID Sequence name
PRIMER Primer or sample barcode name
ERROR Primer match error rate

Modifying the sample barcode and primer annotations

During each iteration of the MaskPrimers tool the PRIMER annotation field was updated with an additional value, such that after three iterations each sequences contains an annotation of the form:

PRIMER=sample barcode,V-region primer,C-region primer

To simplify later analysis, the ParseHeaders tool is used to first expand this single annotation into three separate annotations using the expand subcommand, which are then renaming to MID, VPRIMER, and CPRIMER using the rename subcommand:

11
12
13
ParseHeaders.py expand -s S43-REV_primers-pass.fastq -f PRIMER
ParseHeaders.py rename -s S43-REV*reheader.fastq -f PRIMER1 PRIMER2 PRIMER3 \
    -k MID VPRIMER CPRIMER --outname S43

Deduplication and filtering

Removal of duplicate sequences

The final stage of the workflow involves two filtering steps to yield unique sequences for each sample barcode. First, the set of unique sequences is identified using the CollapseSeq tool, allowing for up to 20 interior N-valued positions (-n 20 and --inner), and requiring that all reads considered duplicated share the same isotype and sample barcode tag (--uf MID CPRIMER). Additionally, the V-region primer annotations of the set of duplicate reads are propagated into the annotation of each retained unique sequence (--cf VPRIMER and --act set):

14
15
CollapseSeq.py -s S43_reheader.fastq -n 20 --inner --uf MID CPRIMER \
    --cf VPRIMER --act set --outname S43

Filtering to repeated sequences

CollapseSeq stores the count of duplicate reads for each sequence in the DUPCOUNT annotation. Following duplicate removal, the data is subset to only those unique sequence with at least two representative reads by using the group subcommand of SplitSeq on the count field (-f DUPCOUNT) and specifying a numeric threshold (--num 2):

16
SplitSeq.py group -s S43_collapse-unique.fastq -f DUPCOUNT --num 2 --outname S43

Creating an annotation table

Finally, the annotations, including the sample barcode (MID), duplicate read count (DUPCOUNT), isotype (CPRIMER) and V-region primer (VPRIMER), of the final repertoire are then extracted from the SplitSeq output into a tab-delimited file using the table subcommand of ParseHeaders:

17
ParseHeaders.py table -s S43_atleast-2.fastq -f ID DUPCOUNT MID CPRIMER VPRIMER

Note

Optionally, you may split each sample into separate files using the MID annotation and an alternate invocation of SplitSeq. The group subcommand may be used to split files on a categorical field, rather than a numerical field, by skipping the --num argument:

SplitSeq.py group -s M1_collapse-unique.fastq -f MID

Will split the unique sequence file into a set of separate files according the the valud in the MID field (-f MID), such that each file will contain sequences from only one sample.

Output files

The final set of sequences, which serve as input to a V(D)J reference aligner (Eg, IMGT/HighV-QUEST or IgBLAST), and tables that can be plotted for quality control are:

File Description
S43_collapse-unique.fastq Total unique sequences
S43_atleast-2.fastq Unique sequences represented by at least 2 reads
S43_atleast-2_headers.tab Annotation table of the atleast-2 file
FSL_table.tab Table of the FilterSeq-length log
FSQ_table.tab Table of the FilterSeq-quality log
MPM_table.tab Table of the MID MaskPrimers log
MPV_table.tab Table of the V-region MaskPrimers log
MPC_table.tab Table of the C-region MaskPrimers log

A number of other intermediate and log files are generated during the workflow, which allows easy tracking/reversion of processing steps. These files are not listed in the table above.

Performance

Example performance statistics for a comparable, but larger, 454 workflow are presented below. Performance was measured on a 64-core system with 2.3GHz AMD Opteron(TM) 6276 processors and 512GB of RAM, with memory usage measured at peak utilization. The data set contained 1,346,039 raw reads, and required matching of 11 sample barcodes, 11 V-segment primers, and 5 constant region primers.

Line Tool Reads Cores MB Minutes
01 FilterSeq.py length 1,346,039 10 669 12.2
02 FilterSeq.py quality 1,184,905 10 619 11.5
03 MaskPrimers.py score 1,184,219 10 621 18.5
04 MaskPrimers.py align 1,177,245 10 618 263.1
05 MaskPrimers.py align 959,920 10 563 122.9
07 ParseHeaders.py expand 548,658 1 329 3.6
08 ParseHeaders.py rename 548,658 1 329 4.1
09 CollapseSeq.py 548,658 1 1,509 5.6