Roche 454 BCR mRNA with Multiplexed Samples

Overview of Experimental Data

This example uses publicly available data from:

Lineage structure of the human antibody repertoire in response to influenza vaccination.
Jiang N, He J, and Weinstein JA, et al.
Sci Transl Med. 2013. 5(171):171ra19. doi:10.1126/scitranslmed.3004794.

Which may be downloaded from the NCBI Sequence Read Archive under accession ID: SRX190717. For this example, we will use the first 50,000 sequences of sample 43 (accession: SRR765688), which may downloaded downloaded using fastq-dump from the SRA Toolkit:

fastq-dump -X 50000 SRR765688

Primer and sample barcode (referred to as MID in Jiang, He and Weinstein et al, 2013) sequences are available in the published manuscript. This example assumes that the sample barcodes, forward primers (V-region), and reverse primers (C-region) have been extracted from the manuscript and placed into three corresponding FASTA files.

Read Configuration

../_images/Jiang2013_ReadConfiguration.svg

Schematic of the Roche 454 read configuration. The start of each sequences is labeled with the sample barcode. The following sequence may either be in the forward (top) orientation, proceeding 5’ to 3’ in the direction of the V(D)J reading frame, or the reverse complement orientation (bottom), proceeding in the opposite direction.

Example Data

We have hosted a small subset of the data (Accession: SRR765688) on the pRESTO website in FASTQ format, with accompanying primer and sample barcode (MID) files. The sample data set and workflow script may be downloaded from here:

Jiang, He and Weinstein et al, 2013 Example Files

Overview of the Workflow

The example that follows performs all processing steps to arrive at high-quality unique sequences using this example data set. The workflow is derived into three high level tasks:

A graphical representation of the workflow along with the corresponding sequence of pRESTO commands is shown below.

Flowchart

../_images/Jiang2013_Flowchart.svg

Flowchart of processing steps. Each pRESTO tool is shown as a colored box. The workflow is divided into three primary tasks: (orange) quality control, (green) sample barcode and primer identification, (blue) deduplication and filtering of the repertoire. Grey boxes indicate the initial and final data files. The intermediate files output by each tool are not shown for the sake of brevity.

Commands

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
#!/usr/bin/env bash
FilterSeq.py length -s SRR765688.fastq -n 300 --outname S43 --log FSL.log
FilterSeq.py quality -s S43_length-pass.fastq -q 20 --outname S43 --log FSQ.log
MaskPrimers.py score -s S43_quality-pass.fastq -p SRR765688_MIDs.fasta \
    --start 0 --maxerror 0.1 --mode cut --pf MID \
    --outname S43-MID --log MPM.log
MaskPrimers.py align -s S43-MID_primers-pass.fastq -p SRX190717_VPrimers.fasta \
    --maxlen 50 --maxerror 0.3 --mode mask --pf VPRIMER \
    --outname S43-FWD --log MPV.log
MaskPrimers.py align -s S43-FWD_primers-pass.fastq -p SRX190717_CPrimers.fasta \
    --maxlen 50 --maxerror 0.3 --mode cut --pf CPRIMER --revpr --skiprc \
    --outname S43-REV --log MPC.log
CollapseSeq.py -s S43-REV_primers-pass.fastq -n 20 --inner --uf MID CPRIMER \
    --cf VPRIMER --act set --outname S43
SplitSeq.py group -s S43_collapse-unique.fastq -f DUPCOUNT --num 2 --outname S43
ParseHeaders.py table -s S43_atleast-2.fastq -f ID DUPCOUNT MID CPRIMER VPRIMER
ParseLog.py -l FSL.log -f ID LENGTH
ParseLog.py -l FSQ.log -f ID QUALITY
ParseLog.py -l MPM.log MPV.log MPC.log -f ID PRSTART PRIMER ERROR

Download Commands

Quality control

The initial stage of the workflow involves two executions of the FilterSeq tool. First, the length subcommand is used to filter reads which are too short to yield full V(D)J sequences using a liberal minimum length requirement of 300 bp (-n 300):

2
FilterSeq.py length -s SRR765688.fastq -n 300 --outname S43 --log FSL.log

Next, the quality subcommand removes sequences having a mean Phred quality score below 20 (-q 20).

3
FilterSeq.py quality -s S43_length-pass.fastq -q 20 --outname S43 --log FSQ.log

The ParseLog tool is then used to extract the results from the two FilterSeq log files:

17
ParseLog.py -l FSL.log -f ID LENGTH

To create two tab-delimited files containing the following results for each read:

Field

Description

ID

Sequence name

LENGTH

Sequence length

QUALITY

Quality score

Sample barcode and primer identification

Annotation of sample barcodes

Following the initial filtering steps, additional filtering is performed with three iterations of the MaskPrimers tool based upon the presence of recognized sample barcode (MID), forward primer, and reverse primer sequences. As the orientation and position of the sample barcode is known, the first pass through MaskPrimers uses the faster score subcommand which requires a fixed start position (--start 0) and a low allowable error rate (--maxerror 0.1). The name of the MID will be annotated in the sequence header as the MID field (--pf MID):

4
5
6
MaskPrimers.py score -s S43_quality-pass.fastq -p SRR765688_MIDs.fasta \
    --start 0 --maxerror 0.1 --mode cut --pf MID \
    --outname S43-MID --log MPM.log

Primer masking and annotation

The next MaskPrimers task uses the align subcommand to identify both the start position of the V-segment primer and correct the orientation of the sequence such that all reads are now oriented in the direction of the V(D)J reading frame), as determined by the orientation of the V-segment primer match. The name of the V-segment primer will be annotated in the sequence header as the VPRIMER field (--pf VPRIMER):

7
8
9
MaskPrimers.py align -s S43-MID_primers-pass.fastq -p SRX190717_VPrimers.fasta \
    --maxlen 50 --maxerror 0.3 --mode mask --pf VPRIMER \
    --outname S43-FWD --log MPV.log

The final MaskPrimers task locates the C-region primer, which is used for isotype assignment of each read. As all sequences are assumed to have been properly oriented by the second MaskPrimers task, the additional arguments MaskPrimers align --revpr and MaskPrimers align --skiprc are added to the third execution. The MaskPrimers align --revpr argument informs the tool that primers sequences should be reverse complemented prior to alignment, and that a match should be searched for (and cut from) the tail end of the sequence. The MaskPrimers align --skiprc argument tells the tool to align against only the forward sequence; meaning, it will not check primer matches against the reverse complement sequence and it will not reorient sequences. The name of the C-region primer will be annotated in the sequence header as the CPRIMER field (--pf CPRIMER):

10
11
12
MaskPrimers.py align -s S43-FWD_primers-pass.fastq -p SRX190717_CPrimers.fasta \
    --maxlen 50 --maxerror 0.3 --mode cut --pf CPRIMER --revpr --skiprc \
    --outname S43-REV --log MPC.log

At this stage, a table of primers and alignment error rates may be generated by executing ParseLog on the log file of each MaskPrimers tasks:

19
ParseLog.py -l MPM.log MPV.log MPC.log -f ID PRSTART PRIMER ERROR

Which will contain the following information for each log file:

Field

Description

ID

Sequence name

PRIMER

Primer or sample barcode name

ERROR

Primer match error rate

Deduplication and filtering

Removal of duplicate sequences

The final stage of the workflow involves two filtering steps to yield unique sequences for each sample barcode. First, the set of unique sequences is identified using the CollapseSeq tool, allowing for up to 20 interior N-valued positions (-n 20 and --inner), and requiring that all reads considered duplicated share the same isotype and sample barcode tag (--uf MID CPRIMER). Additionally, the V-region primer annotations of the set of duplicate reads are propagated into the annotation of each retained unique sequence (--cf VPRIMER and --act set):

13
14
CollapseSeq.py -s S43-REV_primers-pass.fastq -n 20 --inner --uf MID CPRIMER \
    --cf VPRIMER --act set --outname S43

Filtering to repeated sequences

CollapseSeq stores the count of duplicate reads for each sequence in the DUPCOUNT annotation. Following duplicate removal, the data is subset to only those unique sequence with at least two representative reads by using the group subcommand of SplitSeq on the count field (-f DUPCOUNT) and specifying a numeric threshold (--num 2):

15
SplitSeq.py group -s S43_collapse-unique.fastq -f DUPCOUNT --num 2 --outname S43

Creating an annotation table

Finally, the annotations, including the sample barcode (MID), duplicate read count (DUPCOUNT), isotype (CPRIMER) and V-region primer (VPRIMER), of the final repertoire are then extracted from the SplitSeq output into a tab-delimited file using the table subcommand of ParseHeaders:

16
ParseHeaders.py table -s S43_atleast-2.fastq -f ID DUPCOUNT MID CPRIMER VPRIMER

Note

Optionally, you may split each sample into separate files using the MID annotation and an alternate invocation of SplitSeq. The group subcommand may be used to split files on a categorical field, rather than a numerical field, by skipping the --num argument:

SplitSeq.py group -s M1_collapse-unique.fastq -f MID

Will split the unique sequence file into a set of separate files according the the valud in the MID field (-f MID), such that each file will contain sequences from only one sample.

Output files

The final set of sequences, which serve as input to a V(D)J reference aligner (Eg, IMGT/HighV-QUEST or IgBLAST), and tables that can be plotted for quality control are:

File

Description

S43_collapse-unique.fastq

Total unique sequences

S43_atleast-2.fastq

Unique sequences represented by at least 2 reads

S43_atleast-2_headers.tab

Annotation table of the atleast-2 file

FSL_table.tab

Table of the FilterSeq-length log

FSQ_table.tab

Table of the FilterSeq-quality log

MPM_table.tab

Table of the MID MaskPrimers log

MPV_table.tab

Table of the V-segment MaskPrimers log

MPC_table.tab

Table of the C-region MaskPrimers log

A number of other intermediate and log files are generated during the workflow, which allows easy tracking/reversion of processing steps. These files are not listed in the table above.

Performance

Example performance statistics for a comparable, but larger, 454 workflow are presented below. Performance was measured on a 64-core system with 2.3GHz AMD Opteron(TM) 6276 processors and 512GB of RAM, with memory usage measured at peak utilization. The data set contained 1,346,039 raw reads, and required matching of 11 sample barcodes, 11 V-segment primers, and 5 C-region primers.

Line

Tool

Reads

Cores

MB

Minutes

01

FilterSeq.py length

1,346,039

10

669

12.2

02

FilterSeq.py quality

1,184,905

10

619

11.5

03

MaskPrimers.py score

1,184,219

10

621

18.5

04

MaskPrimers.py align

1,177,245

10

618

263.1

05

MaskPrimers.py align

959,920

10

563

122.9

07

ParseHeaders.py expand

548,658

1

329

3.6

08

ParseHeaders.py rename

548,658

1

329

4.1

09

CollapseSeq.py

548,658

1

1,509

5.6