Roche 454 BCR mRNA with Multiplexed Samples
Overview of Experimental Data
This example uses publicly available data from:
Lineage structure of the human antibody repertoire in response to influenza vaccination.Jiang N, He J, and Weinstein JA, et al.Sci Transl Med. 2013. 5(171):171ra19. doi:10.1126/scitranslmed.3004794.
Which may be downloaded from the NCBI Sequence Read Archive under accession ID: SRX190717. For this example, we will use the first 50,000 sequences of sample 43 (accession: SRR765688), which may downloaded downloaded using fastq-dump from the SRA Toolkit:
fastq-dump -X 50000 SRR765688
Primer and sample barcode (referred to as MID in Jiang, He and Weinstein et al, 2013) sequences are available in the published manuscript. This example assumes that the sample barcodes, forward primers (V-region), and reverse primers (C-region) have been extracted from the manuscript and placed into three corresponding FASTA files.
Read Configuration
Schematic of the Roche 454 read configuration. The start of each sequences is labeled with the sample barcode. The following sequence may either be in the forward (top) orientation, proceeding 5’ to 3’ in the direction of the V(D)J reading frame, or the reverse complement orientation (bottom), proceeding in the opposite direction.
Example Data
We have hosted a small subset of the data (Accession: SRR765688) on the pRESTO website in FASTQ format, with accompanying primer and sample barcode (MID) files. The sample data set and workflow script may be downloaded from here:
Overview of the Workflow
The example that follows performs all processing steps to arrive at high-quality unique sequences using this example data set. The workflow is derived into three high level tasks:
A graphical representation of the workflow along with the corresponding sequence of pRESTO commands is shown below.
Flowchart
Flowchart of processing steps. Each pRESTO tool is shown as a colored box. The workflow is divided into three primary tasks: (orange) quality control, (green) sample barcode and primer identification, (blue) deduplication and filtering of the repertoire. Grey boxes indicate the initial and final data files. The intermediate files output by each tool are not shown for the sake of brevity.
Commands
1#!/usr/bin/env bash
2FilterSeq.py length -s SRR765688.fastq -n 300 --outname S43 --log FSL.log
3FilterSeq.py quality -s S43_length-pass.fastq -q 20 --outname S43 --log FSQ.log
4MaskPrimers.py score -s S43_quality-pass.fastq -p SRR765688_MIDs.fasta \
5 --start 0 --maxerror 0.1 --mode cut --pf MID \
6 --outname S43-MID --log MPM.log
7MaskPrimers.py align -s S43-MID_primers-pass.fastq -p SRX190717_VPrimers.fasta \
8 --maxlen 50 --maxerror 0.3 --mode mask --pf VPRIMER \
9 --outname S43-FWD --log MPV.log
10MaskPrimers.py align -s S43-FWD_primers-pass.fastq -p SRX190717_CPrimers.fasta \
11 --maxlen 50 --maxerror 0.3 --mode cut --pf CPRIMER --revpr --skiprc \
12 --outname S43-REV --log MPC.log
13CollapseSeq.py -s S43-REV_primers-pass.fastq -n 20 --inner --uf MID CPRIMER \
14 --cf VPRIMER --act set --outname S43
15SplitSeq.py group -s S43_collapse-unique.fastq -f DUPCOUNT --num 2 --outname S43
16ParseHeaders.py table -s S43_atleast-2.fastq -f ID DUPCOUNT MID CPRIMER VPRIMER
17ParseLog.py -l FSL.log -f ID LENGTH
18ParseLog.py -l FSQ.log -f ID QUALITY
19ParseLog.py -l MPM.log MPV.log MPC.log -f ID PRSTART PRIMER ERROR
Quality control
The initial stage of the workflow involves two executions of the
FilterSeq.py tool. First, the length subcommand is used to filter reads
which are too short to yield full V(D)J sequences using a liberal
minimum length requirement of 300 bp (-n 300
):
2FilterSeq.py length -s SRR765688.fastq -n 300 --outname S43 --log FSL.log
Next, the quality subcommand removes sequences having a mean Phred quality
score below 20 (-q 20
).
3FilterSeq.py quality -s S43_length-pass.fastq -q 20 --outname S43 --log FSQ.log
The ParseLog.py tool is then used to extract the results from the two FilterSeq.py log files:
17ParseLog.py -l FSL.log -f ID LENGTH
To create two tab-delimited files containing the following results for each read:
Field |
Description |
---|---|
ID |
Sequence name |
LENGTH |
Sequence length |
QUALITY |
Quality score |
Sample barcode and primer identification
Annotation of sample barcodes
Following the initial filtering steps, additional filtering is performed
with three iterations of the MaskPrimers.py tool based upon the presence of
recognized sample barcode (MID), forward primer, and reverse primer sequences.
As the orientation and position of the sample barcode is known, the
first pass through MaskPrimers uses the faster score subcommand which
requires a fixed start position (--start 0
)
and a low allowable error rate (--maxerror 0.1
).
The name of the MID will be annotated in the sequence header as the MID
field
(--pf MID
):
4MaskPrimers.py score -s S43_quality-pass.fastq -p SRR765688_MIDs.fasta \
5 --start 0 --maxerror 0.1 --mode cut --pf MID \
6 --outname S43-MID --log MPM.log
Primer masking and annotation
The next MaskPrimers.py task uses the align subcommand to identify
both the start position of the V-segment primer and correct the orientation of the
sequence such that all reads are now oriented in the direction of the
V(D)J reading frame), as determined by the orientation of the V-segment primer match.
The name of the V-segment primer will be annotated in the sequence header as the
VPRIMER
field (--pf VPRIMER
):
7MaskPrimers.py align -s S43-MID_primers-pass.fastq -p SRX190717_VPrimers.fasta \
8 --maxlen 50 --maxerror 0.3 --mode mask --pf VPRIMER \
9 --outname S43-FWD --log MPV.log
The final MaskPrimers task locates the C-region primer, which is used for isotype
assignment of each read. As all sequences are assumed to have been
properly oriented by the second MaskPrimers task, the additional
arguments MaskPrimers align --revpr
and MaskPrimers align --skiprc
are added to the third execution. The MaskPrimers align --revpr
argument
informs the tool that primers sequences should be reverse complemented prior to alignment,
and that a match should be searched for (and cut from) the tail end of the sequence.
The MaskPrimers align --skiprc
argument tells the tool to align against only
the forward sequence; meaning, it will not check primer matches against the reverse complement
sequence and it will not reorient sequences. The name of the C-region primer
will be annotated in the sequence header as the CPRIMER
field
(--pf CPRIMER
):
10MaskPrimers.py align -s S43-FWD_primers-pass.fastq -p SRX190717_CPrimers.fasta \
11 --maxlen 50 --maxerror 0.3 --mode cut --pf CPRIMER --revpr --skiprc \
12 --outname S43-REV --log MPC.log
At this stage, a table of primers and alignment error rates may be generated by executing ParseLog on the log file of each MaskPrimers.py tasks:
19ParseLog.py -l MPM.log MPV.log MPC.log -f ID PRSTART PRIMER ERROR
Which will contain the following information for each log file:
Field |
Description |
---|---|
ID |
Sequence name |
PRIMER |
Primer or sample barcode name |
ERROR |
Primer match error rate |
Deduplication and filtering
Removal of duplicate sequences
The final stage of the workflow involves two filtering steps to yield
unique sequences for each sample barcode. First, the set of unique
sequences is identified using the CollapseSeq.py tool, allowing for up to
20 interior N-valued positions
(-n 20
and --inner
), and
requiring that all reads considered duplicated share the same isotype
and sample barcode tag (--uf MID CPRIMER
).
Additionally, the V-region primer annotations of the set of duplicate reads are
propagated into the annotation of each retained unique sequence
(--cf VPRIMER
and --act set
):
13CollapseSeq.py -s S43-REV_primers-pass.fastq -n 20 --inner --uf MID CPRIMER \
14 --cf VPRIMER --act set --outname S43
Filtering to repeated sequences
CollapseSeq stores the count of duplicate reads for each sequence in the
DUPCOUNT
annotation. Following duplicate removal, the data is subset
to only those unique sequence with at least two representative reads by
using the group subcommand of SplitSeq.py on the count field
(-f DUPCOUNT
) and specifying a numeric threshold
(--num 2
):
15SplitSeq.py group -s S43_collapse-unique.fastq -f DUPCOUNT --num 2 --outname S43
Creating an annotation table
Finally, the annotations, including the sample barcode (MID
), duplicate read
count (DUPCOUNT
), isotype (CPRIMER
) and V-region primer (VPRIMER
),
of the final repertoire are then extracted from the SplitSeq.py output into a
tab-delimited file using the table subcommand of ParseHeaders.py:
16ParseHeaders.py table -s S43_atleast-2.fastq -f ID DUPCOUNT MID CPRIMER VPRIMER
Note
Optionally, you may split each sample into separate files using the MID
annotation and an alternate invocation of SplitSeq.py. The group
subcommand may be used to split files on a categorical field, rather than a
numerical field, by skipping the --num
argument:
SplitSeq.py group -s M1_collapse-unique.fastq -f MID
Will split the unique sequence file into a set of separate files according the
the valud in the MID
field (-f MID
), such that
each file will contain sequences from only one sample.
Output files
The final set of sequences, which serve as input to a V(D)J reference aligner (Eg, IMGT/HighV-QUEST or IgBLAST), and tables that can be plotted for quality control are:
File |
Description |
---|---|
S43_collapse-unique.fastq |
Total unique sequences |
S43_atleast-2.fastq |
Unique sequences represented by at least 2 reads |
S43_atleast-2_headers.tab |
Annotation table of the atleast-2 file |
FSL_table.tab |
Table of the FilterSeq-length log |
FSQ_table.tab |
Table of the FilterSeq-quality log |
MPM_table.tab |
Table of the MID MaskPrimers log |
MPV_table.tab |
Table of the V-segment MaskPrimers log |
MPC_table.tab |
Table of the C-region MaskPrimers log |
A number of other intermediate and log files are generated during the workflow, which allows easy tracking/reversion of processing steps. These files are not listed in the table above.
Performance
Example performance statistics for a comparable, but larger, 454 workflow are presented below. Performance was measured on a 64-core system with 2.3GHz AMD Opteron(TM) 6276 processors and 512GB of RAM, with memory usage measured at peak utilization. The data set contained 1,346,039 raw reads, and required matching of 11 sample barcodes, 11 V-segment primers, and 5 C-region primers.
Line |
Tool |
Reads |
Cores |
MB |
Minutes |
---|---|---|---|---|---|
01 |
FilterSeq.py length |
1,346,039 |
10 |
669 |
12.2 |
02 |
FilterSeq.py quality |
1,184,905 |
10 |
619 |
11.5 |
03 |
MaskPrimers.py score |
1,184,219 |
10 |
621 |
18.5 |
04 |
MaskPrimers.py align |
1,177,245 |
10 |
618 |
263.1 |
05 |
MaskPrimers.py align |
959,920 |
10 |
563 |
122.9 |
07 |
ParseHeaders.py expand |
548,658 |
1 |
329 |
3.6 |
08 |
ParseHeaders.py rename |
548,658 |
1 |
329 |
4.1 |
09 |
CollapseSeq.py |
548,658 |
1 |
1,509 |
5.6 |