pRESTO - The REpertoire Sequencing TOolkit¶
pRESTO is a toolkit for processing raw reads from high-throughput sequencing of B cell and T cell repertoires.
Dramatic improvements in high-throughput sequencing technologies now enable large-scale characterization of lymphocyte repertoires, defined as the collection of trans-membrane antigen-receptor proteins located on the surface of B cells and T cells. The REpertoire Sequencing TOolkit (pRESTO) is composed of a suite of utilities to handle all stages of sequence processing prior to germline segment assignment. pRESTO is designed to handle either single reads or paired-end reads. It includes features for quality control, primer masking, annotation of reads with sequence embedded barcodes, generation of unique molecular identifier (UMI) consensus sequences, assembly of paired-end reads and identification of duplicate sequences. Numerous options for sequence sorting, sampling and conversion operations are also included.
Overview¶
Scope and Features¶
pRESTO performs all stages of raw sequence processing prior to alignment against reference germline sequences. The toolkit is intended to be easy to use, but some familiarity with commandline applications is expected. Rather than providing a fixed solution to a small number of common workflows, we have designed pRESTO to be as flexible as possible. This design philosophy makes pRESTO suitable for many existing protocols and adaptable to future technologies, but requires users to construct a sequence of commands and options specific to their experimental protocol.
pRESTO is composed of a set of standalone tools to perform specific tasks, often with a series of subcommands providing different behaviors. A brief description of each tool is shown in the table below.
Tool | Subcommand | Description |
---|---|---|
AlignSets | Multiple aligns sets of sequences sharing the same annotation | |
muscle | Uses the program MUSCLE to align reads | |
offset | Uses a table of primer alignments to align the 5’ region | |
table | Creates a table of primer alignments for the offset subcommand | |
AssemblePairs | Assembles paired-end reads into a complete sequence | |
align | Assembles paired-end reads by aligning the sequence ends | |
join | Concatenates pair-end reads with intervening gaps | |
reference | Assembles paired-end reads using V-segment references | |
sequential | Attempt alignment assembly followed by reference assembly | |
BuildConsensus | Constructs UMI consensus sequences | |
ClusterSets | Clusters read groups | |
all | Cluster all sequences regardless of annotation | |
barcode | Cluster reads by clustering barcode sequences | |
set | Cluster reads by sequence data within barcode groups | |
CollapseSeq | Removes duplicate sequences | |
ConvertHeaders | Converts sequence headers to the pRESTO format | |
generic | Converts sequence headers with an unknown annotation system | |
454 | Converts Roche 454 sequence headers | |
genbank | Converts NCBI GenBank and RefSeq sequence headers | |
illumina | Converts Illumina sequence headers | |
imgt | Converts sequence headers output by IMGT/GENE-DB | |
migec | Converts sequence headers output by MIGEC | |
sra | Converts NCBI SRA or EMBL-EBI ENA sequence headers | |
EstimateError | Estimates error rates for UMI data | |
FilterSeq | Removes or modifies low quality reads | |
length | Removes sequences under a defined length | |
maskqual | Masks low Phred quality score positions with Ns | |
missing | Removes sequences with a high number of Ns | |
repeats | Removes sequences with long repeats of a single nucleotide | |
quality | Removes sequences with low Phred quality scores | |
trimqual | Trims sequences to segments with high Phred quality scores | |
MaskPrimers | Identifies and removes primer regions, MIDs and UMI barcodes | |
align | Matches primers by local alignment and reorients sequences | |
score | Matches primers at a fixed user-defined start position | |
PairSeq | Sorts paired-end reads and copies annotations between them | |
ParseHeaders | Manipulates sequence annotations | |
add | Adds a field and value annotation pair to all reads | |
collapse | Compresses a set of annotation fields into a single field | |
copy | Copies values between annotations fields | |
delete | Deletes an annotation from all reads | |
expand | Expands an field with multiple values into separate annotations | |
merge | Merge multiple annotations fields into a single field | |
rename | Rename annotation fields | |
table | Outputs sequence annotations as a data table | |
ParseLog | Converts the log output of pRESTO scripts into data tables | |
SplitSeq | Performs conversion, sorting, and subsetting of sequence files | |
count | Splits files into smaller files | |
group | Splits files based on numerical or categorical annotation | |
sample | Randomly samples sequences from a file | |
samplepair | Randomly samples paired-end reads from two files | |
select | Filters sequences based on annotations | |
sort | Sorts sequences based on annotations | |
UnifyHeaders | Unifies annotation fields based on grouping scheme | |
consensus | Reassign fields to consensus values | |
delete | Delete sequences with differing field values. |
Input and Output¶
All tools take as input standard FASTA or FASTQ formatted files and output files in the same formats. This allows pRESTO to work seamlessly with other sequence processing tools that use either of these data formats; any steps within a pRESTO workflow can be exchanged for an alternate tool, if desired.
Each tool appends a specific suffix to its output files describing the step and
output. For example, MaskPrimers will append _primers-pass
to the output
file containing successfully aligned sequences and _primers-fail
to the file
containing unaligned sequences.
See also
Details regarding the suffixes used by pRESTO tools can be found in the Commandline Usage documentation for each tool.
Annotation Scheme¶
The majority of pRESTO tools manipulate and add sequences-specific annotations
as part of their processing functions using the scheme shown below. Each
annotation is delimited using a reserved character (|
by default), with the
annotation field name and values separated by a second reserved character
(=
by default), and each value within a field is separated by a third
reserved character (,
by default). These annotations follow the sequence
identifier, which itself immediately follows the >
(FASTA) or @
(FASTQ)
symbol denoting the beginning of a new sequence entry. The sequence identifier
is given the reserved field name ID
. To mitigate potential analysis
errors, each tool in pRESTO annotates sequences by appending values to existing
annotation fields when they exist, and will not overwrite or delete annotations
unless explicitly performed using the ParseHeaders tool. All reserved characters
can be redefined using the command line options.
>SEQUENCE_ID|PRIMER=IgHV-6,IgHC-M|BARCODE=DAY7|DUPCOUNT=8
NNNNCCACGATTGGTGAAGCCCTCGCAGACCCTCTCACTCACCTGTGCCATCTCCGGGGACAGTGTTTCTACCAAAA
@SEQUENCE_ID|PRIMER=IgHV-6,IgHC-M|BARCODE=DAY7|DUPCOUNT=8
NNNNCCACGATTGGTGAAGCCCTCGCAGACCCTCTCACTCACCTGTGCCATCTCCGGGGACAGTGTTTCTACCAAAA
+
!!!!nmoomllmlooj\Xlnngookkikloommononnoonnomnnlomononoojlmmkiklonooooooooomoo
See also
- Details regarding the annotations added by pRESTO tools can be found in the Commandline Usage documentation for each tool.
- The ParseHeaders tool provides a number of options for manipulating annotations in the pRESTO format.
- The ConvertHeaders tool allows you convert several common annotation schemes into the pRESTO annotation format.
Download¶
The latest stable release of pRESTO may be downloaded from PyPI or Bitbucket.
Development versions and source code are available on Bitbucket.
Installation¶
The simplest way to install the latest stable release of pRESTO is via pip:
> pip3 install presto --user
The current development build can be installed using pip and mercurial in similar fashion:
> pip3 install hg+https://bitbucket.org/kleinstein/presto#default --user
If you currently have a development version installed, then you will likely
need to add the arguments --upgrade --no-deps --force-reinstall
to the
pip3 command.
Requirements¶
- Python 3.4.0
- setuptools 2.0
- NumPy 1.8
- SciPy 0.14
- pandas 0.15
- Biopython 1.65
- AlignSets requires MUSCLE v3.8
- ClusterSets USEARCH v7.0, vsearch v2.3.2, or CD-HIT v4.6.8
- AssemblePairs-reference requires USEARCH v7.0 or BLAST+ 2.5
Linux¶
The simplest way to install all Python dependencies is to install the full SciPy stack using the instructions, then install Biopython according to its instructions.
Download the pRESTO bundle and run:
> pip3 install presto-x.y.z.tar.gz --user
Mac OS X¶
Install Xcode. Available from the Apple store or developer downloads.
Older versions Mac OS X will require you to install XQuartz 2.7.5. Available from the XQuartz project.
Install Homebrew following the installation and post-installation instructions.
Install Python 3.4.0+ and set the path to the python3 executable:
> brew install python3 > echo 'export PATH=/usr/local/bin:$PATH' >> ~/.profile
Exit and reopen the terminal application so the PATH setting takes effect.
You may, or may not, need to install gfortran (required for SciPy). Try without first, as this can take an hour to install and is not needed on newer releases. If you do need gfortran to install SciPy, you can install it using Homebrew:
> brew install gfortran
If the above fails run this instead:
> brew install --env=std gfortran
Install NumPy, SciPy, pandas and Biopyton using the Python package manager:
> pip3 install numpy scipy pandas biopython
Download pRESTO bundle, open a terminal window, change directories to download location, and run:
> pip3 install presto-x.y.z.tar.gz
Windows¶
Install Python 3.4.0+ from Python, selecting both the options ‘pip’ and ‘Add python.exe to Path’.
Install NumPy, SciPy, pandas and Biopython using the packages available from the Unofficial Windows binary collection.
Download pRESTO bundle, open a Command Prompt, change directories to the download folder, and run:
> pip install presto-x.y.z.tar.gz
For a default installation of Python 3.4, the pRESTO scripts will be installed into
C:\Python34\Scripts
and should be directly executable from the Command Prompt. If this is not the case, then follow step 5 below.Add both the
C:\Python34
andC:\Python34\Scripts
directories to your%Path%
. On Windows 7 the%Path%
setting is located under Control Panel -> System and Security -> System -> Advanced System Settings -> Environment variables -> System variables -> Path.If you have trouble with the
.py
file associations, try adding.PY
to yourPATHEXT
environment variable. Also, opening a command prompt as Administrator and run:> assoc .py=Python.File > ftype Python.File="C:\Python34\python.exe" "%1" %*
Workflows¶
Roche 454 BCR mRNA with Multiplexed Samples¶
Overview of Experimental Data¶
This example uses publicly available data from:
Lineage structure of the human antibody repertoire in response to influenza vaccination.Jiang N, He J, and Weinstein JA, et al.Sci Transl Med. 2013. 5(171):171ra19. doi:10.1126/scitranslmed.3004794.
Which may be downloaded from the NCBI Sequence Read Archive under accession ID: SRX190717. For this example, we will use the first 50,000 sequences of sample 43 (accession: SRR765688), which may downloaded downloaded using fastq-dump from the SRA Toolkit:
fastq-dump -X 50000 SRR765688
Primer and sample barcode (referred to as MID in Jiang, He and Weinstein et al, 2013) sequences are available in the published manuscript. This example assumes that the sample barcodes, forward primers (V-region), and reverse primers (C-region) have been extracted from the manuscript and placed into three corresponding FASTA files.
Read Configuration¶
Schematic of the Roche 454 read configuration. The start of each sequences is labeled with the sample barcode. The following sequence may either be in the forward (top) orientation, proceeding 5’ to 3’ in the direction of the V(D)J reading frame, or the reverse complement orientation (bottom), proceeding in the opposite direction.
Example Data¶
We have hosted a small subset of the data (Accession: SRR765688) on the pRESTO website in FASTQ format, with accompanying primer and sample barcode (MID) files. The sample data set and workflow script may be downloaded from here:
Overview of the Workflow¶
The example that follows performs all processing steps to arrive at high-quality unique sequences using this example data set. The workflow is derived into three high level tasks:
A graphical representation of the workflow along with the corresponding sequence of pRESTO commands is shown below.
Flowchart¶
Flowchart of processing steps. Each pRESTO tool is shown as a colored box. The workflow is divided into three primary tasks: (orange) quality control, (green) sample barcode and primer identification, (blue) deduplication and filtering of the repertoire. Grey boxes indicate the initial and final data files. The intermediate files output by each tool are not shown for the sake of brevity.
Commands¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 | #!/usr/bin/env bash
FilterSeq.py length -s SRR765688.fastq -n 300 --outname S43 --log FSL.log
FilterSeq.py quality -s S43_length-pass.fastq -q 20 --outname S43 --log FSQ.log
MaskPrimers.py score -s S43_quality-pass.fastq -p SRR765688_MIDs.fasta \
--start 0 --maxerror 0.1 --mode cut --pf MID \
--outname S43-MID --log MPM.log
MaskPrimers.py align -s S43-MID_primers-pass.fastq -p SRX190717_VPrimers.fasta \
--maxlen 50 --maxerror 0.3 --mode mask --pf VPRIMER \
--outname S43-FWD --log MPV.log
MaskPrimers.py align -s S43-FWD_primers-pass.fastq -p SRX190717_CPrimers.fasta \
--maxlen 50 --maxerror 0.3 --mode cut --pf CPRIMER --revpr --skiprc \
--outname S43-REV --log MPC.log
CollapseSeq.py -s S43-REV_primers-pass.fastq -n 20 --inner --uf MID CPRIMER \
--cf VPRIMER --act set --outname S43
SplitSeq.py group -s S43_collapse-unique.fastq -f DUPCOUNT --num 2 --outname S43
ParseHeaders.py table -s S43_atleast-2.fastq -f ID DUPCOUNT MID CPRIMER VPRIMER
ParseLog.py -l FSL.log -f ID LENGTH
ParseLog.py -l FSQ.log -f ID QUALITY
ParseLog.py -l MPM.log MPV.log MPC.log -f ID PRSTART PRIMER ERROR
|
Quality control¶
The initial stage of the workflow involves two executions of the
FilterSeq tool. First, the length subcommand is used to filter reads
which are too short to yield full V(D)J sequences using a liberal
minimum length requirement of 300 bp (-n 300
):
2 | FilterSeq.py length -s SRR765688.fastq -n 300 --outname S43 --log FSL.log
|
Next, the quality subcommand removes sequences having a mean Phred quality
score below 20 (-q 20
).
3 | FilterSeq.py quality -s S43_length-pass.fastq -q 20 --outname S43 --log FSQ.log
|
The ParseLog tool is then used to extract the results from the two FilterSeq log files:
17 | ParseLog.py -l FSL.log -f ID LENGTH
|
To create two tab-delimited files containing the following results for each read:
Field | Description |
---|---|
ID | Sequence name |
LENGTH | Sequence length |
QUALITY | Quality score |
Sample barcode and primer identification¶
Annotation of sample barcodes¶
Following the initial filtering steps, additional filtering is performed
with three iterations of the MaskPrimers tool based upon the presence of
recognized sample barcode (MID), forward primer, and reverse primer sequences.
As the orientation and position of the sample barcode is known, the
first pass through MaskPrimers uses the faster score subcommand which
requires a fixed start position (--start 0
)
and a low allowable error rate (--maxerror 0.1
).
The name of the MID will be annotated in the sequence header as the MID
field
(--pf MID
):
4 5 6 | MaskPrimers.py score -s S43_quality-pass.fastq -p SRR765688_MIDs.fasta \
--start 0 --maxerror 0.1 --mode cut --pf MID \
--outname S43-MID --log MPM.log
|
Primer masking and annotation¶
The next MaskPrimers task uses the align subcommand to identify
both the start position of the V-segment primer and correct the orientation of the
sequence such that all reads are now oriented in the direction of the
V(D)J reading frame), as determined by the orientation of the V-segment primer match.
The name of the V-segment primer will be annotated in the sequence header as the
VPRIMER
field (--pf VPRIMER
):
7 8 9 | MaskPrimers.py align -s S43-MID_primers-pass.fastq -p SRX190717_VPrimers.fasta \
--maxlen 50 --maxerror 0.3 --mode mask --pf VPRIMER \
--outname S43-FWD --log MPV.log
|
The final MaskPrimers task locates the C-region primer, which is used for isotype
assignment of each read. As all sequences are assumed to have been
properly oriented by the second MaskPrimers task, the additional
arguments MaskPrimers align --revpr
and MaskPrimers align --skiprc
are added to the third execution. The MaskPrimers align --revpr
argument
informs the tool that primers sequences should be reverse complemented prior to alignment,
and that a match should be searched for (and cut from) the tail end of the sequence.
The MaskPrimers align --skiprc
argument tells the tool to align against only
the forward sequence; meaning, it will not check primer matches against the reverse complement
sequence and it will not reorient sequences. The name of the C-region primer
will be annotated in the sequence header as the CPRIMER
field
(--pf CPRIMER
):
10 11 12 | MaskPrimers.py align -s S43-FWD_primers-pass.fastq -p SRX190717_CPrimers.fasta \
--maxlen 50 --maxerror 0.3 --mode cut --pf CPRIMER --revpr --skiprc \
--outname S43-REV --log MPC.log
|
At this stage, a table of primers and alignment error rates may be generated by executing ParseLog on the log file of each MaskPrimers tasks:
19 | ParseLog.py -l MPM.log MPV.log MPC.log -f ID PRSTART PRIMER ERROR
|
Which will contain the following information for each log file:
Field | Description |
---|---|
ID | Sequence name |
PRIMER | Primer or sample barcode name |
ERROR | Primer match error rate |
Deduplication and filtering¶
Removal of duplicate sequences¶
The final stage of the workflow involves two filtering steps to yield
unique sequences for each sample barcode. First, the set of unique
sequences is identified using the CollapseSeq tool, allowing for up to
20 interior N-valued positions
(-n 20
and --inner
), and
requiring that all reads considered duplicated share the same isotype
and sample barcode tag (--uf MID CPRIMER
).
Additionally, the V-region primer annotations of the set of duplicate reads are
propagated into the annotation of each retained unique sequence
(--cf VPRIMER
and --act set
):
13 14 | CollapseSeq.py -s S43-REV_primers-pass.fastq -n 20 --inner --uf MID CPRIMER \
--cf VPRIMER --act set --outname S43
|
Filtering to repeated sequences¶
CollapseSeq stores the count of duplicate reads for each sequence in the
DUPCOUNT
annotation. Following duplicate removal, the data is subset
to only those unique sequence with at least two representative reads by
using the group subcommand of SplitSeq on the count field
(-f DUPCOUNT
) and specifying a numeric threshold
(--num 2
):
15 | SplitSeq.py group -s S43_collapse-unique.fastq -f DUPCOUNT --num 2 --outname S43
|
Creating an annotation table¶
Finally, the annotations, including the sample barcode (MID
), duplicate read
count (DUPCOUNT
), isotype (CPRIMER
) and V-region primer (VPRIMER
),
of the final repertoire are then extracted from the SplitSeq output into a
tab-delimited file using the table subcommand of ParseHeaders:
16 | ParseHeaders.py table -s S43_atleast-2.fastq -f ID DUPCOUNT MID CPRIMER VPRIMER
|
Note
Optionally, you may split each sample into separate files using the MID
annotation and an alternate invocation of SplitSeq. The group
subcommand may be used to split files on a categorical field, rather than a
numerical field, by skipping the --num
argument:
SplitSeq.py group -s M1_collapse-unique.fastq -f MID
Will split the unique sequence file into a set of separate files according the
the valud in the MID
field (-f MID
), such that
each file will contain sequences from only one sample.
Output files¶
The final set of sequences, which serve as input to a V(D)J reference aligner (Eg, IMGT/HighV-QUEST or IgBLAST), and tables that can be plotted for quality control are:
File | Description |
---|---|
S43_collapse-unique.fastq | Total unique sequences |
S43_atleast-2.fastq | Unique sequences represented by at least 2 reads |
S43_atleast-2_headers.tab | Annotation table of the atleast-2 file |
FSL_table.tab | Table of the FilterSeq-length log |
FSQ_table.tab | Table of the FilterSeq-quality log |
MPM_table.tab | Table of the MID MaskPrimers log |
MPV_table.tab | Table of the V-segment MaskPrimers log |
MPC_table.tab | Table of the C-region MaskPrimers log |
A number of other intermediate and log files are generated during the workflow, which allows easy tracking/reversion of processing steps. These files are not listed in the table above.
Performance¶
Example performance statistics for a comparable, but larger, 454 workflow are presented below. Performance was measured on a 64-core system with 2.3GHz AMD Opteron(TM) 6276 processors and 512GB of RAM, with memory usage measured at peak utilization. The data set contained 1,346,039 raw reads, and required matching of 11 sample barcodes, 11 V-segment primers, and 5 C-region primers.
Line | Tool | Reads | Cores | MB | Minutes |
---|---|---|---|---|---|
01 | FilterSeq.py length | 1,346,039 | 10 | 669 | 12.2 |
02 | FilterSeq.py quality | 1,184,905 | 10 | 619 | 11.5 |
03 | MaskPrimers.py score | 1,184,219 | 10 | 621 | 18.5 |
04 | MaskPrimers.py align | 1,177,245 | 10 | 618 | 263.1 |
05 | MaskPrimers.py align | 959,920 | 10 | 563 | 122.9 |
07 | ParseHeaders.py expand | 548,658 | 1 | 329 | 3.6 |
08 | ParseHeaders.py rename | 548,658 | 1 | 329 | 4.1 |
09 | CollapseSeq.py | 548,658 | 1 | 1,509 | 5.6 |
Illumina MiSeq 2x250 BCR mRNA¶
Overview of Experimental Data¶
This example uses publicly available data from:
Quantitative assessment of the robustness of next-generation sequencing of antibody variable gene repertoires from immunized mice.Greiff, V. et al.BMC Immunol. 2014. 15(1):40. doi:10.1186/s12865-014-0040-5.
Which may be downloaded from the EBI European Nucleotide Archive under accession ID: ERP003950. For this example, we will use the first 25,000 sequences of sample Replicate-1-M1 (accession: ERR346600), which may downloaded using fastq-dump from the SRA Toolkit:
fastq-dump --split-files -X 25000 ERR346600
Primers sequences are available in additional file 1 of the publication.
Read Configuration¶
Schematic of Illumina MiSeq 2x250 stranded paired-end reads without UMI barcodes. Each 250 base-pair read was sequenced from one end of the target cDNA, so that the two reads together cover the entire variable region of the Ig heavy chain. The V(D)J reading frame proceeds from the start of read 2 to the start of read 1. Read 1 is in the opposite orientation (reverse complement), and contains the C-region primer sequence. Both reads begin with a random sequences of 4 nucleotides.
Example Data¶
We have hosted a small subset of the data (Accession: ERR346600) on the pRESTO website in FASTQ format, with accompanying primer files and an example workflow script. The sample data set and workflow script may be downloaded from here:
Overview of the Workflow¶
In the following sections, we demonstrate each step of the workflow to move from raw sequence reads to a fully annotated repertoire of complete V(D)J sequences. The workflow is divided into four high-level tasks:
A graphical representation of the workflow along with the corresponding sequence of pRESTO commands is shown below.
Flowchart¶
Flowchart of processing steps. Each pRESTO tool is shown as a colored box. The workflow is divided into three primary tasks: (orange) paired-end assembly, (green) quality control and primer annotation, and deduplication and filtering (blue). The intermediate files output by each tool are not shown for the sake of brevity.
Commands¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | #!/usr/bin/env bash
AssemblePairs.py align -1 ERR346600_2.fastq -2 ERR346600_1.fastq \
--coord sra --rc tail --outname M1 --log AP.log
FilterSeq.py quality -s M1_assemble-pass.fastq -q 20 --outname M1 --log FS.log
MaskPrimers.py score -s M1_quality-pass.fastq -p Greiff2014_VPrimers.fasta \
--start 4 --mode mask --pf VPRIMER --outname M1-FWD --log MPV.log
MaskPrimers.py score -s M1-FWD_primers-pass.fastq -p Greiff2014_CPrimers.fasta \
--start 4 --mode cut --revpr --pf CPRIMER --outname M1-REV --log MPC.log
CollapseSeq.py -s M1-REV_primers-pass.fastq -n 20 --inner --uf CPRIMER \
--cf VPRIMER --act set --outname M1
SplitSeq.py group -s M1_collapse-unique.fastq -f DUPCOUNT --num 2 --outname M1
ParseHeaders.py table -s M1_atleast-2.fastq -f ID DUPCOUNT CPRIMER VPRIMER
ParseLog.py -l AP.log -f ID LENGTH OVERLAP ERROR PVALUE
ParseLog.py -l FS.log -f ID QUALITY
ParseLog.py -l MPV.log MPC.log -f ID PRIMER ERROR
|
Paired-end assembly¶
Each set of paired-ends mate-pairs is first assembled into a full length Ig sequence using the align subcommand of the AssemblePairs tool:
2 3 | AssemblePairs.py align -1 ERR346600_2.fastq -2 ERR346600_1.fastq \
--coord sra --rc tail --outname M1 --log AP.log
|
During assembly we have defined read 2 (V-region) as the head of the sequence
(-1
) and read 1 as the tail of the sequence
(-2
). The
--coord
argument defines the format of
the sequence header so that AssemblePairs can properly identify mate-pairs;
in this case, we use --coord sra
as our headers are in the SRA/ENA format.
Note
For both the AssemblePairs and PairSeq commands using the
correct --coord
argument is critical
for matching mate-pairs. If this was raw data from Illumina, rather than
data downloaded from SRA/ENA, then the appropriate argument would be
--coord illumina
.
The ParseLog tool is then used to build a tab-delimited file of results from the AssemblePairs log file:
13 ParseLog.py -l AP.log -f ID LENGTH OVERLAP ERROR PVALUE
Which will containing the following columns:
Field | Description |
---|---|
ID | Sequence name |
LENGTH | Length of the assembled sequence |
OVERLAP | Length of the overlap between mate-pairs |
ERROR | Mismatch rate of the overlapping region |
PVALUE | P-value for the assembly |
See also
Depending on the amplicon length in your data, not all mate-pairs may overlap. For the sake of simplicity, we have excluded a demonstration of assembly in such cases. pRESTO provides a couple approaches to deal with such reads. The reference subcommand of AssemblePairs can use the ungapped V-segment reference sequences to properly space non-overlapping reads. Or, if all else fails, the join subcommand can be used to simply stick mate-pairs together end-to-end with some intervening gap.
Quality control and primer annotation¶
Removal of low quality reads¶
Quality control begins with the identification and removal of
low-quality reads using the quality subcommand of the FilterSeq tool.
In this example, reads with mean Phred quality scores less than
20 (-q 20
) are removed:
4 | FilterSeq.py quality -s M1_assemble-pass.fastq -q 20 --outname M1 --log FS.log
|
The ParseLog tool is then used to build tab-delimited file from the FilterSeq log:
14 | ParseLog.py -l FS.log -f ID QUALITY
|
Capturing the following annotations:
Field | Description |
---|---|
ID | Sequence name |
QUALITY | Quality score |
Read annotation and masking of primer regions¶
When dealing with Ig sequences, it is important to cut or mask the primers, as B cell receptors are subject to somatic hypermutation (the accumulation of point mutations in the DNA) and degenerate primer matches can look like mutations in downstream applications. The score subcommand of MaskPrimers is used to identify and remove the V-segment and C-region PCR primers for both reads:
5 6 7 8 | MaskPrimers.py score -s M1_quality-pass.fastq -p Greiff2014_VPrimers.fasta \
--start 4 --mode mask --pf VPRIMER --outname M1-FWD --log MPV.log
MaskPrimers.py score -s M1-FWD_primers-pass.fastq -p Greiff2014_CPrimers.fasta \
--start 4 --mode cut --revpr --pf CPRIMER --outname M1-REV --log MPC.log
|
In this data set the authors have added a random sequence of 4 bp to the
start of each read before the primer sequence to increase sequence diversity
and the reliability of cluster calling on the Illumina platform.
As such, both primers begin at position 4 (--start 4
),
but the C-region primer begins 4 bases from the end of the assembled read.
The addition of the --revpr
argument to the
second MaskPrimers step instructs the tool to reverse complement the primer
sequences and check the tail of the read. The two primer regions have also been treated
differently. The V-segment primer has been masked (replaced by Ns) using the
--mode mask
argument
to preserve the V(D)J length, while the C-region primer has been removed
from the sequence using the --mode cut
argument.
During each iteration of the MaskPrimers tool an
annotation field was added with the name of the primer, with the name of the field
specified by the --pf
argument, such that after both
iterations each sequence contains an annotation of the form:
VPRIMER=V-segment primer|CPRIMER=C-region primer
To summarize these steps, the ParseLog tool is used to build tab-delimited files from the two MaskPrimers logs:
15 | ParseLog.py -l MPV.log MPC.log -f ID PRIMER ERROR
|
Capturing the following annotations:
Field | Description |
---|---|
ID | Sequence name |
PRIMER | Primer name |
ERROR | Primer match error rate |
Note
This library was prepared in a stranded manner. Meaning, the read orientation is constant for all reads; read 1 is always the C-region end of the amplicon and read 2 is always the V-segment end. If your data is unstranded (50% of the reads are forward, 50% are reversed), then you must modify the first MaskPrimers step to account for this by using the align subcommand instead:
MaskPrimers.py align -s M1*quality-pass.fastq -p Greiff2014_VPrimers.fasta \
--maxlen 30 --mode mask --pf VPRIMER --log MP1.log
This will perform a slower process of locally aligning the primers, checking the reverse compliment of each read for matches, and correcting the the output sequences to the forward orientation (V to J).
Deduplication and filtering¶
Removal of duplicate sequences¶
The last stage of the workflow involves two filtering steps to yield
the final repertoire. First, the set of unique sequences is identified using the
CollapseSeq tool, allowing for up to 20 interior N-valued positions
(-n 20
and --inner
),
and requiring that all reads considered duplicates share the same C-region primer
annotation (--uf CPRIMER
). Additionally, the
V-segment primer annotations of the set of duplicate reads are propagated into the
annotation of each retained unique sequence (--cf VPRIMER
and --act set
):
9 10 | CollapseSeq.py -s M1-REV_primers-pass.fastq -n 20 --inner --uf CPRIMER \
--cf VPRIMER --act set --outname M1
|
Filtering to repeated sequences¶
CollapseSeq stores the count of duplicate reads for each sequence in the
DUPCOUNT
annotation. Following duplicate removal, the data is subset
to only those unique sequence with at least two representative reads by
using the group subcommand of SplitSeq on the count field
(-f DUPCOUNT
) and specifying a numeric threshold
(--num 2
):
11 | SplitSeq.py group -s M1_collapse-unique.fastq -f DUPCOUNT --num 2 --outname M1
|
Creating an annotation table¶
Finally, the annotations, including duplicate read count (DUPCOUNT
), isotype
(CPRIMER
) and V-segment primer (VPRIMER
), of the final repertoire are then
extracted from the SplitSeq output into a tab-delimited file using the
table subcommand of ParseHeaders:
12 | ParseHeaders.py table -s M1_atleast-2.fastq -f ID DUPCOUNT CPRIMER VPRIMER
|
Output files¶
The final set of sequences, which serve as input to a V(D)J reference aligner (Eg, IMGT/HighV-QUEST or IgBLAST), and tables that can be plotted for quality control are:
File | Description |
---|---|
M1_collapse-unique.fastq | Total unique sequences |
M1_atleast-2.fastq | Unique sequences represented by at least 2 reads |
M1_atleast-2_headers.tab | Annotation table of the atleast-2 file |
AP_table.tab | Table of the AssemblePairs log |
FS_table.tab | Table of the FilterSeq log |
MPV_table.tab | Table of the V-segment MaskPrimers log |
MPC_table.tab | Table of the C-region MaskPrimers log |
A number of other intermediate and log files are generated during the workflow, which allows easy tracking/reversion of processing steps. These files are not listed in the table above.
UMI Barcoded Illumina MiSeq 2x250 BCR mRNA¶
Overview of Experimental Data¶
This example uses publicly available data from:
B cells populating the multiple sclerosis brain mature in the draining cervical lymph nodes.Stern JNH, Yaari G, and Vander Heiden JA, et al.Sci Transl Med. 2014. 6(248):248ra107. doi:10.1126/scitranslmed.3008879.
Which may be downloaded from the NCBI Sequence Read Archive under BioProject accession ID: PRJNA248475. For this example, we will use the first 25,000 sequences of sample M12 (accession: SRR1383456), which may downloaded downloaded using fastq-dump from the SRA Toolkit:
fastq-dump --split-files -X 25000 SRR1383456
Primers sequences are available online at the supplemental website for the publication.
Read Configuration¶
Schematic of the Illumina MiSeq 2x250 paired-end reads with UMI barcodes. Each 250 base-pair read was sequenced from one end of the target cDNA, so that the two reads together cover the entire variable region of the Ig heavy chain. The V(D)J reading frame proceeds from the start of read 2 to the start of read 1. Read 1 is in the opposite orientation (reverse complement), and contains a 15 nucleotide UMI barcode preceding the C-region primer sequence.
Example Data¶
We have hosted a small subset of the data (Accession: SRR1383456) on the pRESTO website in FASTQ format with accompanying primer files. The sample data set and workflow script may be downloaded from here:
Overview of the Workflow¶
In the following sections, we demonstrate each step of the workflow to move from raw sequence reads to a fully annotated repertoire of complete V(D)J sequences. The workflow is divided into four high-level tasks:
A graphical representation of the workflow along with the corresponding sequence of pRESTO commands is shown below.
Flowchart¶
Flowchart of processing steps. Each pRESTO tool is shown as a colored box. The workflow is divided into four primary tasks: (red) quality control, UMI annotation and primer masking; (orange) generation of UMI consensus sequences; (green) paired-end assembly of UMI consensus sequences; and (blue) deduplication and filtering to obtain the high-fidelity repertoire. Grey boxes indicate the initial and final data files. The intermediate files output by each tool are not shown for the sake of brevity.
Commands¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 | #!/usr/bin/env bash
FilterSeq.py quality -s SRR1383456_1.fastq -q 20 --outname MS12_R1 --log FS1.log
FilterSeq.py quality -s SRR1383456_2.fastq -q 20 --outname MS12_R2 --log FS2.log
MaskPrimers.py score -s MS12_R1_quality-pass.fastq -p Stern2014_CPrimers.fasta \
--start 15 --mode cut --barcode --outname MS12_R1 --log MP1.log
MaskPrimers.py score -s MS12_R2_quality-pass.fastq -p Stern2014_VPrimers.fasta \
--start 0 --mode mask --outname MS12_R2 --log MP2.log
PairSeq.py -1 MS12_R1_primers-pass.fastq -2 MS12_R2_primers-pass.fastq \
--1f BARCODE --coord sra
BuildConsensus.py -s MS12_R1_primers-pass_pair-pass.fastq --bf BARCODE --pf PRIMER \
--prcons 0.6 --maxerror 0.1 --maxgap 0.5 --outname MS12_R1 --log BC1.log
BuildConsensus.py -s MS12_R2_primers-pass_pair-pass.fastq --bf BARCODE --pf PRIMER \
--maxerror 0.1 --maxgap 0.5 --outname MS12_R2 --log BC2.log
PairSeq.py -1 MS12_R1_consensus-pass.fastq -2 MS12_R2_consensus-pass.fastq \
--coord presto
AssemblePairs.py align -1 MS12_R2_consensus-pass_pair-pass.fastq \
-2 MS12_R1_consensus-pass_pair-pass.fastq --coord presto --rc tail \
--1f CONSCOUNT --2f CONSCOUNT PRCONS --outname MS12 --log AP.log
ParseHeaders.py collapse -s MS12_assemble-pass.fastq -f CONSCOUNT --act min
CollapseSeq.py -s MS12*reheader.fastq -n 20 --inner --uf PRCONS \
--cf CONSCOUNT --act sum --outname MS12
SplitSeq.py group -s MS12_collapse-unique.fastq -f CONSCOUNT --num 2 --outname MS12
ParseHeaders.py table -s MS12_atleast-2.fastq -f ID PRCONS CONSCOUNT DUPCOUNT
ParseLog.py -l FS1.log FS2.log -f ID QUALITY
ParseLog.py -l MP1.log MP2.log -f ID PRIMER BARCODE ERROR
ParseLog.py -l BC1.log BC2.log -f BARCODE SEQCOUNT CONSCOUNT PRIMER PRCONS PRCOUNT \
PRFREQ ERROR
ParseLog.py -l AP.log -f ID LENGTH OVERLAP ERROR PVALUE FIELDS1 FIELDS2
|
Quality control, UMI annotation and primer masking¶
Removal of low quality reads¶
Quality control begins with the identification and removal of
low-quality reads using the quality subcommand of the
FilterSeq tool. In this example, reads with mean Phred quality scores
less than 20 (-q 20
) are removed:
2 3 | FilterSeq.py quality -s SRR1383456_1.fastq -q 20 --outname MS12_R1 --log FS1.log
FilterSeq.py quality -s SRR1383456_2.fastq -q 20 --outname MS12_R2 --log FS2.log
|
The ParseLog tool is then used to extract results from the FilterSeq logs into tab-delimited files:
24 | ParseLog.py -l FS1.log FS2.log -f ID QUALITY
|
Extracting the following information from the log:
Field | Description |
---|---|
ID | Sequence name |
QUALITY | Quality score |
UMI annotation and masking of primer regions¶
Next, the score subcommand of MaskPrimers is
used to identify and remove the PCR primers for both reads. When
dealing with Ig sequences, it is important to cut or mask the primers,
as B cell receptors are subject to somatic hypermutation (the
accumulation of point mutations in the DNA) and degenerate primer
matches can look like mutations in downstream applications. The
MaskPrimers tool is also used to annotate each read 1 sequence
with the 15 nucleotide UMI that precedes the C-region primer
(MaskPrimers score --barcode
):
4 5 6 7 | MaskPrimers.py score -s MS12_R1_quality-pass.fastq -p Stern2014_CPrimers.fasta \
--start 15 --mode cut --barcode --outname MS12_R1 --log MP1.log
MaskPrimers.py score -s MS12_R2_quality-pass.fastq -p Stern2014_VPrimers.fasta \
--start 0 --mode mask --outname MS12_R2 --log MP2.log
|
To summarize these steps, the ParseLog tool is used to build a tab-delimited file from the MaskPrimers log:
25 | ParseLog.py -l MP1.log MP2.log -f ID PRIMER BARCODE ERROR
|
Containing the following information:
Field | Description |
---|---|
ID | Sequence name |
PRIMER | Primer name |
BARCODE | UMI sequence |
ERROR | Primer match error rate |
Note
For this data set the UMI is immediately upstream of the C-region primer.
Another common approach for UMI barcoding involves placing the UMI
immediately upstream of a 5’RACE template switch site. Modifying the
workflow is simple for this case. You just need to replace the V-segment
primers with a fasta file containing the TS sequences and move the
--barcode
argument to the
appropriate read:
MaskPrimers.py score -s R1_quality-pass.fastq -p CPrimers.fasta \
--start 0 --mode cut --outname R1 --log MP1.log
MaskPrimers.py score -s R2_quality-pass.fastq -p TSSites.fasta \
--start 17 --barcode --mode cut --maxerror 0.5 \
--outname R2 --log MP2.log
In the above we have moved the UMI annotation to read 2, increased
the allowable error rate for matching the TS site
(--maxerror 0.5
),
cut the TS site (--mode cut
),
and increased the size of the UMI from 15 to 17 nucleotides
(--start 17
).
Generation of UMI consensus sequences¶
Copying the UMI annotation across paired-end files¶
In this task, a single consensus sequence is constructed for each set of
reads annotated with the same UMI barcode. As the UMI barcode is part of
read 1, the BARCODE
annotation identified by MaskPrimers must
first be copied to the read 2 mate-pair of each read 1
sequence. Propogation of annotations between mate pairs is performed
using PairSeq which also removes
unpaired reads and ensures that paired reads are sorted in the same
order across files:
8 9 | PairSeq.py -1 MS12_R1_primers-pass.fastq -2 MS12_R2_primers-pass.fastq \
--1f BARCODE --coord sra
|
Note
For both the PairSeq and AssemblePairs commands using the
correct --coord
argument is critical
for matching mate-pairs. If this was raw data from Illumina, rather than
data downloaded from SRA/ENA, then the appropriate argument would be
--coord illumina
.
Note
If you have followed the 5’RACE modification above, then you must also
modify the first PairSeq step to copy the UMI from read 2 to read 1,
instead of vice versa (--2f BARCODE
):
PairSeq.py -1 R1_primers-pass.fastq -2 R2_primers-pass.fastq \
--2f BARCODE --coord sra
Multiple alignment of UMI read groups¶
Before generating a consensus for a set of reads sharing a UMI barcode,
the sequences must be properly aligned. Sequences may not be aligned if
more than one PCR primer is identified in a UMI read group - leading to
variations in the the start positions of the reads. Ideally, each set of
reads originating from a single mRNA molecule should be amplified with
the same primer. However, different primers in the multiplex pool may be
incorporated into the same UMI read group during amplification if the
primers are sufficiently similar. This type of primer misalignment can
be corrected using the AlignSets tool. In the example data used here,
this step was not necessary due to the aligned primer design for the 45
V-segment primers, though this does require that the V-segment primers be
masked, rather than cut, during the MaskPrimers step
(--mode mask
).
See also
If your data requires alignment, then you can create multiple aligned UMI read groups as follows:
AlignSets.py muscle -s R1_primers-pass_pair-pass.fastq --bf BARCODE \
--exec ~/bin/muscle --outname R1 --log AS1.log
AlignSets.py muscle -s R2_primers-pass_pair-pass.fastq --bf BARCODE \
--exec ~/bin/muscle --outname R2 --log AS2.log
Where the --bf BARCODE
defines the field
containing the UMI and --exec ~/bin/muscle
is the location of the MUSCLE executable.
For additional details see the section on fixing UMI alignments.
Generating UMI consensus reads¶
After alignment, a single consensus sequence is generated for each UMI barcode using BuildConsensus:
10 11 12 13 | BuildConsensus.py -s MS12_R1_primers-pass_pair-pass.fastq --bf BARCODE --pf PRIMER \
--prcons 0.6 --maxerror 0.1 --maxgap 0.5 --outname MS12_R1 --log BC1.log
BuildConsensus.py -s MS12_R2_primers-pass_pair-pass.fastq --bf BARCODE --pf PRIMER \
--maxerror 0.1 --maxgap 0.5 --outname MS12_R2 --log BC2.log
|
To correct for UMI chemistry and sequencing errors, UMI read groups having
high error statistics (mismatch rate from consensus) are removed by
specifiying the --maxerror 0.1
threshold. As the accuracy of the primer assignment in read 1 is critical
for correct isotype identification, additional filtering of read 1 is carried out
during this step. Specifying the --prcons 0.6
threshold: (a) removes individual sequences that do not share a common primer annotation with
the majority of the set, (b) removes entire read groups which have
ambiguous primer assignments, and (c) constructs a consensus primer
assignment for each UMI.
Note
The --maxgap 0.5
argument tells
BuildConsensus to use a majority rule to delete any gap positions
which occur in more than 50% of the reads. The --maxgap
argument is not really necessary for this example data set as we did not perform
a multiple alignment of the UMI read groups. However, if you have performed an
alignment, then use of --maxgap
during consensus
generation is highly recommended.
The ParseLog tool is then used to build a tab-delimited file contain the consensus results:
26 | ParseLog.py -l BC1.log BC2.log -f BARCODE SEQCOUNT CONSCOUNT PRIMER PRCONS PRCOUNT \
|
With the following annotations:
Field | Description |
---|---|
BARCODE | UMI sequence |
SEQCOUNT | Number of total reads in the UMI group |
CONSCOUNT | Number of reads used for the UMI consensus |
PRIMER | Set of primer names in the UMI group |
PRCONS | Consensus primer name |
PRCOUNT | Count of primers in the UMI group |
PRFREQ | Frequency of primers in the UMI group |
ERROR | Average mismatch rate from consensus |
Paired-end assembly of UMI consensus sequences¶
Syncronizing paired-end files¶
Following UMI consensus generation, the read 1 and read 2 files may again be out of sync due to differences in UMI read group filtering by BuildConsensus. To synchronize the reads another instance of PairSeq must be run, but without any annotation manipulation:
14 15 | PairSeq.py -1 MS12_R1_consensus-pass.fastq -2 MS12_R2_consensus-pass.fastq \
--coord presto
|
Assembling UMI consensus mate-pairs¶
Once the files have been synchronized, each paired-end UMI consensus sequence is assembled into a full length Ig sequence using the align subcommand of AssemblePairs:
16 17 18 | AssemblePairs.py align -1 MS12_R2_consensus-pass_pair-pass.fastq \
-2 MS12_R1_consensus-pass_pair-pass.fastq --coord presto --rc tail \
--1f CONSCOUNT --2f CONSCOUNT PRCONS --outname MS12 --log AP.log
|
During assembly, the consensus isotype annotation (PRCONS
) from read 1
and the number of reads used to define the consensus sequence (CONSCOUNT
)
for both reads are propagated into the annotations of the full length Ig sequence
(--1f CONSCOUNT --2f CONSCOUNT PRCONS
.
ParseLog is then uses to extract the results from the AssemblePairs log into a tab-delimited file:
27 | PRFREQ ERROR
|
Containing the following information:
Field | Description |
---|---|
ID | Sequence name (UMI) |
LENGTH | Length of the assembled sequence |
OVERLAP | Length of the overlap between mate-pairs |
ERROR | Mismatch rate of the overlapping region |
PVALUE | P-value for the assembly |
FIELDS1 | Annotations copied from read 2 into the assembled sequence |
FIELDS2 | Annotations copied from read 1 into the assembled sequence |
See also
Depending on the amplicon length in your data, not all mate-pairs may overlap. For the sake of simplicity, we have excluded a demonstration of assembly in such cases. pRESTO provides a couple approaches to deal with such reads. The reference subcommand of AssemblePairs can use the ungapped V-segment reference sequences to properly space non-overlapping reads. Or, if all else fails, the join subcommand can be used to simply stick mate-pairs together end-to-end with some intervening gap.
Deduplication and filtering¶
Combining UMI read group size annotations¶
In the final stage of the workflow, the high-fidelity Ig repertoire is
obtained by a series of filtering steps. First, the annotation
specifying the number of raw reads used to build each sequence
(-f CONSCOUNT
) is updated to be the
minimum (--act min
) of the
forward and reverse reads using the
collapse subcommand of ParseHeaders:
19 | ParseHeaders.py collapse -s MS12_assemble-pass.fastq -f CONSCOUNT --act min
|
Removal of duplicate sequences¶
Second, duplicate nucleotide sequences are removed using the CollapseSeq
tool with the requirement that duplicate sequences share the same
isotype primer (--uf PRCONS
). The duplicate removal
step also removes sequences with a high number of interior N-valued nucleotides
(-n 20
and --inner
)
and combines the read counts for each UMI read group
(--cf CONSCOUNT
and --act sum
).
20 21 | CollapseSeq.py -s MS12*reheader.fastq -n 20 --inner --uf PRCONS \
--cf CONSCOUNT --act sum --outname MS12
|
Filtering to sequences with at least two representative reads¶
Finally, unique sequences are filtered to those with at least 2
contributing sequences using the group subcommand of SplitSeq,
by splitting the file on the CONSCOUNT
annotation with a numeric threshold
(-f CONSCOUNT
and --num 2
):
22 | SplitSeq.py group -s MS12_collapse-unique.fastq -f CONSCOUNT --num 2 --outname MS12
|
Creating an annotation table¶
For further analysis, the annotations of the final repertoire are then converted to into a table using the table subcommand of ParseHeaders:
23 | ParseHeaders.py table -s MS12_atleast-2.fastq -f ID PRCONS CONSCOUNT DUPCOUNT
|
Output files¶
The final set of sequences, which serve as input to a V(D)J reference aligner (Eg, IMGT/HighV-QUEST or IgBLAST), and tables that can be plotted for quality control are:
File | Description |
---|---|
M12_collapse-unique.fastq | Total unique sequences |
M12_atleast-2.fastq | Unique sequences represented by at least 2 reads |
M12_atleast-2_headers.tab | Annotation table of the atleast-2 file |
FS1_table.tab | Table of the read 1 FilterSeq log |
FS2_table.tab | Table of the read 2 FilterSeq log |
MP1_table.tab | Table of the C-region MaskPrimers log |
MP2_table.tab | Table of the V-segment MaskPrimers log |
BC1_table.tab | Table of the read 1 BuildConsensus log |
BC2_table.tab | Table of the read 2 BuildConsensus log |
AP_table.tab | Table of the AssemblePairs log |
A number of other intermediate and log files are generated during the workflow, which allows easy tracking/reversion of processing steps. These files are not listed in the table above.
Performance¶
Example performance statistics for a comparable, but larger, MiSeq workflow are presented below. Performance was measured on a 64-core system with 2.3GHz AMD Opteron(TM) 6276 processors and 512GB of RAM, with memory usage measured at peak utilization. The data set contained 1,723,558 x 2 raw reads, and required matching of 1 C-region primer, 45 V-segment primers, and averaged 24.3 reads per UMI.
Line | Tool | Reads | Cores | MB | Minutes |
---|---|---|---|---|---|
01 | R1: FilterSeq.py quality | 1,723,558 | 10 | 1,219 | 13.0 |
02 | R2: FilterSeq.py quality | 1,723,558 | 10 | 1,219 | 12.8 |
03 | R1: MaskPrimers.py score | 1,722,116 | 10 | 1,221 | 18.9 |
04 | R2: MaskPrimers.py score | 1,684,050 | 10 | 1,221 | 46.7 |
06 | PairSeq.py | 1,665,584 | 1 | 4,734 | 25.5 |
07 | R1: BuildConsensus.py | 1,565,017 | 10 | 1,228 | 50.1 |
08 | R2: BuildConsensus.py | 1,565,017 | 10 | 1,229 | 58.6 |
11 | AssemblePairs.py align | 66,285 | 10 | 358 | 3.0 |
13 | ParseHeaders.py collapse | 56,104 | 1 | 88 | 0.4 |
14 | CollapseSeq.py | 55,480 | 1 | 822 | 0.7 |
15 | SplitSeq.py group | 51,047 | 1 | 88 | 0.2 |
UMI Barcoded Illumina MiSeq 325+275 paired-end 5’RACE BCR mRNA¶
Overview of Experimental Data¶
This example uses publicly available data from:
Dysregulation of B Cell Repertoire Formation in Myasthenia Gravis Patients Revealed through Deep Sequencing.Vander Heiden JA, et al.J Immunol. 2017 198(4):1460-1473. doi:10.4049/jimmunol.1601415.
Which may be downloaded from the NCBI Sequence Read Archive under BioProject accession ID: PRJNA248475. For this example, we will use the first 25,000 sequences of sample HD09_N_AB8KB (accession: SRR4026043), which may downloaded downloaded using fastq-dump from the SRA Toolkit:
fastq-dump --split-files -X 25000 SRR4026043
Primers sequences are available online from the protocols/AbSeq directory in the Immcantation repository.
Read Configuration¶
Schematic of the Illumina MiSeq 325+275 paired-end reads with UMI barcodes. Each read was sequenced from one end of the target cDNA so that the two reads together cover the entire variable region of the Ig heavy chain. Sequencing was performed using an uneven number of cycles for read 1 and read 2 using a 2x300 kit. The V(D)J reading frame proceeds from the start of read 2 to the start of read 1. Read 1 is in the opposite orientation (reverse complement), contains a partial C-region, and is 325 nucletoides in length. Read 2 contains the 5’RACE template switch site with a 17 nucleotide UMI barcode preceding it, and is 275 nucleotides in length.
Example Data¶
We have hosted a small subset of the data (Accession: SRR1383456) on the pRESTO website in FASTQ format with accompanying primer files. The sample data set and workflow script may be downloaded from here:
Overview of the Workflow¶
In the following sections, we demonstrate each step of the workflow to move from raw sequence reads to a fully annotated repertoire of complete V(D)J sequences. The workflow is divided into four high-level tasks:
A graphical representation of the workflow along with the corresponding sequence of pRESTO commands is shown below.
Flowchart¶
Flowchart of processing steps. Each pRESTO tool is shown as a colored box. The workflow is divided into four primary tasks: (red) quality control, UMI annotation and primer masking; (orange) generation of UMI consensus sequences; (green) paired-end assembly of UMI consensus sequences; and (blue) deduplication and filtering to obtain the high-fidelity repertoire. Grey boxes indicate the initial and final data files. The intermediate files output by each tool are not shown for the sake of brevity.
Commands¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 | #!/usr/bin/env bash
FilterSeq.py quality -s SRR4026043_1.fastq -q 20 --outname HD09N-R1 --log FS1.log
FilterSeq.py quality -s SRR4026043_2.fastq -q 20 --outname HD09N-R2 --log FS2.log
MaskPrimers.py score -s HD09N-R1_quality-pass.fastq -p AbSeq_R1_Human_IG_Primers.fasta \
--start 0 --mode cut --outname HD09N-R1 --log MP1.log
MaskPrimers.py score -s HD09N-R2_quality-pass.fastq -p AbSeq_R2_TS.fasta \
--start 17 --barcode --mode cut --maxerror 0.5 --outname HD09N-R2 --log MP2.log
PairSeq.py -1 HD09N-R1_primers-pass.fastq -2 HD09N-R2_primers-pass.fastq \
--2f BARCODE --coord sra
BuildConsensus.py -s HD09N-R1_primers-pass_pair-pass.fastq --bf BARCODE --pf PRIMER \
--prcons 0.6 --maxerror 0.1 --maxgap 0.5 --outname HD09N-R1 --log BC1.log
BuildConsensus.py -s HD09N-R2_primers-pass_pair-pass.fastq --bf BARCODE \
--maxerror 0.1 --maxgap 0.5 --outname HD09N-R2 --log BC2.log
PairSeq.py -1 HD09N-R1_consensus-pass.fastq -2 HD09N-R2_consensus-pass.fastq \
--coord presto
AssemblePairs.py sequential -1 HD09N-R2_consensus-pass_pair-pass.fastq \
-2 HD09N-R1_consensus-pass_pair-pass.fastq -r IMGT_Human_IG_V.fasta \
--coord presto --rc tail --scanrev --1f CONSCOUNT --2f CONSCOUNT PRCONS \
--aligner blastn --outname HD09N-C --log AP.log
MaskPrimers.py align -s HD09N-C_assemble-pass.fastq \
-p AbSeq_Human_IG_InternalCRegion.fasta --maxlen 100 --maxerror 0.3 \
--mode tag --revpr --skiprc --pf CREGION --outname HD09N-C --log MP3.log
ParseHeaders.py collapse -s HD09N-C_primers-pass.fastq -f CONSCOUNT --act min
CollapseSeq.py -s HD09N-C_primers-pass_reheader.fastq -n 20 --inner \
--uf CREGION --cf CONSCOUNT --act sum --outname HD09N-C
SplitSeq.py group -s HD09N-C_collapse-unique.fastq \
-f CONSCOUNT --num 2 --outname HD09N-C
ParseHeaders.py table -s HD09N-C_atleast-2.fastq -f ID CREGION CONSCOUNT DUPCOUNT
ParseLog.py -l FS1.log FS2.log -f ID QUALITY
ParseLog.py -l MP1.log MP2.log MP3.log -f ID PRIMER BARCODE ERROR
ParseLog.py -l BC1.log BC2.log -f BARCODE SEQCOUNT CONSCOUNT PRCONS PRFREQ ERROR
ParseLog.py -l AP.log -f ID REFID LENGTH OVERLAP GAP ERROR IDENTITY
|
Quality control, UMI annotation and primer masking¶
Removal of low quality reads¶
Quality control begins with the identification and removal of
low-quality reads using the quality subcommand of the
FilterSeq tool. In this example, reads with mean Phred quality scores
less than 20 (-q 20
) are removed:
2 3 | FilterSeq.py quality -s SRR4026043_1.fastq -q 20 --outname HD09N-R1 --log FS1.log
FilterSeq.py quality -s SRR4026043_2.fastq -q 20 --outname HD09N-R2 --log FS2.log
|
The ParseLog tool is then used to extract results from the FilterSeq logs into tab-delimited files:
29 | ParseLog.py -l FS1.log FS2.log -f ID QUALITY
|
Extracting the following information from the log:
Field | Description |
---|---|
ID | Sequence name |
QUALITY | Quality score |
UMI annotation and removal of primer regions¶
Next, the score subcommand of MaskPrimers is
used to identify and remove the PCR primers for both reads. When
dealing with Ig sequences, it is important to cut or mask the primers,
as B cell receptors are subject to somatic hypermutation (the
accumulation of point mutations in the DNA) and degenerate primer
matches can look like mutations in downstream applications. The
MaskPrimers tool is also used to annotate each read 2 sequence
with the 17 nucleotide UMI that precedes the template switch site.
(MaskPrimers score --barcode
):
4 5 6 7 | MaskPrimers.py score -s HD09N-R1_quality-pass.fastq -p AbSeq_R1_Human_IG_Primers.fasta \
--start 0 --mode cut --outname HD09N-R1 --log MP1.log
MaskPrimers.py score -s HD09N-R2_quality-pass.fastq -p AbSeq_R2_TS.fasta \
--start 17 --barcode --mode cut --maxerror 0.5 --outname HD09N-R2 --log MP2.log
|
To summarize these steps, the ParseLog tool is used to build a tab-delimited file from the MaskPrimers log:
30 | ParseLog.py -l MP1.log MP2.log MP3.log -f ID PRIMER BARCODE ERROR
|
Containing the following information:
Field | Description |
---|---|
ID | Sequence name |
PRIMER | Primer name |
BARCODE | UMI sequence |
ERROR | Primer match error rate |
Note
For this data, we are using the 5’RACE template switch sequences
as proxy primers. We’ve set the match error rate extremely high
(--maxerror 0.5
)
because an accurate match isn’t important. Mostly, we are just concerned
with extracting the UMI barcode that precedes the template switch site.
Generation of UMI consensus sequences¶
Copying the UMI annotation across paired-end files¶
In this task, a single consensus sequence is constructed for each set of
reads annotated with the same UMI barcode. As the UMI barcode is part of
read 2, the BARCODE
annotation identified by MaskPrimers must
first be copied to the read 1 mate-pair of each read 2
sequence. Propogation of annotations between mate pairs is performed
using PairSeq which also removes
unpaired reads and ensures that paired reads are sorted in the same
order across files:
8 9 | PairSeq.py -1 HD09N-R1_primers-pass.fastq -2 HD09N-R2_primers-pass.fastq \
--2f BARCODE --coord sra
|
Note
For both the PairSeq and AssemblePairs commands using the
correct --coord
argument is critical
for matching mate-pairs. If this was raw data from Illumina, rather than
data downloaded from SRA, then the appropriate argument would be
--coord illumina
.
Multiple alignment of UMI read groups¶
Before generating a consensus for a set of reads sharing a UMI barcode, the sequences must be properly aligned. Sequences may not be aligned if more than one PCR primer is identified in a UMI read group - leading to variations in the the start positions of the reads. Ideally, each set of reads originating from a single mRNA molecule should be amplified with the same primer. However, different primers in the multiplex pool may be incorporated into the same UMI read group during amplification if the primers are sufficiently similar. This type of primer misalignment can be corrected using the AlignSets tool.
This step is not required for this data, but if necessary for your data see the section on fixing UMI problems.
Generating UMI consensus reads¶
After alignment, a single consensus sequence is generated for each UMI barcode using BuildConsensus:
10 11 12 13 | BuildConsensus.py -s HD09N-R1_primers-pass_pair-pass.fastq --bf BARCODE --pf PRIMER \
--prcons 0.6 --maxerror 0.1 --maxgap 0.5 --outname HD09N-R1 --log BC1.log
BuildConsensus.py -s HD09N-R2_primers-pass_pair-pass.fastq --bf BARCODE \
--maxerror 0.1 --maxgap 0.5 --outname HD09N-R2 --log BC2.log
|
To correct for UMI chemistry and sequencing errors, UMI read groups having
high error statistics (mismatch rate from consensus) are removed by
specifiying the --maxerror 0.1
threshold. Additional filtering of read 1 is carried out
during this step by specifying the --prcons 0.6
threshold which: (a) removes individual sequences that do not share a common primer annotation with
the majority of the set, (b) removes entire read groups which have
ambiguous primer assignments, and (c) constructs a consensus primer
assignment for each UMI.
Note
The --maxgap 0.5
argument tells
BuildConsensus to use a majority rule to delete any gap positions
which occur in more than 50% of the reads. The --maxgap
argument is not really necessary for this example data set as we did not perform
a multiple alignment of the UMI read groups. However, if you have performed an
alignment, then use of --maxgap
during consensus
generation is highly recommended.
The ParseLog tool is then used to build a tab-delimited file contain the consensus results:
31 | ParseLog.py -l BC1.log BC2.log -f BARCODE SEQCOUNT CONSCOUNT PRCONS PRFREQ ERROR
|
With the following annotations:
Field | Description |
---|---|
BARCODE | UMI sequence |
SEQCOUNT | Number of total reads in the UMI group |
CONSCOUNT | Number of reads used for the UMI consensus |
PRCONS | Consensus primer name |
PRFREQ | Frequency of primers in the UMI group |
ERROR | Average mismatch rate from consensus |
Paired-end assembly of UMI consensus sequences¶
Syncronizing paired-end files¶
Following UMI consensus generation, the read 1 and read 2 files may again be out of sync due to differences in UMI read group filtering by BuildConsensus. To synchronize the reads another instance of PairSeq must be run, but without any annotation manipulation:
14 15 | PairSeq.py -1 HD09N-R1_consensus-pass.fastq -2 HD09N-R2_consensus-pass.fastq \
--coord presto
|
Assembling UMI consensus mate-pairs¶
Once the files have been synchronized, each paired-end UMI consensus
sequence is assembled into a full length Ig sequence using the
sequential subcommand of AssemblePairs.
5’RACE creates long amplicons which are not guaranteed to overlap when sequenced
using a 2x300 kit. The sequential subcommand will first attempt de
novo paired-end assembly. If that approach fails, it will attempt assembly guided
by the V segment reference sequences specified by the
-r
argument. Mate-pairs that fail to
overlap can thus be assembled, with the full length sequence containing an
appropriate number of intervening between the two reads.
16 17 18 19 | AssemblePairs.py sequential -1 HD09N-R2_consensus-pass_pair-pass.fastq \
-2 HD09N-R1_consensus-pass_pair-pass.fastq -r IMGT_Human_IG_V.fasta \
--coord presto --rc tail --scanrev --1f CONSCOUNT --2f CONSCOUNT PRCONS \
--aligner blastn --outname HD09N-C --log AP.log
|
During assembly, the consensus primer annotation (PRCONS
) from read 1
and the number of reads used to define the consensus sequence (CONSCOUNT
)
for both reads are propagated into the annotations of the full length Ig sequence
(--1f CONSCOUNT --2f CONSCOUNT PRCONS
.
ParseLog is then uses to extract the results from the AssemblePairs log into a tab-delimited file:
32 | ParseLog.py -l AP.log -f ID REFID LENGTH OVERLAP GAP ERROR IDENTITY
|
Containing the following information:
ID REFID LENGTH OVERLAP GAP ERROR IDENTITY
Field | Description |
---|---|
ID | Sequence name (UMI) |
REFID | The reference sequence name for reference guided assembly |
LENGTH | Length of the assembled sequence |
OVERLAP | Length of the overlap between mate-pairs |
GAP | Length of the gap between mate-pairs |
ERROR | Mismatch rate of the overlapping region from de novo assembly |
IDENTITY | Identity score for the aligned region in reference guided assembly |
Deduplication and filtering¶
Annotating sequences with the constant region¶
In the final stage of the workflow, the high-fidelity Ig repertoire is obtained by a series of filtering steps. First, an additional annotation defining the constant region is added to each sequence by aligning against the constant region sequence internal to the primer. This step is not strictly necessary, as the constant region primer is usually sufficient for isotype calls. However, primer calls may be inaccurate with some primer sets, so this additional confirmatory step is beneficial.
Internal constant regions are assigned using the align subcommand
of MaskPrimers with a set of manually constructed sequences as the
“primers” (-p AbSeq_Human_IG_InternalCRegion.fasta
):
20 21 22 | MaskPrimers.py align -s HD09N-C_assemble-pass.fastq \
-p AbSeq_Human_IG_InternalCRegion.fasta --maxlen 100 --maxerror 0.3 \
--mode tag --revpr --skiprc --pf CREGION --outname HD09N-C --log MP3.log
|
Here we use a high error rate for the match
(--maxerror 0.3
), restrict the match
to 100 nucleotides at the end of the sequence
(--maxlen 100 --revpr --skiprc
), and
use mode the argument --mode tag
to specify
that the input sequence should not be modified. The aligned C-region name
will be added to each sequence headers in the CREGION
field
(--pf CREGION
):
Combining UMI read group size annotations¶
The annotation specifying the number of raw reads used to build each sequence
(-f CONSCOUNT
) is updated to be the
minimum (--act min
) of the
forward and reverse reads using the
collapse subcommand of ParseHeaders:
23 | ParseHeaders.py collapse -s HD09N-C_primers-pass.fastq -f CONSCOUNT --act min
|
Removal of duplicate sequences¶
Second, duplicate nucleotide sequences are removed using the CollapseSeq
tool with the requirement that duplicate sequences share the same
constant region annotation (--uf CREGION
). The duplicate removal
step also removes sequences with a high number of interior N-valued nucleotides
(-n 20
and --inner
)
and combines the read counts for each UMI read group
(--cf CONSCOUNT
and --act sum
).
24 25 | CollapseSeq.py -s HD09N-C_primers-pass_reheader.fastq -n 20 --inner \
--uf CREGION --cf CONSCOUNT --act sum --outname HD09N-C
|
Filtering to sequences with at least two representative reads¶
Finally, unique sequences are filtered to those with at least 2
contributing sequences using the group subcommand of SplitSeq,
by splitting the file on the CONSCOUNT
annotation with a numeric threshold
(-f CONSCOUNT
and --num 2
):
26 27 | SplitSeq.py group -s HD09N-C_collapse-unique.fastq \
-f CONSCOUNT --num 2 --outname HD09N-C
|
Creating an annotation table¶
For further analysis, the annotations of the final repertoire are then converted to into a table using the table subcommand of ParseHeaders:
28 | ParseHeaders.py table -s HD09N-C_atleast-2.fastq -f ID CREGION CONSCOUNT DUPCOUNT
|
Output files¶
The final set of sequences, which serve as input to a V(D)J reference aligner (Eg, IMGT/HighV-QUEST or IgBLAST), and tables that can be plotted for quality control are:
File | Description |
---|---|
HD09N-C_collapse-unique.fastq | Total unique sequences |
HD09N-C_atleast-2.fastq | Unique sequences represented by at least 2 reads |
HD09N-C_atleast-2_headers.tab | Annotation table of the atleast-2 file |
FS1_table.tab | Table of the read 1 FilterSeq log |
FS2_table.tab | Table of the read 2 FilterSeq log |
MP1_table.tab | Table of the C-region primer MaskPrimers log |
MP2_table.tab | Table of the TS site MaskPrimers log |
MP3_table.tab | Table of the internal C-region MaskPrimers log |
BC1_table.tab | Table of the read 1 BuildConsensus log |
BC2_table.tab | Table of the read 2 BuildConsensus log |
AP_table.tab | Table of the AssemblePairs log |
A number of other intermediate and log files are generated during the workflow, which allows easy tracking/reversion of processing steps. These files are not listed in the table above.
Fixing UMI Problems¶
Correcting misaligned V-segment primers and indels in UMI groups¶
Before generating a consensus for a set of reads sharing a UMI barcode, the sequences must be properly aligned. Sequences may not be aligned if more than one PCR primer is identified in a UMI read group - leading to variations in the the start positions of the reads. Ideally, each set of reads originating from a single mRNA molecule should be amplified with the same primer. However, different primers in the multiplex pool may be incorporated into the same UMI read group during amplification if the primers are sufficiently similar.
Correction of misaligned sequences. (A) Discrepancies in the location of primer binding (colored bases, with primer name indicated to the left) may cause misalignment of sequences sharing a UMI. (B) Following multiple alignment of the reads the non-primer regions are correctly aligned and suitable for UMI consensus generation.
This type of primer misalignment can be corrected using one of two approaches using the AlignSets tool.
Correcting via multiple alignment¶
The first approach, which is conceptually simpler but computationally more expensive,
is to perform a full multiple alignment of reach UMI read group using the
muscle subcommand of AlignSets. The
--bf BARCODE
argument tells AlignSets to
multiple align reads sharing the same BARCODE
annotation.
The --exec ~/bin/muscle
is a pointer to
where the MUSCLE executable is located:
AlignSets.py muscle -s reads.fastq --bf BARCODE --exec ~/bin/muscle
The above approach will also insert gaps into the sequences where an
insertion/deletion has occured in the reads. As such, you will need to provide
as reasonable gap character threshold to BuildConsensus, such as
--maxgap 0.5
, defining how you want to handle
positions with gap characters when generating a UMI consensus sequence.
Note
Using the muscle subcommand, along with the
--maxgap
argument to BuildConsensus
will also address issue with insertions/deletions in UMI read groups.
Though, in UMI read groups with a sufficient number of reads consensus generation
will resolve insertions/deletions without the need for multiple alignment,
as any misaligned reads will simply be washed out by the majority.
Whether to perform a multiple alignment prior to consensus generation is a
matter of taste. A multiple alignment may improve consensus quality in
small UMI read groups (eg, less than 4 sequences), but the extent to which
small UMI read groups should be trusted is debatable.
Correcting via an offset table¶
The second approach will correct only the primer regions and will not address insertions/deletions within the sequence, but is much quicker to perform. The first step involves creation of a primer offset table using the table subcommand of AlignSets:
AlignSets.py table -p primers.fasta --exec ~/bin/muscle
Which performs a multiple alignment on sequences in primers.fasta
(sequences shown in the primer alignment figure above)
to generate a file containing a primer offset table:
primers_offsets-forward.tab
¶VP1 2
VP2 0
VP3 1
Then the offset table can be input into the offset subcommand of AlignSets to align the reads:
AlignSets.py offset -s reads.fastq -d primers_offsets-forward.tab \
--bf BARCODE --pr VPRIMER --mode pad
In the above command we have specified the field containing the primer annotation
using --pf VPRIMER
and set the behavior
of the tool to add gap characters to align the reads with the
--mode pad
argument.
These options will generate the correction shown in (B) of the
primer alignment figure above. Alternatively,
we could have deleted unalign positions using the argument
--mode cut
.
Note
You may need to alter how the offset table is generated if you have used the
--mode cut
argument to MaskPrimers
rather than --mode mask
, as this will
cause the ends of the primer regions, rather than the front, to be the
cause of the ragged edges within the UMI read groups. For primers that
have been cut you would add the --reverse
argument to the table operation of AlignSets, which will
create an offset table that is based on the tail end of the primers.
Dealing with insufficient UMI diversity¶
Due to errors in the UMI region and/or insufficient UMI length, UMI read groups
are not always homogeneous with respect to the mRNA of origin. This can cause
difficulties in generating a valid UMI consensus sequence. In most cases,
the --prcons
and
--maxerror
(or --maxdiv
) arguments to BuildConsensus are
sufficient to filter out invalid reads and/or entire invalid UMI groups. However, if
there is significant nucleotide diversity within UMI groups due to insufficient
UMI length or low UMI diversity, the set command of the ClusterSets
tool can help correct for this. ClusterSets will cluster sequence by
similarity and add an additional annotation dividing sequences within a UMI read group
into sub-clusters:
ClusterSets.py set -s reads.fastq -f BARCODE -k CLUSTER --exec ~/bin/usearch
The above command will add an annotation to each sequence named CLUSTER
(-k CLUSTER
) containing a cluster identifier
for each sequence within the UMI barcode group.
The -f BARCODE
argument specifies the UMI annotation and
--exec ~/bin/usearch
is a pointer to
where the USEARCH executable is located. After
assigning cluster annotations via ClusterSets, the BARCODE
and CLUSTER
fields can be merged using the copy operation of ParseHeaders:
ParseHeaders.py copy -s reads_cluster-pass.fastq -f BARCODE -k CLUSTER --act cat
Which will copy the UMI annotation (-f BARCODE
) into
the cluster annotation (-k CLUSTER
) and concatenate
them together (--act cat
). Thus converting the
annotations from:
>SEQ1|BARCODE=ATGTCG|CLUSTER=1
>SEQ2|BARCODE=ATGTCG|CLUSTER=2
To:
>SEQ1|BARCODE=ATGTCG|CLUSTER=1ATGTCG
>SEQ2|BARCODE=ATGTCG|CLUSTER=2ATGTCG
You may then specify --bf CLUSTER
to
BuildConsensus to tell it to generate UMI consensus sequences by
UMI sub-cluster, rather than by UMI barcode annotation.
Combining split UMIs¶
Typically, a UMI barcode is attached to only one end of a paired-end mate-pair and can be copied to other read by a simple invocation of PairSeq. But in some cases, the UMI may be split such that there are two UMIs, each located on a different mate-pair. To deal with these sorts of UMIs, you would first employ PairSeq similarly to how you would in the single UMI case:
PairSeq.py -1 reads-1.fastq -2 reads-2.fastq --1f BARCODE --2f BARCODE \
--coord illumina
The main difference from the single UMI case is that the BARCODE
annotation is
being simultaneously copied from read 1 to read 2 (--1f BARCODE
)
andf rom read 2 to read 1 (--2f BARCODE
). This creates
a set of annotations that look like:
>READ1|BARCODE=ATGTCGTT,GGCTAGTC
>READ2|BARCODE=ATGTCGTT,GGCTAGTC
Alternatively, these annotations can be combined upon copy using the
--act cat
argument:
PairSeq.py -1 reads-1.fastq -2 reads-2.fastq --1f BARCODE --2f BARCODE \
--coord illumina --act cat
Which concatenates the two values in the BARCODE
field,
yielding UMI annotations suitable for input to BuildConsensus:
>READ1|BARCODE=ATGTCGTTGGCTAGTC
>READ2|BARCODE=ATGTCGTTGGCTAGTC
Compensating for errors in the UMI region¶
Depending on the protocol used for library preparation, PCR error and sequencing error can significantly affect UMI and sequence assignments. To account for this error, the following approach can be used.
Clustering UMI sequences¶
First, errors in the UMI region can be accounted for by reassigning UMI
groups to new clusters of UMI sequence that may differ by one or more nucleotides.
To identify the ideal threshold at which to cluster similar UMI sequences,
EstimateError can be run on the UMI field (BARCODE
):
EstimateError.py barcode -s reads.fastq -f BARCODE
The -f BARCODE
defines the header annotation
containing the UMI sequence. This outputs the following tables:
File | Error profile |
---|---|
reads_distance-barcode.tab | Distribution of pairwise hamming distances |
reads_threshold-barcode.tab | Recommended threshold |
The value in the THRESHOLD
column associated with the ALL
row in
reads_threshold-barcode.tab
specifies a recommended threshold for clustering
the UMI sequences.
Note
Subsampling at a depth to approximately 5,000 sequences is recommended to expedite this calculation. See the random task for an example of how to use SplitSeq to subsample sequence files.
The table specifies a threshold of 0.9
which will be used to cluster
the UMI sequences via ClusterSets. The identity threshold is set
via the argument --ident 0.9
.
Clustering will be performed on the sequences in the UMI annotation field
(-f BARCODE
) and UMI clusters will
assigned to the annotation field INDEX_UMI
via the argument
-k INDEX_UMI
:
ClusterSets.py barcode -s reads.fastq -f BARCODE -k INDEX_UMI --ident 0.9
Clustering V(D)J sequences¶
Next, sequences within these larger UMI clusters are clustered to avoid
sequence collisions. Again, EstimateError is used to infer a clustering
threshold, but instead of clustering UMI sequences the set subcommand
is used to cluster the reads (V(D)J sequences) within the newly assigned UMI
clusters (-f INDEX_UMI
):
EstimateError.py set -s reads_cluster-pass.fastq -f INDEX_UMI
This outputs the following tables:
File | Error profile |
---|---|
reads_cluster-pass_distance-set.tab | Distribution of pairwise hamming distances |
reads_cluster-pass_threshold-set.tab | Recommended threshold |
The value in the THRESHOLD
column associated with the ALL
row in
reads_cluster-pass_threshold-set.tab
specifies a recommended threshold for
resolving collisions.
Note
Subsampling at a depth to approximately 5,000 sequences is recommended to expedite this calculation. See the random task for an example of how to use SplitSeq to subsample sequence files.
Using a recommended threshold of 0.8
, V(D)J sequences are clustering in a
similar way using the set subcommand of ClusterSets:
ClusterSets.py set -s reads_cluster-pass.fastq -f INDEX_UMI -k INDEX_SEQ --ident 0.8
Where the argument --ident 0.8
specifies the clustering
threshold, -f INDEX_UMI
defines the UMI cluster group to
cluster within, and -k INDEX_SEQ
defines the V(D)J sequence
cluster annotation to add to the output headers.
Combining the UMI and V(D)J cluster annotations¶
Finally, new UMI groups can be generated by combining the two annotation fields
generated during the clustering steps with the merge subcommand of
ParseHeaders. The -f INDEX_UMI INDEX_SEQ
argument defines the fields to combine and the
-k INDEX_MERGE
argument defines the new field
that will contain the corrected UMI clusters used for consensus generation:
ParseHeaders.py merge -s reads_cluster-pass_cluster-pass.fastq -f INDEX_UMI INDEX_SEQ \
-k INDEX_MERGE
This combined UMI-V(D)J sequence cluster annotation can then be specified as the
barcode field to BuildConsensus using the
--bf INDEX_MERGE
argument.
Manipulating Annotations¶
The ParseHeaders tool provides a collection of methods for performing simple manipulations of sequence headers that are format in the pRESTO annotation scheme.
For converting sequence headers into the pRESTO format see the Import Task documentation.
Adding a sample annotation¶
Addition of annotation values is accomplished using the add subcommand of ParseHeaders:
ParseHeaders.py add -s reads.fastq -f SAMPLE -u A1
Which will add the annotation SAMPLE=A1
to each sequence of the input file.
Expanding and renaming annotations¶
By default, pRESTO will not delete annotations. If a sequence header already
contains an annotation that a tool is trying to add it will not overwrite that
annotation. Instead, it will append the annotation value to the values already
present in a comma delimited form. For example, after two interations of
MaskPrimers with the default primer field name (PRIMER
), you will have
an annotation in the following form, reflecting a match against primer VH3
in
the first iteration and primer IGHM
in the second:
PRIMER=VH3,IGHM
Separating these annotations into two annotations is accomplished via the expand subcommand of ParseHeaders:
ParseHeaders.py expand -s reads.fastq -f PRIMER
Resulting in the annotations:
PRIMER1=VH3|PRIMER2=IGHM
Which may then be renamed via the rename subcommand: expand subcommand of ParseHeaders:
ParseHeaders.py rename -s reads_reheader.fastq -f PRIMER1 PRIMER2 \
-k VPRIMER CPRIMER
Copying, merging and collapsing annotations¶
Nested annotation can be generated using the copy or merge subcommands of ParseHeaders. The examples that follow will use the starting annotation:
UMI=ATGC|CELL=GGCC|COUNT=10,2
The UMI
and CELL
annotations can be combined into a single INDEX
annotation using the following command:
ParseHeaders.py merge -s reads.fasta -f UMI CELL -k INDEX --delete
# result> COUNT=10,2|INDEX=ATGC,GGCC
Without the --delete
argument the
original UMI
and CELL
annotations would be kept in the header.
The nested annotation values can then be combined using the collapse subcommand to create various effects:
ParseHeaders.py collapse -s reads_reheader.fasta -f INDEX --act cat
# result> INDEX=ATGCGGCC
ParseHeaders.py collapse -s reads_reheader.fasta -f INDEX --act first
# result> INDEX=ATGC
ParseHeaders.py collapse -s reads_reheader.fasta -f COUNT --act sum
# result> COUNT=12
ParseHeaders.py collapse -s reads_reheader.fasta -f COUNT --act min
# result> COUNT=2
Where the --act
argument specifies
the type of collapse action to perform.
The copy subcommand is normally used to create duplicate annotations with different names, but will have a similar effect to the merge subcommand when the target is an existing field:
ParseHeaders.py copy -s reads.fasta -f UMI -k CELL
# result> UMI=ATGC|CELL=GGCC,ATGC|COUNT=10,2
Both the copy and merge subcommands have an
--act
argument which allows
you to perform an action from the collapse subcommand
in the same step as the copy or merge:
ParseHeaders.py merge -s reads.fasta -f UMI CELL -k INDEX --delete --act cat
# result> COUNT=10,2|INDEX=ATGCGGCC
ParseHeaders.py copy -s reads.fasta -f UMI -k CELL --act cat
# result> UMI=ATGC|CELL=GGCCATGC|COUNT=10,2
Deleting annotations¶
Unwanted annotations can be deleted using the delete subcommand of ParseHeaders:
ParseHeaders.py delete -s reads.fastq -f PRIMER
Which will remove the PRIMER
field from each sequence header.
Miscellaneous Tasks¶
Importing data from SRA, ENA or GenBank into pRESTO¶
If you have download a data set from GenBank, SRA or ENA the format of the sequences headers are different from the raw Roche 454 and Illumina header format. As such, they may or may not be compatible with pRESTO, depending on how the headers have been modified by the sequence archive. The ConvertHeaders allow you to change incompatible header formats into the pRESTO format. For example, to convert from SRA or ENA headers the sra subcommand would be used:
ConvertHeaders.py sra -s reads.fastq
ConvertHeaders provides the following conversion subcommands:
Subcommand | Formats Converted |
---|---|
generic | Headers with an unknown annotation system |
454 | Roche 454 |
genbank | NCBI GenBank and RefSeq |
illumina | Illumina HiSeq or MiSeq |
imgt | IMGT/GENE-DB |
sra | NCBI SRA or EMBL-EBI ENA |
Reducing file size for submission to IMGT/HighV-QUEST¶
IMGT/HighV-QUEST currently limits the size of uploaded files to 500,000 sequences. To accomodate this limit, you can use the count subcommand of SplitSeq to divide your files into small pieces.
SplitSeq.py count -s reads.fastq -n 500000 --fasta
The -n 500000
argument sets the maximum number of
sequences in each file and the --fasta
tells the tool to output a FASTA, rather than FASTQ, formatted file.
Note
You can usually avoid the necessity of reducing file sizes by removing duplicate sequences first using the CollapseSeq tool.
Subsetting sequence files by annotation¶
The group subcommand of SplitSeq allows you to split one file
into multiple files based on the values in a sequence annotation. For example,
splitting one file with multiple SAMPLE
annotations into separate files
(one for each sample) would be accomplished by:
SplitSeq.py group -s reads.fastq -f SAMPLE
Which will create a set of files labelled SAMPLE-M1
and SAMPLE-M2
, if samples are
named M1
and M2
.
If you wanted to split based on a numeric value, rather than a set of categorical values,
then you would add the --num
argument. SplitSeq
would then create two files: one containing sequences with values less than the threshold
specified by the --num
argument and one file containing
sequences with values greater than or equal to the threshold:
SplitSeq.py group -s reads.fastq -f DUPCOUNT --num 2
Which will create two files with the labels atleast-2
and under-2
.
Random sampling from sequence files¶
The sample subcommand of SplitSeq may be used to generate a
random sample from a sequence file or set of pair-end files. The example below
will select a random sample of 1,000 sequences (-n 1000
)
which all contain the annotation SAMPLE=M1
(-f SAMPLE
and -u M1
):
SplitSeq.py sample -s reads.fastq -f SAMPLE -u M1 -n 1000
Performing an analogous sampling of Illumina paired-end reads would be accomplished using the samplepair subcommand:
SplitSeq.py samplepair -s reads.fastq -f SAMPLE -u M1 -n 1000 --coord illumina
Cleaning or removing poor quality sequences¶
Data sets can be cleaned using one or more invocations of FilterSeq, which provides multiple sequence quality control operations. Four subcommands remove sequences from the data that fail to meet some threshold: including length, (length), number of N or gap characters (missing), homopolymeric tract length (repeats), or mean Phred quality score (quality). Two subcommands modify sequences without removing them: trimqual truncates the sequences when the mean Phred quality scores decays under a threshold, and maskqual replaces positions with low Phred quality scores with N characters.
FilterSeq provides the following quality control subcommands:
Subcommand | Operation |
---|---|
length | Removes short sequences |
missing | Removes sequences with too many Ns or gaps |
repeats | Removes sequences with long homopolymeric tracts |
quality | Removes sequences with low mean quality scores |
trimqual | Truncates sequences where quality scores decay |
maskqual | Masks low quality positions |
Assembling paired-end reads that do not overlap¶
The typical way to assemble paired-end reads is via de novo assembly using the align subcommand of AssemblePairs. However, some sequences with long CDR3 regions may fail to assemble due to insufficient, or completely absent, overlap between the mate-pairs. The reference or sequential subcommands can be used to assemble mate-pairs that do not overlap using the ungapped V-segment references sequences as a guide.
To handle such sequence in two separate steps, a normal align command
would be performed first. The --failed
argument is added so that the reads failing de novo alignment are output to
separate files:
AssemblePairs.py align -1 reads-1.fastq -2 reads-2.fastq --rc tail \
--coord illumina --failed -outname align
Then, the files labeled assemble-fail
, along with the ungapped V-segment
reference sequences (-r vref.fasta
),
would be input into the reference subcommand of AssemblePairs:
AssemblePairs.py reference -1 align-1_assemble-fail.fastq -2 align-2_assemble-fail.fastq \
--rc tail -r vref.fasta --coord illumina --outname ref
This will result in two separate assemble-pass
files - one from each step. You may
process them separately or concatenate them together into a single file:
cat align_assemble-pass.fastq ref_assemble-pass.fastq > merged_assemble-pass.fastq
However, if you intend to processes them together, you may simplify this by perform both steps using the sequential subcommand, which will attempt de novo assembly followed by reference guided assembly if de novo assembly fails:
AssemblePairs.py sequential -1 reads-1.fastq -2 reads-2.fastq --rc tail \
--coord illumina -r vref.fasta
Note
The sequences output by the reference or sequential subcommands may contain an appropriate length spacer of Ns between any mate-pairs that do not overlap. The :option:–fill <AssemblePairs reference –fill>`` argument may be specified to force AssemblePairs to insert the germline sequence into the missing positions, but this should be used with caution as the inserted sequence may not be biologically correct.
Assigning isotype annotations from the constant region sequence¶
MaskPrimers is usually used to remove primer regions and annotate sequences with primer identifiers. However, it can be used for any other case where you need to align a set of short sequences against the reads. One example alternate use is where you either do not know the C-region primer sequences or do not trust the primer region to provide an accurate isotype assignment.
If you build a FASTA file containing the reverse-complement of short sequences from the front of CH-1, then you can annotate the reads with these sequence in the same way you would C-region specific primers:
MaskPrimers.py align -s reads.fastq -p IGHC.fasta --maxlen 100 --maxerror 0.3 \
--mode cut --revpr --pf C_CALL
Where --revpr
tells MaskPrimers to
reverse-complement the “primer” sequences and look for them at the end of the reads,
--maxlen 100
restricts the search to the last
100 bp, --maxerror 0.3
allows for up to
30% mismatches, and -p IGHC.fasta
specifies the file
containing the CH-1 sequences. The name of the C-region will be added to the sequence
headers as the C_CALL
annotation, where the field name is specified by the
--pf
argument. An example CH-1 sequence file would look like:
>IGHD
CTGATATGATGGGGAACACATCCGGAGCCTTGGTGGGTGC
>IGHM
AGGAGACGAGGGGGAAAAGGGTTGGGGCGGATGCACTCCC
>IGHG
AGGGYGCCAGGGGGAAGACSGATGGGCCCTTGGTGGAAGC
>IGHA
MGAGGCTCAGCGGGAAGACCTTGGGGCTGGTCGGGGATGC
>IGHE
AGCGGGTCAAGGGGAAGACGGATGGGCTCTGTGTGGAGGC
See also
Constant region reference sequences may be downloaded from IMGT and the sequence headers can be reformated using the imgt subcommand of ConvertHeaders. Note, you may need to clean-up the reference sequences a bit before running ConvertHeaders if you receive an error about duplicate sequence names (eg, remove duplicate allele with different artificial splicing). To cut and reverse-complement the constant region sequences use something like seqmagick.
Estimating sequencing and PCR error rates with UMI data¶
The EstimateError tool provides methods for estimating the combined
PCR and sequencing error rates from large UMI read groups. The assumptions being,
that consensus sequences generated from sufficiently large UMI read groups should
be accurate representations of the true sequences, and that the rate of mismatches
from consensus should therefore be an accurate estimate of the error rate in
the data. However, this is not guaranteed to be true, hence this approach can only
be considered an estimate of a data set’s error profile. The following command
generates an error profile from UMI read groups with 50 or more sequences
(-n 50
), using a majority rule consensus sequence
(--mode freq
), and excluding UMI read groups
with high nucleotide diversity (--maxdiv 0.1
):
EstimateError.py -s reads.fastq -n 50 --mode freq --maxdiv 0.1
This generates the following tab-delimited files containing error rates broken down by various criteria:
File | Error profile |
---|---|
reads_error-position.tab | Error rates by read position |
reads_error-quality.tab | Error rates by quality score |
reads_error-nucleotide.tab | Error rates by nucleotide identity |
reads_error-set.tab | Error rates by UMI read group size |
Commandline Usage¶
AlignSets¶
Multiple aligns input sequences by group
usage: AlignSets [--version] [-h] ...
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
- output files:
- align-pass
- multiple aligned reads.
- align-fail
- raw reads failing multiple alignment.
- offsets-forward
- 5’ offset table for input into offset subcommand.
- offsets-reverse
- 3’ offset table for input into offset subcommand.
- output annotation fields:
- None
AlignSets muscle¶
Align sequence sets using muscle.
usage: AlignSets muscle [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--log LOG_FILE] [--failed]
[--fasta] [--delim DELIMITER DELIMITER DELIMITER]
[--nproc NPROC] [--bf BARCODE_FIELD] [--div]
[--exec ALIGNER_EXEC]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--log
<log_file>
¶ Specify to write verbose logging to a file. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
--nproc
<nproc>
¶ The number of simultaneous computational processes to execute (CPU cores to utilized).
-
--bf
<barcode_field>
¶ The annotation field containing barcode labels for sequence grouping.
-
--div
¶
Specify to calculate nucleotide diversity of each set (average pairwise error rate).
-
--exec
<aligner_exec>
¶ The name or location of the muscle executable.
AlignSets offset¶
Align sequence sets using predefined 5’ offset.
usage: AlignSets offset [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--log LOG_FILE] [--failed]
[--fasta] [--delim DELIMITER DELIMITER DELIMITER]
[--nproc NPROC] [-d OFFSET_TABLE] [--bf BARCODE_FIELD]
[--pf PRIMER_FIELD] [--mode {pad,cut}] [--div]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--log
<log_file>
¶ Specify to write verbose logging to a file. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
--nproc
<nproc>
¶ The number of simultaneous computational processes to execute (CPU cores to utilized).
-
-d
<offset_table>
¶ The tab delimited file of offset tags and values.
-
--bf
<barcode_field>
¶ The annotation field containing barcode labels for sequence grouping.
-
--pf
<primer_field>
¶ The primer field to use for offset assignment.
-
--mode
{pad,cut}
¶ Specifies whether or align sequence by padding with gaps or by cutting the 5’ sequence to a common start position.
-
--div
¶
Specify to calculate nucleotide diversity of each set (average pairwise error rate).
AlignSets table¶
Create a 5’ offset table by primer multiple alignment.
usage: AlignSets table [--version] [-h] [--outdir OUT_DIR]
[--outname OUT_NAME] [--failed]
[--delim DELIMITER DELIMITER DELIMITER] -p PRIMER_FILE
[-o OUT_FILE] [--reverse] [--exec ALIGNER_EXEC]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
-p
<primer_file>
¶ A FASTA file containing primer sequences.
-
-o
<out_file>
¶ Explicit output file name. Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--reverse
¶
If specified create a 3’ offset table instead
-
--exec
<aligner_exec>
¶ The name or location of the muscle executable.
AssemblePairs¶
Assembles paired-end reads into a single sequence
usage: AssemblePairs [--version] [-h] ...
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
- output files:
- assemble-pass
- successfully assembled reads.
- assemble-fail
- raw reads failing paired-end assembly.
- output annotation fields:
- <user defined>
- annotation fields specified by the –1f or –2f arguments.
AssemblePairs align¶
Assemble pairs by aligning ends.
usage: AssemblePairs align [--version] [-h] -1 SEQ_FILES_1 [SEQ_FILES_1 ...]
-2 SEQ_FILES_2 [SEQ_FILES_2 ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--log LOG_FILE] [--failed]
[--fasta] [--delim DELIMITER DELIMITER DELIMITER]
[--nproc NPROC]
[--coord {illumina,solexa,sra,454,presto}]
[--rc {tail,head,both,none}]
[--1f HEAD_FIELDS [HEAD_FIELDS ...]]
[--2f TAIL_FIELDS [TAIL_FIELDS ...]]
[--alpha ALPHA] [--maxerror MAX_ERROR]
[--minlen MIN_LEN] [--maxlen MAX_LEN] [--scanrev]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-1
<seq_files_1>
¶ An ordered list of FASTA/FASTQ files containing head/primary sequences.
-
-2
<seq_files_2>
¶ An ordered list of FASTA/FASTQ files containing tail/secondary sequences.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--log
<log_file>
¶ Specify to write verbose logging to a file. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
--nproc
<nproc>
¶ The number of simultaneous computational processes to execute (CPU cores to utilized).
-
--coord
{illumina,solexa,sra,454,presto}
¶ The format of the sequence identifier which defines shared coordinate information across paired ends.
-
--rc
{tail,head,both,none}
¶ Specify which read to reverse complement before stitching.
-
--1f
<head_fields>
¶ Specify annotation fields to copy from head records into assembled record.
-
--2f
<tail_fields>
¶ Specify annotation fields to copy from tail records into assembled record.
-
--alpha
<alpha>
¶ Significance threshold for de novo paired-end assembly.
-
--maxerror
<max_error>
¶ Maximum allowable error rate for de novo assembly.
-
--minlen
<min_len>
¶ Minimum sequence length to scan for overlap in de novo assembly.
-
--maxlen
<max_len>
¶ Maximum sequence length to scan for overlap in de novo assembly.
-
--scanrev
¶
If specified, scan past the end of the tail sequence in de novo assembly to allow the head sequence to overhang the end of the tail sequence.
AssemblePairs join¶
Assemble pairs by concatenating ends.
usage: AssemblePairs join [--version] [-h] -1 SEQ_FILES_1 [SEQ_FILES_1 ...] -2
SEQ_FILES_2 [SEQ_FILES_2 ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--log LOG_FILE] [--failed]
[--fasta] [--delim DELIMITER DELIMITER DELIMITER]
[--nproc NPROC]
[--coord {illumina,solexa,sra,454,presto}]
[--rc {tail,head,both,none}]
[--1f HEAD_FIELDS [HEAD_FIELDS ...]]
[--2f TAIL_FIELDS [TAIL_FIELDS ...]] [--gap GAP]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-1
<seq_files_1>
¶ An ordered list of FASTA/FASTQ files containing head/primary sequences.
-
-2
<seq_files_2>
¶ An ordered list of FASTA/FASTQ files containing tail/secondary sequences.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--log
<log_file>
¶ Specify to write verbose logging to a file. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
--nproc
<nproc>
¶ The number of simultaneous computational processes to execute (CPU cores to utilized).
-
--coord
{illumina,solexa,sra,454,presto}
¶ The format of the sequence identifier which defines shared coordinate information across paired ends.
-
--rc
{tail,head,both,none}
¶ Specify which read to reverse complement before stitching.
-
--1f
<head_fields>
¶ Specify annotation fields to copy from head records into assembled record.
-
--2f
<tail_fields>
¶ Specify annotation fields to copy from tail records into assembled record.
-
--gap
<gap>
¶ Number of N characters to place between ends.
AssemblePairs reference¶
Assemble pairs by aligning reads against a reference database.
usage: AssemblePairs reference [--version] [-h] -1 SEQ_FILES_1
[SEQ_FILES_1 ...] -2 SEQ_FILES_2
[SEQ_FILES_2 ...]
[-o OUT_FILES [OUT_FILES ...]]
[--outdir OUT_DIR] [--outname OUT_NAME]
[--log LOG_FILE] [--failed] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER]
[--nproc NPROC]
[--coord {illumina,solexa,sra,454,presto}]
[--rc {tail,head,both,none}]
[--1f HEAD_FIELDS [HEAD_FIELDS ...]]
[--2f TAIL_FIELDS [TAIL_FIELDS ...]] -r
REF_FILE [--minident MIN_IDENT]
[--evalue EVALUE] [--maxhits MAX_HITS] [--fill]
[--aligner {blastn,usearch}]
[--exec ALIGNER_EXEC] [--dbexec DB_EXEC]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-1
<seq_files_1>
¶ An ordered list of FASTA/FASTQ files containing head/primary sequences.
-
-2
<seq_files_2>
¶ An ordered list of FASTA/FASTQ files containing tail/secondary sequences.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--log
<log_file>
¶ Specify to write verbose logging to a file. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
--nproc
<nproc>
¶ The number of simultaneous computational processes to execute (CPU cores to utilized).
-
--coord
{illumina,solexa,sra,454,presto}
¶ The format of the sequence identifier which defines shared coordinate information across paired ends.
-
--rc
{tail,head,both,none}
¶ Specify which read to reverse complement before stitching.
-
--1f
<head_fields>
¶ Specify annotation fields to copy from head records into assembled record.
-
--2f
<tail_fields>
¶ Specify annotation fields to copy from tail records into assembled record.
-
-r
<ref_file>
¶ A FASTA file containing the reference sequence database.
-
--minident
<min_ident>
¶ Minimum identity of the assembled sequence required to call a valid reference guided assembly (between 0 and 1).
-
--evalue
<evalue>
¶ Minimum E-value for reference alignment for both the head and tail sequence.
-
--maxhits
<max_hits>
¶ Maximum number of hits from the reference alignment to check for matching head and tail sequence assignments.
-
--fill
¶
Specify to change the behavior of inserted characters when the head and tail sequences do not overlap during reference guided assembly. If specified, this will result in inserted of the V region reference sequence instead of a sequence of Ns in the non-overlapping region. Warning: you could end up making chimeric sequences by using this option.
-
--aligner
{blastn,usearch}
¶ The local alignment tool to use. Must be one blastn (blast+ nucleotide) or usearch (ublast algorithm).
-
--exec
<aligner_exec>
¶ The name or location of the aligner executable file (blastn or usearch). Defaults to the name specified by the –aligner argument.
-
--dbexec
<db_exec>
¶ The name or location of the executable file that builds the reference database. This defaults to makeblastdb when blastn is specified to the –aligner argument, and usearch when usearch is specified.
AssemblePairs sequential¶
Assemble pairs by first attempting de novo assembly, then reference guided assembly.
usage: AssemblePairs sequential [--version] [-h] -1 SEQ_FILES_1
[SEQ_FILES_1 ...] -2 SEQ_FILES_2
[SEQ_FILES_2 ...]
[-o OUT_FILES [OUT_FILES ...]]
[--outdir OUT_DIR] [--outname OUT_NAME]
[--log LOG_FILE] [--failed] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER]
[--nproc NPROC]
[--coord {illumina,solexa,sra,454,presto}]
[--rc {tail,head,both,none}]
[--1f HEAD_FIELDS [HEAD_FIELDS ...]]
[--2f TAIL_FIELDS [TAIL_FIELDS ...]]
[--alpha ALPHA] [--maxerror MAX_ERROR]
[--minlen MIN_LEN] [--maxlen MAX_LEN]
[--scanrev] -r REF_FILE [--minident MIN_IDENT]
[--evalue EVALUE] [--maxhits MAX_HITS]
[--fill] [--aligner {blastn,usearch}]
[--exec ALIGNER_EXEC] [--dbexec DB_EXEC]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-1
<seq_files_1>
¶ An ordered list of FASTA/FASTQ files containing head/primary sequences.
-
-2
<seq_files_2>
¶ An ordered list of FASTA/FASTQ files containing tail/secondary sequences.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--log
<log_file>
¶ Specify to write verbose logging to a file. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
--nproc
<nproc>
¶ The number of simultaneous computational processes to execute (CPU cores to utilized).
-
--coord
{illumina,solexa,sra,454,presto}
¶ The format of the sequence identifier which defines shared coordinate information across paired ends.
-
--rc
{tail,head,both,none}
¶ Specify which read to reverse complement before stitching.
-
--1f
<head_fields>
¶ Specify annotation fields to copy from head records into assembled record.
-
--2f
<tail_fields>
¶ Specify annotation fields to copy from tail records into assembled record.
-
--alpha
<alpha>
¶ Significance threshold for de novo paired-end assembly.
-
--maxerror
<max_error>
¶ Maximum allowable error rate for de novo assembly.
-
--minlen
<min_len>
¶ Minimum sequence length to scan for overlap in de novo assembly.
-
--maxlen
<max_len>
¶ Maximum sequence length to scan for overlap in de novo assembly.
-
--scanrev
¶
If specified, scan past the end of the tail sequence in de novo assembly to allow the head sequence to overhang the end of the tail sequence.
-
-r
<ref_file>
¶ A FASTA file containing the reference sequence database.
-
--minident
<min_ident>
¶ Minimum identity of the assembled sequence required to call a valid reference guided assembly (between 0 and 1).
-
--evalue
<evalue>
¶ Minimum E-value for reference alignment for both the head and tail sequence.
-
--maxhits
<max_hits>
¶ Maximum number of hits from the reference alignment to check for matching head and tail sequence assignments.
-
--fill
¶
Specify to change the behavior of inserted characters when the head and tail sequences do not overlap during reference guided assembly. If specified, this will result in inserted of the V region reference sequence instead of a sequence of Ns in the non-overlapping region. Warning: you could end up making chimeric sequences by using this option.
-
--aligner
{blastn,usearch}
¶ The local alignment tool to use. Must be one blastn (blast+ nucleotide) or usearch (ublast algorithm).
-
--exec
<aligner_exec>
¶ The name or location of the aligner executable file (blastn or usearch). Defaults to the name specified by the –aligner argument.
-
--dbexec
<db_exec>
¶ The name or location of the executable file that builds the reference database. This defaults to makeblastdb when blastn is specified to the –aligner argument, and usearch when usearch is specified.
BuildConsensus¶
Builds a consensus sequence for each set of input sequences
usage: BuildConsensus [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--log LOG_FILE] [--failed]
[--fasta] [--delim DELIMITER DELIMITER DELIMITER]
[--nproc NPROC] [-n MIN_COUNT] [--bf BARCODE_FIELD]
[-q MIN_QUAL] [--freq MIN_FREQ] [--maxgap MAX_GAP]
[--pf PRIMER_FIELD] [--prcons PRIMER_FREQ]
[--cf COPY_FIELDS [COPY_FIELDS ...]]
[--act {min,max,sum,set,majority} [{min,max,sum,set,majority} ...]]
[--dep] [--maxdiv MAX_DIVERSITY | --maxerror MAX_ERROR]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--log
<log_file>
¶ Specify to write verbose logging to a file. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
--nproc
<nproc>
¶ The number of simultaneous computational processes to execute (CPU cores to utilized).
-
-n
<min_count>
¶ The minimum number of sequences needed to define a valid consensus.
-
--bf
<barcode_field>
¶ Position of description barcode field to group sequences by.
-
-q
<min_qual>
¶ Consensus quality score cut-off under which an ambiguous character is assigned; does not apply when quality scores are unavailable.
-
--freq
<min_freq>
¶ Fraction of character occurrences under which an ambiguous character is assigned.
-
--maxgap
<max_gap>
¶ If specified, this defines a cut-off for the frequency of allowed gap values for each position. Positions exceeding the threshold are deleted from the consensus. If not defined, positions are always retained.
-
--pf
<primer_field>
¶ Specifies the field name of the primer annotations
-
--prcons
<primer_freq>
¶ Specify to define a minimum primer frequency required to assign a consensus primer, and filter out sequences with minority primers from the consensus building step.
-
--cf
<copy_fields>
¶ Specifies a set of additional annotation fields to copy into the consensus sequence annotations.
-
--act
{min,max,sum,set,majority}
¶ List of actions to take for each copy field which defines how each annotation will be combined into a single value. The actions “min”, “max”, “sum” perform the corresponding mathematical operation on numeric annotations. The action “set” combines annotations into a comma delimited list of unique values and adds an annotation named <FIELD>_COUNT specifying the count of each item in the set. The action “majority” assigns the most frequent annotation to the consensus annotation and adds an annotation named <FIELD>_FREQ specifying the frequency of the majority value.
-
--dep
¶
Specify to calculate consensus quality with a non-independence assumption
-
--maxdiv
<max_diversity>
¶ Specify to calculate the nucleotide diversity of each read group (average pairwise error rate) and remove groups exceeding the given diversity threshold. Diversity is calculate for all positions within the read group, ignoring any character filtering imposed by the -q, –freq and –maxgap arguments. Mutually exclusive with –maxerror.
-
--maxerror
<max_error>
¶ Specify to calculate the error rate of each read group (rate of mismatches from consensus) and remove groups exceeding the given error threshold. The error rate is calculated against the final consensus sequence, which may include masked positions due to the -q and –freq arguments and may have deleted positions due to the –maxgap argument. Mutually exclusive with –maxdiv.
- output files:
- consensus-pass
- consensus reads.
- consensus-fail
- raw reads failing consensus filtering criteria.
- output annotation fields:
- PRIMER
- a comma delimited list of unique primer annotations found within the barcode read group.
- PRCOUNT
- a comma delimited list of the corresponding counts of unique primer annotations.
- PRCONS
- the majority primer within the barcode read group.
- PRFREQ
- the frequency of the majority primer.
- CONSCOUNT
- the count of reads within the barcode read group which contributed to the consensus sequence. This is the total size of the read group, minus sequence excluded due to user defined filtering criteria.
ClusterSets¶
Cluster sequences by group
usage: ClusterSets [--version] [-h] ...
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
- output files:
- cluster-pass
- clustered reads.
- cluster-fail
- raw reads failing clustering.
- output annotation fields:
- CLUSTER
- a numeric cluster identifier defining the within-group cluster.
ClusterSets all¶
Cluster all sequences regardless of annotation.
usage: ClusterSets all [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER] [--nproc NPROC]
[-k CLUSTER_FIELD] [--ident IDENT]
[--length LENGTH_RATIO] [--prefix CLUSTER_PREFIX]
[--cluster {usearch,vsearch,cd-hit-est}]
[--exec CLUSTER_EXEC] [--start SEQ_START]
[--end SEQ_END]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
--nproc
<nproc>
¶ The number of simultaneous computational processes to execute (CPU cores to utilized).
-
-k
<cluster_field>
¶ The name of the output annotation field to add with the cluster information for each sequence.
-
--ident
<ident>
¶ The sequence identity threshold to use for clustering. Note, how identity is calculated is specific to the clustering application used.
-
--length
<length_ratio>
¶ The minimum allowed shorter/longer sequence length ratio allowed within a cluster. Setting this value to 1.0 will require identical length matches within clusters. A value of 0.0 will allow clusters containing any length of substring.
-
--prefix
<cluster_prefix>
¶ A string to use as the prefix for each cluster identifier. By default, cluster identifiers will be numeric values only.
-
--cluster
{usearch,vsearch,cd-hit-est}
¶ The clustering tool to use for assigning clusters. Must be one of usearch, vsearch or cd-hit-est. Note, for cd-hit-est the maximum memory limit is set to 3GB.
-
--exec
<cluster_exec>
¶ The name or path of the usearch, vsearch or cd-hit-est executable.
-
--start
<seq_start>
¶ The start of the region to be used for clustering. Together with –end, this parameter can be used to specify a subsequence of each read to use in the clustering algorithm.
-
--end
<seq_end>
¶ The end of the region to be used for clustering.
ClusterSets barcode¶
Cluster reads by clustering barcode sequences.
usage: ClusterSets barcode [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER]
[--nproc NPROC] [-k CLUSTER_FIELD] [--ident IDENT]
[--length LENGTH_RATIO] [--prefix CLUSTER_PREFIX]
[--cluster {usearch,vsearch,cd-hit-est}]
[--exec CLUSTER_EXEC] [-f BARCODE_FIELD]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
--nproc
<nproc>
¶ The number of simultaneous computational processes to execute (CPU cores to utilized).
-
-k
<cluster_field>
¶ The name of the output annotation field to add with the cluster information for each sequence.
-
--ident
<ident>
¶ The sequence identity threshold to use for clustering. Note, how identity is calculated is specific to the clustering application used.
-
--length
<length_ratio>
¶ The minimum allowed shorter/longer sequence length ratio allowed within a cluster. Setting this value to 1.0 will require identical length matches within clusters. A value of 0.0 will allow clusters containing any length of substring.
-
--prefix
<cluster_prefix>
¶ A string to use as the prefix for each cluster identifier. By default, cluster identifiers will be numeric values only.
-
--cluster
{usearch,vsearch,cd-hit-est}
¶ The clustering tool to use for assigning clusters. Must be one of usearch, vsearch or cd-hit-est. Note, for cd-hit-est the maximum memory limit is set to 3GB.
-
--exec
<cluster_exec>
¶ The name or path of the usearch, vsearch or cd-hit-est executable.
-
-f
<barcode_field>
¶ The annotation field containing barcode sequences.
ClusterSets set¶
Cluster sequences within annotation sets.
usage: ClusterSets set [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--log LOG_FILE] [--failed]
[--fasta] [--delim DELIMITER DELIMITER DELIMITER]
[--nproc NPROC] [-k CLUSTER_FIELD] [--ident IDENT]
[--length LENGTH_RATIO] [--prefix CLUSTER_PREFIX]
[--cluster {usearch,vsearch,cd-hit-est}]
[--exec CLUSTER_EXEC] [-f SET_FIELD]
[--start SEQ_START] [--end SEQ_END]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--log
<log_file>
¶ Specify to write verbose logging to a file. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
--nproc
<nproc>
¶ The number of simultaneous computational processes to execute (CPU cores to utilized).
-
-k
<cluster_field>
¶ The name of the output annotation field to add with the cluster information for each sequence.
-
--ident
<ident>
¶ The sequence identity threshold to use for clustering. Note, how identity is calculated is specific to the clustering application used.
-
--length
<length_ratio>
¶ The minimum allowed shorter/longer sequence length ratio allowed within a cluster. Setting this value to 1.0 will require identical length matches within clusters. A value of 0.0 will allow clusters containing any length of substring.
-
--prefix
<cluster_prefix>
¶ A string to use as the prefix for each cluster identifier. By default, cluster identifiers will be numeric values only.
-
--cluster
{usearch,vsearch,cd-hit-est}
¶ The clustering tool to use for assigning clusters. Must be one of usearch, vsearch or cd-hit-est. Note, for cd-hit-est the maximum memory limit is set to 3GB.
-
--exec
<cluster_exec>
¶ The name or path of the usearch, vsearch or cd-hit-est executable.
-
-f
<set_field>
¶ The annotation field containing annotations, such as UMI barcode, for sequence grouping.
-
--start
<seq_start>
¶ The start of the region to be used for clustering. Together with –end, this parameter can be used to specify a subsequence of each read to use in the clustering algorithm.
-
--end
<seq_end>
¶ The end of the region to be used for clustering.
CollapseSeq¶
Removes duplicate sequences from FASTA/FASTQ files
usage: CollapseSeq [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--log LOG_FILE] [--failed] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER] [-n MAX_MISSING]
[--uf UNIQ_FIELDS [UNIQ_FIELDS ...]]
[--cf COPY_FIELDS [COPY_FIELDS ...]]
[--act {min,max,sum,set} [{min,max,sum,set} ...]] [--inner]
[--keepmiss] [--maxf MAX_FIELD | --minf MIN_FIELD]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--log
<log_file>
¶ Specify to write verbose logging to a file. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
-n
<max_missing>
¶ Maximum number of missing nucleotides to consider for collapsing sequences. A sequence will be considered undetermined if it contains too many missing nucleotides.
-
--uf
<uniq_fields>
¶ Specifies a set of annotation fields that must match for sequences to be considered duplicates.
-
--cf
<copy_fields>
¶ Specifies a set of annotation fields to copy into the unique sequence output.
-
--act
{min,max,sum,set}
¶ List of actions to take for each copy field which defines how each annotation will be combined into a single value. The actions “min”, “max”, “sum” perform the corresponding mathematical operation on numeric annotations. The action “set” collapses annotations into a comma delimited list of unique values.
-
--inner
¶
If specified, exclude consecutive missing characters at either end of the sequence.
-
--keepmiss
¶
If specified, sequences with more missing characters than the threshold set by the -n parameter will be written to the unique sequence output file with a DUPCOUNT=1 annotation. If not specified, such sequences will be written to a separate file.
-
--maxf
<max_field>
¶ Specify the field whose maximum value determines the retained sequence; mutually exclusive with –minf.
-
--minf
<min_field>
¶ Specify the field whose minimum value determines the retained sequence; mutually exclusive with –minf.
- output files:
- collapse-unique
- unique sequences. Contains one representative from each set of duplicate sequences. The retained representative is determined by user defined criteria.
- collapse-duplicate
- raw reads which are duplicates of the sequences retained in the collapse-unique file.
- collapse-undetermined
- raw reads which were excluded from consideration due to having too many N characters in the sequence.
- output annotation fields:
- DUPCOUNT
- total number of sequences within the set of duplicates for each retained unique sequence. Meaning, the copy number of each unique sequence within the data file.
- <user defined>
- annotation fields specified by the –cf parameter.
ConvertHeaders¶
Converts sequence headers to the pRESTO format
usage: ConvertHeaders [--version] [-h] ...
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
- output files:
- convert-pass
- reads passing header conversion.
- convert-fail
- raw reads failing header conversion.
- output annotation fields:
- <format defined>
- the annotation fields added are specific to the header format of the input file.
ConvertHeaders 454¶
Converts Roche 454 sequence headers.
usage: ConvertHeaders 454 [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--failed] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
ConvertHeaders genbank¶
- Converts NCBI GenBank and RefSeq
- sequence headers.
usage: ConvertHeaders genbank [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]]
[--outdir OUT_DIR] [--outname OUT_NAME]
[--failed] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
ConvertHeaders generic¶
- Converts sequence headers without a known
- annotation system.
usage: ConvertHeaders generic [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]]
[--outdir OUT_DIR] [--outname OUT_NAME]
[--failed] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
ConvertHeaders illumina¶
Converts Illumina sequence headers.
usage: ConvertHeaders illumina [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]]
[--outdir OUT_DIR] [--outname OUT_NAME]
[--failed] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
ConvertHeaders imgt¶
- Converts sequence headers output by
- IMGT/GENE-DB.
usage: ConvertHeaders imgt [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--failed] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER] [--simple]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
--simple
¶
If specified, only the allele name, and no other annotations, will appear in the converted sequence header.
ConvertHeaders migec¶
- Converts headers for consensus sequence generated
- by the MIGEC tool.
usage: ConvertHeaders migec [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--failed] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
ConvertHeaders sra¶
Converts NCBI SRA or EMBL-EBI ENA sequence headers.
usage: ConvertHeaders sra [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--failed] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
EstimateError¶
Calculates annotation set error rates
usage: EstimateError [--version] [-h] ...
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
- output files:
- error-position
- estimated error by read position.
- error-quality
- estimated error by the quality score assigned within the input file.
- error-nucleotide
- estimated error by nucleotide.
- error-set
- estimated error by annotation set size.
- distance-set
- pairwise hamming distances by annotation set.
- threshold-set
- thresholds from pairwise hamming distances for annotation sets.
- distance-barcode
- estimated error by pairwise hamming distances
- threshold-barcode
- thresholds from pairwise hamming distances for clustering barcodes
- output fields:
- POSITION
- read position with base zero indexing.
- Q
- Phred quality score.
- OBSERVED
- observed nucleotide value.
- REFERENCE
- consensus nucleotide for the barcode read group.
- SET_COUNT
- barcode read group size.
- REPORTED_Q
- mean Phred quality score reported within the input file for the given position, quality score, nucleotide or read group.
- MISMATCHES
- count of observed mismatches from consensus for the given position, quality score, nucleotide or read group.
- OBSERVATIONS
- total count of observed values for each position, quality score, nucleotide or read group size.
- ERROR
- estimated error rate.
- EMPIRICAL_Q
- estimated error rate converted to a Phred quality score.
- ALL
- histogram (count) of all pairwise distance distribution.
- DTN
- histogram (count) of distance to nearest distribution.
- DISTANCE
- length normalized hamming distance.
EstimateError barcode¶
Calculates pairwise distance metrics of barcode sequences.
usage: EstimateError barcode [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[--outdir OUT_DIR] [--outname OUT_NAME]
[--delim DELIMITER DELIMITER DELIMITER]
[-f BARCODE_FIELD]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
-f
<barcode_field>
¶ The name of the barcode field.
EstimateError set¶
Estimates error statistics within annotation sets.
usage: EstimateError set [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[--outdir OUT_DIR] [--outname OUT_NAME]
[--log LOG_FILE]
[--delim DELIMITER DELIMITER DELIMITER]
[--nproc NPROC] [-f SET_FIELD] [-n MIN_COUNT]
[--mode {freq,qual}] [-q MIN_QUAL] [--freq MIN_FREQ]
[--maxdiv MAX_DIVERSITY]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--log
<log_file>
¶ Specify to write verbose logging to a file. May not be specified with multiple input files.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
--nproc
<nproc>
¶ The number of simultaneous computational processes to execute (CPU cores to utilized).
-
-f
<set_field>
¶ The name of the annotation field to group sequences by
-
-n
<min_count>
¶ The minimum number of sequences needed to consider a set
-
--mode
{freq,qual}
¶ Specifies which method to use to determine the consensus sequence. The “freq” method will determine the consensus by nucleotide frequency at each position and assign the most common value. The “qual” method will weight values by their quality scores to determine the consensus nucleotide at each position.
-
-q
<min_qual>
¶ Consensus quality score cut-off under which an ambiguous character is assigned.
-
--freq
<min_freq>
¶ Fraction of character occurrences under which an ambiguous character is assigned.
-
--maxdiv
<max_diversity>
¶ Specify to calculate the nucleotide diversity of each read group (average pairwise error rate) and exclude groups which exceed the given diversity threshold.
FilterSeq¶
Filters sequences in FASTA/FASTQ files
usage: FilterSeq [--version] [-h] ...
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
- output files:
- <command>-pass
- reads passing filtering operation and modified accordingly, where <command> is the name of the filtering operation that was run.
- <command>-fail
- raw reads failing filtering criteria, where <command> is the name of the filtering operation.
- output annotation fields:
- None
FilterSeq length¶
Filters reads by length.
usage: FilterSeq length [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--log LOG_FILE] [--failed]
[--fasta] [--nproc NPROC] [-n MIN_LENGTH] [--inner]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--log
<log_file>
¶ Specify to write verbose logging to a file. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--nproc
<nproc>
¶ The number of simultaneous computational processes to execute (CPU cores to utilized).
-
-n
<min_length>
¶ Minimum sequence length to retain.
-
--inner
¶
If specified exclude consecutive missing characters at either end of the sequence.
FilterSeq maskqual¶
Masks low quality positions.
usage: FilterSeq maskqual [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--log LOG_FILE] [--failed]
[--fasta] [--nproc NPROC] [-q MIN_QUAL]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--log
<log_file>
¶ Specify to write verbose logging to a file. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--nproc
<nproc>
¶ The number of simultaneous computational processes to execute (CPU cores to utilized).
-
-q
<min_qual>
¶ Quality score threshold.
FilterSeq missing¶
Filters reads by N or gap character count.
usage: FilterSeq missing [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--log LOG_FILE] [--failed]
[--fasta] [--nproc NPROC] [-n MAX_MISSING] [--inner]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--log
<log_file>
¶ Specify to write verbose logging to a file. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--nproc
<nproc>
¶ The number of simultaneous computational processes to execute (CPU cores to utilized).
-
-n
<max_missing>
¶ Threshold for fraction of gap or N nucleotides.
-
--inner
¶
If specified exclude consecutive missing characters at either end of the sequence.
FilterSeq quality¶
Filters reads by quality score.
usage: FilterSeq quality [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--log LOG_FILE] [--failed]
[--fasta] [--nproc NPROC] [-q MIN_QUAL] [--inner]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--log
<log_file>
¶ Specify to write verbose logging to a file. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--nproc
<nproc>
¶ The number of simultaneous computational processes to execute (CPU cores to utilized).
-
-q
<min_qual>
¶ Quality score threshold.
-
--inner
¶
If specified exclude consecutive missing characters at either end of the sequence.
FilterSeq repeats¶
Filters reads by consecutive nucleotide repeats.
usage: FilterSeq repeats [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--log LOG_FILE] [--failed]
[--fasta] [--nproc NPROC] [-n MAX_REPEAT] [--missing]
[--inner]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--log
<log_file>
¶ Specify to write verbose logging to a file. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--nproc
<nproc>
¶ The number of simultaneous computational processes to execute (CPU cores to utilized).
-
-n
<max_repeat>
¶ Threshold for fraction of repeating nucleotides.
-
--missing
¶
If specified count consecutive gap and N characters ‘ in addition to {A,C,G,T}.
-
--inner
¶
If specified exclude consecutive missing characters at either end of the sequence.
FilterSeq trimqual¶
Trims sequences by quality score decay.
usage: FilterSeq trimqual [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--log LOG_FILE] [--failed]
[--fasta] [--nproc NPROC] [-q MIN_QUAL]
[--win WINDOW] [--reverse]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--log
<log_file>
¶ Specify to write verbose logging to a file. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--nproc
<nproc>
¶ The number of simultaneous computational processes to execute (CPU cores to utilized).
-
-q
<min_qual>
¶ Quality score threshold.
-
--win
<window>
¶ Nucleotide window size for moving average calculation.
-
--reverse
¶
Specify to trim the head of the sequence rather than the tail.
MaskPrimers¶
Removes primers and annotates sequences with primer and barcode identifiers
usage: MaskPrimers [--version] [-h] ...
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
- output files:
- mask-pass
- processed reads with successful primer matches.
- mask-fail
- raw reads failing primer identification.
- output annotation fields:
- SEQORIENT
- the orientation of the output sequence. Either F (input) or RC (reverse complement of input).
- PRIMER
- name of the best primer match.
- BARCODE
- the sequence preceding the primer match. Only output when the –barcode flag is specified.
MaskPrimers align¶
Find primer matches using pairwise local alignment.
usage: MaskPrimers align [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--log LOG_FILE] [--failed]
[--fasta] [--delim DELIMITER DELIMITER DELIMITER]
[--nproc NPROC] -p PRIMER_FILE [--maxerror MAX_ERROR]
[--maxlen MAX_LEN] [--gap GAP_PENALTY GAP_PENALTY]
[--revpr] [--skiprc] [--mode {cut,mask,trim,tag}]
[--barcode] [--bf BARCODE_FIELD] [--pf PRIMER_FIELD]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--log
<log_file>
¶ Specify to write verbose logging to a file. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
--nproc
<nproc>
¶ The number of simultaneous computational processes to execute (CPU cores to utilized).
-
-p
<primer_file>
¶ A FASTA file containing primer sequences.
-
--maxerror
<max_error>
¶ Maximum allowable error rate.
-
--maxlen
<max_len>
¶ Length of the sequence window to scan for primers.
-
--gap
<gap_penalty>
¶ A list of two positive values defining the gap open and gap extension penalties for aligning the primers. Note: the error rate is calculated as the percentage of mismatches from the primer sequence with gap penalties reducing the match count accordingly; this may lead to error rates that differ from strict mismatch percentage when gaps are present in the alignment.
-
--revpr
¶
Specify to match the tail-end of the sequence against the reverse complement of the primers. This also reverses the behavior of the –maxlen argument, such that the search window begins at the tail-end of the sequence.
-
--skiprc
¶
Specify to prevent checking of sample reverse complement sequences.
-
--mode
{cut,mask,trim,tag}
¶ Specifies the action to take with the primer sequence. The “cut” mode will remove both the primer region and the preceding sequence. The “mask” mode will replace the primer region with Ns and remove the preceding sequence. The “trim” mode will remove the region preceding the primer, but leave the primer region intact. The “tag” mode will leave the input sequence unmodified.
-
--barcode
¶
Specify to annotate reads sequences with barcode sequences (unique molecular identifiers) found preceding the primer.
-
--bf
<barcode_field>
¶ Name of the barcode annotation field.
-
--pf
<primer_field>
¶ Name of the annotation field containing the primer name.
MaskPrimers extract¶
Remove and annotate a fixed sequence region.
usage: MaskPrimers extract [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--log LOG_FILE] [--failed]
[--fasta] [--delim DELIMITER DELIMITER DELIMITER]
[--nproc NPROC] [--start START] --len LENGTH
[--revpr] [--mode {cut,mask,trim,tag}] [--barcode]
[--bf BARCODE_FIELD] [--pf PRIMER_FIELD]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--log
<log_file>
¶ Specify to write verbose logging to a file. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
--nproc
<nproc>
¶ The number of simultaneous computational processes to execute (CPU cores to utilized).
-
--start
<start>
¶ The starting position of the sequence region to extract.
-
--len
<length>
¶ The length of the sequence to extract.
-
--revpr
¶
Specify to extract from the tail end of the sequence with the start position relative to the end of the sequence.
-
--mode
{cut,mask,trim,tag}
¶ Specifies the action to take with the sequence region. The “cut” mode will remove the region. The “mask” mode will replace the specified region with Ns. The “trim” mode will remove the sequence preceding the specified region, but leave the region intact. The “tag” mode will leave the input sequence unmodified.
-
--barcode
¶
Specify to remove the sequence preceding the extracted region and annotate the read with that sequence.
-
--bf
<barcode_field>
¶ Name of the barcode annotation field.
-
--pf
<primer_field>
¶ Name of the annotation field containing the extracted sequence region.
MaskPrimers score¶
Find primer matches by scoring primers at a fixed position.
usage: MaskPrimers score [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--log LOG_FILE] [--failed]
[--fasta] [--delim DELIMITER DELIMITER DELIMITER]
[--nproc NPROC] -p PRIMER_FILE [--start START]
[--maxerror MAX_ERROR] [--revpr]
[--mode {cut,mask,trim,tag}] [--barcode]
[--bf BARCODE_FIELD] [--pf PRIMER_FIELD]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--log
<log_file>
¶ Specify to write verbose logging to a file. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
--nproc
<nproc>
¶ The number of simultaneous computational processes to execute (CPU cores to utilized).
-
-p
<primer_file>
¶ A FASTA file containing primer sequences.
-
--start
<start>
¶ The starting position of the primer.
-
--maxerror
<max_error>
¶ Maximum allowable error rate.
-
--revpr
¶
Specify to match the tail-end of the sequence against the reverse complement of the primers. This also reverses the behavior of the –start argument, such that start position is relative to the tail-end of the sequence.
-
--mode
{cut,mask,trim,tag}
¶ Specifies the action to take with the primer sequence. The “cut” mode will remove both the primer region and the preceding sequence. The “mask” mode will replace the primer region with Ns and remove the preceding sequence. The “trim” mode will remove the region preceding the primer, but leave the primer region intact. The “tag” mode will leave the input sequence unmodified.
-
--barcode
¶
Specify to annotate reads sequences with barcode sequences (unique molecular identifiers) found preceding the primer.
-
--bf
<barcode_field>
¶ Name of the barcode annotation field.
-
--pf
<primer_field>
¶ Name of the annotation field containing the primer name.
PairSeq¶
Sorts and matches sequence records with matching coordinates across files
usage: PairSeq [--version] [-h] -1 SEQ_FILES_1 [SEQ_FILES_1 ...] -2
SEQ_FILES_2 [SEQ_FILES_2 ...] [--outdir OUT_DIR]
[--outname OUT_NAME] [--failed] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER]
[--1f FIELDS_1 [FIELDS_1 ...]] [--2f FIELDS_2 [FIELDS_2 ...]]
[--act {min,max,sum,set,cat}]
[--coord {illumina,solexa,sra,454,presto}]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-1
<seq_files_1>
¶ An ordered list of FASTA/FASTQ files containing head/primary sequences.
-
-2
<seq_files_2>
¶ An ordered list of FASTA/FASTQ files containing tail/secondary sequences.
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
--1f
<fields_1>
¶ The annotation fields to copy from file 1 records into file 2 records. If a copied annotation already exists in a file 2 record, then the annotations copied from file 1 will be added to the front of the existing annotation.
-
--2f
<fields_2>
¶ The annotation fields to copy from file 2 records into file 1 records. If a copied annotation already exists in a file 1 record, then the annotations copied from file 2 will be added to the end of the existing annotation.
-
--act
{min,max,sum,set,cat}
¶ The collapse actions to take on all fields copied between files to combine duplicate fields into a single value. The actions “min”, “max”, “sum” perform the corresponding mathematical operation on numeric annotations. The action “set” collapses annotations into a comma delimited list of unique values. The action “cat” concatenates the values together into a single string. Only applies if the field already exists in the header before being copying from the other file.
-
--coord
{illumina,solexa,sra,454,presto}
¶ The format of the sequence identifier which defines shared coordinate information across mate pairs.
- output files:
- pair-pass
- successfully paired reads with modified annotations.
- pair-fail
- raw reads that could not be assigned to a mate-pair.
- output annotation fields:
- <user defined>
- annotation fields specified by the –1f or –2f arguments.
ParseHeaders¶
Parses pRESTO annotations in FASTA/FASTQ sequence headers
usage: ParseHeaders [--version] [-h] ...
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
- output files:
- reheader-pass
- reads passing annotation operation and modified accordingly.
- reheader-fail
- raw reads failing annotation operation.
- headers
- tab delimited table of the selected annotations.
- output annotation fields:
- <user defined>
- annotation fields specified by the -f argument.
ParseHeaders add¶
Adds field/value pairs to header annotations
usage: ParseHeaders add [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--failed] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER] -f FIELDS
[FIELDS ...] -u VALUES [VALUES ...]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
-f
<fields>
¶ List of fields to add.
-
-u
<values>
¶ List of values to add for each field.
ParseHeaders collapse¶
Collapses header annotations with multiple entries
usage: ParseHeaders collapse [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--failed] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER] -f FIELDS
[FIELDS ...] --act
{min,max,sum,first,last,set,cat}
[{min,max,sum,first,last,set,cat} ...]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
-f
<fields>
¶ List of fields to collapse.
-
--act
{min,max,sum,first,last,set,cat}
¶ List of actions to take for each field defining how each annotation will be combined into a single value. The actions “min”, “max”, “sum” perform the corresponding mathematical operation on numeric annotations. The actions “first” and “last” choose the value from the corresponding position in the annotation. The action “set” collapses annotations into a comma delimited list of unique values. The action “cat” concatenates the values together into a single string.
ParseHeaders copy¶
Copies header annotation fields
usage: ParseHeaders copy [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--failed] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER] -f FIELDS
[FIELDS ...] -k NAMES [NAMES ...]
[--act {min,max,sum,first,last,set,cat} [{min,max,sum,first,last,set,cat} ...]]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
-f
<fields>
¶ List of fields to copy.
-
-k
<names>
¶ List of names for each copied field. If the new field is already present, the copied field will be merged into the existing field.
-
--act
{min,max,sum,first,last,set,cat}
¶ List of collapse actions to take on each new field following the copy operation defining how each annotation will be combined into a single value. The actions “min”, “max”, “sum” perform the corresponding mathematical operation on numeric annotations. The actions “first” and “last” choose the value from the corresponding position in the annotation. The action “set” collapses annotations into a comma delimited list of unique values. The action “cat” concatenates the values together into a single string.
ParseHeaders delete¶
Deletes fields from header annotations
usage: ParseHeaders delete [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--failed] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER] -f FIELDS
[FIELDS ...]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
-f
<fields>
¶ List of fields to delete.
ParseHeaders expand¶
Expands annotation fields with multiple values
usage: ParseHeaders expand [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--failed] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER] -f FIELDS
[FIELDS ...] [--sep SEPARATOR]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
-f
<fields>
¶ List of fields to expand.
-
--sep
<separator>
¶ The character separating each value in the fields.
ParseHeaders merge¶
Merge multiple annotations fields into a single field
usage: ParseHeaders merge [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--failed] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER] -f FIELDS
[FIELDS ...] -k NAME [--act {min,max,sum,set,cat}]
[--delete]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
-f
<fields>
¶ List of fields to merge.
-
-k
<name>
¶ Name for the merged field. If the new field is already present, the merged fields will be merged into the existing field.
-
--act
{min,max,sum,set,cat}
¶ List of collapse actions to take on the new field following the merge defining how to combine the annotations into a single value. The actions “min”, “max”, “sum” perform the corresponding mathematical operation on numeric annotations. The action “set” collapses annotations into a comma delimited list of unique values. The action “cat” concatenates the values together into a single string.
-
--delete
¶
If specified, delete the field that were merged from the output header.
ParseHeaders rename¶
Renames header annotation fields
usage: ParseHeaders rename [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--failed] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER] -f FIELDS
[FIELDS ...] -k NAMES [NAMES ...]
[--act {min,max,sum,first,last,set,cat} [{min,max,sum,first,last,set,cat} ...]]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
-f
<fields>
¶ List of fields to rename.
-
-k
<names>
¶ List of new names for each field. If the new field is already present, the renamed field will be merged into the existing field and the old field will be deleted.
-
--act
{min,max,sum,first,last,set,cat}
¶ List of collapse actions to take on each new field following the rename operation defining how each annotation will be combined into a single value. The actions “min”, “max”, “sum” perform the corresponding mathematical operation on numeric annotations. The actions “first” and “last” choose the value from the corresponding position in the annotation. The action “set” collapses annotations into a comma delimited list of unique values. The action “cat” concatenates the values together into a single string.
ParseHeaders table¶
Writes sequence headers to a table
usage: ParseHeaders table [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--failed]
[--delim DELIMITER DELIMITER DELIMITER] -f FIELDS
[FIELDS ...]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
-f
<fields>
¶ List of fields to collect. The sequence identifier may be specified using the hidden field name “ID”.
ParseLog¶
Parses records in the console log of pRESTO modules
usage: ParseLog [--version] [-h] [-o OUT_FILES [OUT_FILES ...]]
[--outdir OUT_DIR] [--outname OUT_NAME]
[--delim DELIMITER DELIMITER DELIMITER] -l RECORD_FILES
[RECORD_FILES ...] -f FIELDS [FIELDS ...]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
-l
<record_files>
¶ List of log files to parse.
-
-f
<fields>
¶ List of fields to collect. The sequence identifier may be specified using the hidden field name “ID”.
- output files:
- table
- tab delimited table of the selected annotations.
- output annotation fields:
- <user defined>
- annotation fields specified by the -f argument.
SplitSeq¶
Sorts, samples and splits FASTA/FASTQ sequence files
usage: SplitSeq [--version] [-h] ...
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
- output files:
- part<part>
- reads partitioned by count, where <part> is the partition number.
- <field>-<value>
- reads partitioned by annotation <field> and <value>.
- under-<number>
- reads partitioned by numeric threshold where the annotation value is strictly less than the threshold <number>.
- atleast-<number>
- reads partitioned by numeric threshold where the annotation value is greater than or equal to the threshold <number>.
- sorted
- reads sorted by annotation value.
- sorted-part<part>
- reads sorted by annotation value and partitioned by count, where <part> is the partition number.
- sample<i>-n<count>
- randomly sampled reads where <i> is a number specifying the sampling instance and <count> is the number of sampled reads.
- selected
- reads passing selection criteria.
- output annotation fields:
- None
SplitSeq count¶
Splits sequences files by number of records.
usage: SplitSeq count [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[--outdir OUT_DIR] [--outname OUT_NAME] [--fasta] -n
MAX_COUNT
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
-n
<max_count>
¶ Maximum number of sequences in each new file
SplitSeq group¶
Splits sequences files by annotation.
usage: SplitSeq group [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[--outdir OUT_DIR] [--outname OUT_NAME] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER] -f FIELD
[--num THRESHOLD]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
-f
<field>
¶ Annotation field to split sequence files by
-
--num
<threshold>
¶ Specify to define the split field as numeric and group sequences by value.
SplitSeq sample¶
Randomly samples from unpaired sequences files.
usage: SplitSeq sample [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[--outdir OUT_DIR] [--outname OUT_NAME] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER] -n MAX_COUNT
[MAX_COUNT ...] [-f FIELD] [-u VALUES [VALUES ...]]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
-n
<max_count>
¶ Maximum number of sequences to sample from each file, field or annotation set. The default behavior, without the -f argument, is to sample from the complete set of sequences in the input file.
-
-f
<field>
¶ The annotation field for sampling criteria. If the -u argument is not also specified, then sampling will be performed for each unique annotation value in the declared field separately.
-
-u
<values>
¶ If specified, sampling will be restricted to sequences that contain one of the declared annotation values in the specified field. Requires the -f argument.
SplitSeq samplepair¶
Randomly samples from paired-end sequences files.
usage: SplitSeq samplepair [--version] [-h] -1 SEQ_FILES_1 [SEQ_FILES_1 ...]
-2 SEQ_FILES_2 [SEQ_FILES_2 ...] [--outdir OUT_DIR]
[--outname OUT_NAME] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER] -n
MAX_COUNT [MAX_COUNT ...] [-f FIELD]
[-u VALUES [VALUES ...]]
[--coord {illumina,solexa,sra,454,presto}]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-1
<seq_files_1>
¶ An ordered list of FASTA/FASTQ files containing head/primary sequences.
-
-2
<seq_files_2>
¶ An ordered list of FASTA/FASTQ files containing tail/secondary sequences.
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
-n
<max_count>
¶ Maximum number of paired sequences to sample from each set of files, fields or annotations. The default behavior, without the -f argument, is to sample from the complete set of paired sequences in the input files.
-
-f
<field>
¶ The annotation field for sampling criteria. If the -u argument is not also specified, then sampling will be performed for each unique annotation value in the declared field separately.
-
-u
<values>
¶ If specified, sampling will be restricted to sequences that contain one of the declared annotation values in the specified field. Requires the -f argument.
-
--coord
{illumina,solexa,sra,454,presto}
¶ The format of the sequence identifier which defines shared coordinate information across paired read files.
SplitSeq select¶
Selects sequences from sequence files by annotation.
usage: SplitSeq select [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER] -f FIELD
[-u VALUE_LIST [VALUE_LIST ...] | -t VALUE_FILE]
[--not]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
-f
<field>
¶ The annotation field for selection criteria.
-
-u
<value_list>
¶ A list of values to select for in the specified field. Mutually exclusive with -t.
-
-t
<value_file>
¶ A tab delimited file specifying values to select for in the specified field. The file must be formatted with the given field name in the header row. Values will be taken from that column. Mutually exclusive with -u.
-
--not
¶
If specified, will perform negative matching. Meaning, sequences will be selected if they fail to match for all specified values.
SplitSeq sort¶
Sorts sequences files by annotation.
usage: SplitSeq sort [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[--outdir OUT_DIR] [--outname OUT_NAME] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER] -f FIELD
[-n MAX_COUNT] [--num]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
-f
<field>
¶ The annotation field to sort sequences by.
-
-n
<max_count>
¶ Maximum number of sequences in each new file.
-
--num
¶
Specify to define the sort field as numeric rather than textual.
UnifyHeaders¶
Unifies annotation fields based on grouping scheme
usage: UnifyHeaders [--version] [-h] ...
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
- output files:
- unify-pass
- Reads passing annotation filtering or consensus.
- unify-fail
- Reading failing filtering.
- output annotation fields:
- <user defined>
- annotation fields specified by the -f and -k arguments.
UnifyHeaders consensus¶
Reassign fields to consensus values.
usage: UnifyHeaders consensus [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]]
[--outdir OUT_DIR] [--outname OUT_NAME]
[--log LOG_FILE] [--failed] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER]
[--nproc NPROC] [-f SET_FIELD] [-k UNIFY_FIELD]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--log
<log_file>
¶ Specify to write verbose logging to a file. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
--nproc
<nproc>
¶ The number of simultaneous computational processes to execute (CPU cores to utilized).
-
-f
<set_field>
¶ The annotation field containing annotations, such as the UMI barcode, for sequence grouping.
-
-k
<unify_field>
¶ The name of the annotation field to find a consensus for per each sequence group.
UnifyHeaders delete¶
Delete sequences with differing field values.
usage: UnifyHeaders delete [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--log LOG_FILE] [--failed]
[--fasta] [--delim DELIMITER DELIMITER DELIMITER]
[--nproc NPROC] [-f SET_FIELD] [-k UNIFY_FIELD]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-s
<seq_files>
¶ A list of FASTA/FASTQ files containing sequences to process.
-
-o
<out_files>
¶ Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--log
<log_file>
¶ Specify to write verbose logging to a file. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--fasta
¶
Specify to force output as FASTA rather than FASTQ.
-
--delim
<delimiter>
¶ A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
-
--nproc
<nproc>
¶ The number of simultaneous computational processes to execute (CPU cores to utilized).
-
-f
<set_field>
¶ The annotation field containing annotations, such as the UMI barcode, for sequence grouping.
-
-k
<unify_field>
¶ The name of the annotation field to find a consensus for per each sequence group.
API¶
presto.Annotation¶
Annotation functions
-
presto.Annotation.
addHeader
(header, fields, values, delimiter=('|', '=', ', '))¶ Adds fields and values to a sequence header
Parameters: - header – an annotation dictionary returned by parseAnnotation.
- fields – the list of fields to add or append to.
- values – the list of annotation values to add for each field.
- delimiter – a tuple of delimiters for (fields, values, value lists).
Returns: modified header dictionary.
Return type:
-
presto.Annotation.
annotationConsensus
(seq_iter, field, delimiter=('|', '=', ', '))¶ Calculate a consensus annotation for a set of sequences
Parameters: - seq_iter – an iterator or list of SeqRecord objects
- field – the annotation field to take a consensus of
- delimiter – a tuple of delimiters for (annotations, field/values, value lists)
Returns: - Dictionary with keys
set containing a list of unique annotation values, count containing annotation counts, cons containing the consensus annotation, freq containing the majority annotation frequency
Return type:
-
presto.Annotation.
collapseAnnotation
(ann_dict, action, fields=None, delimiter=('|', '=', ', '))¶ Collapses multiple annotations into new single annotations for each field
Parameters: - ann_dict – dictionary of field/value pairs
- action – collapse action to take; one of {min, max, sum, first, last, set, cat}
- fields – subset of ann_dict to _collapse; if None, collapse all but the ID field
- delimiter – Tuple of delimiters for (fields, values, value lists)
Returns: Modified field dictionary
Return type: OrderedDict
-
presto.Annotation.
collapseHeader
(header, fields, actions, delimiter=('|', '=', ', '))¶ Collapses a sequence header
Parameters: - header – an annotation dictionary returned by parseAnnotation.
- fields – the list of fields to collapse.
- actions – the list of collapse action take; one of (max, min, sum, first, last, set, cat) for each field.
- delimiter – a tuple of delimiters for (fields, values, value lists).
Returns: modified header dictionary.
Return type:
-
presto.Annotation.
convert454Header
(desc)¶ Parses 454 headers into the pRESTO format
Parameters: desc (str) – a sequence description string. Returns: a dictionary of header field and value pairs. Return type: dict Examples
New style 454 header:
@<accession> <length=##> @GXGJ56Z01AE06X length=222
Old style 454 header:
@<rank_x_y> <length=##> <uaccno=accession> @000034_0199_0169 length=437 uaccno=GNDG01201ARRCR
-
presto.Annotation.
convertGenbankHeader
(desc, delimiter=('|', '=', ', '))¶ Converts GenBank and RefSeq headers into the pRESTO format
Parameters: Returns: a dictionary of header field and value pairs.
Return type: Examples
New style GenBank header:
<accession>.<version> <description> >CM000663.2 Homo sapiens chromosome 1, GRCh38 reference primary assembly
Old style GenBank header:
gi|<GI record number>|<dbsrc>|<accession>.<version>|<description> >gi|568336023|gb|CM000663.2| Homo sapiens chromosome 1, GRCh38 reference primary assembly
-
presto.Annotation.
convertGenericHeader
(desc, delimiter=('|', '=', ', '))¶ Converts any header to the pRESTO format
Parameters: Returns: a dictionary of header field and value pairs.
Return type:
-
presto.Annotation.
convertIMGTHeader
(desc, simple=False)¶ Converts germline headers from IMGT/GENE-DB into the pRESTO format
Parameters: Returns: a dictionary of header field and value pairs.
Return type: Examples
IMGT header:
>X60503|IGHV1-18*02|Homo sapiens|F|V-REGION|142..417|276 nt|1| | | | |276+24=300|partial in 3'| |
Header contains 15 fields separated by
|
(http://imgt.org/genedb):- IMGT/LIGM-DB accession number(s).
- Gene and allele name.
- Species.
- Functionality.
- Exon(s), region name(s), or extracted label(s).
- Start and end positions in the IMGT/LIGM-DB accession number(s).
- Number of nucleotides in the IMGT/LIGM-DB accession number(s).
- Codon start, or ‘NR’ (not relevant) for non coding labels and out-of-frame pseudogenes.
- Number of nucleotides added in
5'
compared to the corresponding label extracted from IMGT/LIGM-DB. - Number of nucleotides added or removed in
3'
compared to the corresponding label extracted from IMGT/LIGM-DB. - Number of added, deleted, and/or substituted nucleotides to correct sequencing errors, or ‘not corrected’ if non corrected sequencing errors.
- Number of amino acids (AA). This field indicates that the sequence is in amino acids.
- Number of characters in the sequence. Nucleotides (or AA) plus IMGT gaps.
- Partial (if it is).
- Reverse complementary (if it is).
-
presto.Annotation.
convertIlluminaHeader
(desc)¶ Converts Illumina headers into the pRESTO format
Parameters: desc (str) – a sequence description string. Returns: a dictionary of header field and value pairs. Return type: dict Examples
New style Illumina header:
@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read number>:<is filtered>:<control number>:<index sequence> @MISEQ:132:000000000-A2F3U:1:1101:14340:1555 2:N:0:ATCACG
Old style Illumina header:
@<instrument>:<flowcell lane>:<tile>:<x-pos>:<y-pos>#<index sequence>/<read number> @HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
-
presto.Annotation.
convertMIGECHeader
(desc)¶ Parses headers from the MIGEC tool into the pRESTO format
Parameters: desc (str) – a sequence description string. Returns: a dictionary of header field and value pairs. Return type: dict Examples
MIGEC header:
@MIG UMI:<UMI sequence>:<consensus read count> @MIG UMI:TCGGCCAACAAA:8
-
presto.Annotation.
convertSRAHeader
(desc)¶ Parses NCBI SRA or EMBL-EBI ENA headers into the pRESTO format
Parameters: desc (str) – a sequence description string. Returns: a dictionary of header field and value pairs. Return type: dict Examples
Header from
fastq-dump --split-files
:@<accession>.<spot> <original sequence description> <length=#> @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36 @SRR1383326.1 1 length=250
Header from
fastq-dump --split-files -I
:@<accession>.<spot>.<read number> <original sequence description> <length=#> @SRR1383326.1.1 1 length=250
Header from ENA:
@<accession>.<spot> <original sequence description> @ERR220397.1 HKSQ1MM01DXT2W/3 @ERR346596.1 BS-DSFCONTROL04:4:000000000-A3F0Y:1:1101:12758:1640/1 @ERR346596.1 BS-DSFCONTROL04:4:000000000-A3F0Y:1:1101:12758:1640/2
-
presto.Annotation.
copyHeader
(header, fields, names, actions=None, delimiter=('|', '=', ', '))¶ Copies fields in a sequence header
Parameters: - header – an annotation dictionary returned by parseAnnotation.
- fields – a list of the field names to copy.
- names – a list of the new field names.
- actions – the list of collapse action take after the copy; one of (max, min, sum, first, last, set, cat) for each field.
- delimiter – a tuple of delimiters for (fields, values, value lists).
Returns: modified header dictionary.
Return type:
-
presto.Annotation.
deleteHeader
(header, fields, delimiter=('|', '=', ', '))¶ Deletes fields from a sequence header
Parameters: - header – an annotation dictionary returned by parseAnnotation.
- fields – the list of fields to delete.
- delimiter – a tuple of delimiters for (fields, values, value lists).
Returns: modified header dictionary
Return type:
-
presto.Annotation.
expandHeader
(header, fields, separator=', ', delimiter=('|', '=', ', '))¶ Splits and annotation value into separate fields in a sequence header
Parameters: - header – an annotation dictionary returned by parseAnnotation.
- fields – the field to split.
- separator – the delimiter to split the values by.
- delimiter – a tuple of delimiters for (fields, values, value lists).
Returns: modified header dictionary.
Return type:
-
presto.Annotation.
flattenAnnotation
(ann_dict, delimiter=('|', '=', ', '))¶ Converts annotations from a dictionary to a FASTA/FASTQ sequence description
Parameters: - ann_dict – Dictionary of field/value pairs
- delimiter – Tuple of delimiters for (fields, values, value lists)
Returns: Formatted sequence description string
Return type:
-
presto.Annotation.
getAnnotationValues
(seq_iter, field, unique=False, delimiter=('|', '=', ', '))¶ Gets the set of unique annotation values in a sequence set
Parameters: - seq_iter – Iterator or list of SeqRecord objects
- field – Annotation field to retrieve values for
- unique – If True return a list of only the unique values; if False return a list of all values
- delimiter – Tuple of delimiters for (fields, values, value lists)
Returns: List of values for the field
Return type:
-
presto.Annotation.
getCoordKey
(header, coord_type='presto', delimiter=('|', '=', ', '))¶ Return the coordinate identifier for a sequence description
Parameters: - header – Sequence header string
- coord_type – Sequence header format; one of [‘illumina’, ‘solexa’, ‘sra’, ‘454’, ‘presto’]; if unrecognized type or None return sequence ID.
- delimiter – Tuple of delimiters for (fields, values, value lists)
Returns: Coordinate identifier as a string
Return type:
-
presto.Annotation.
mergeAnnotation
(ann_dict_1, ann_dict_2, prepend=False, delimiter=('|', '=', ', '))¶ Merges non-ID field annotations from one field dictionary into another
Parameters: - ann_dict_1 – Dictionary of field/value pairs to append to
- ann_dict_2 – Dictionary of field/value pairs to merge with ann_dict_2
- prepend – If True then add ann_dict_2 values to the front of any ann_dict_1 values that are already present, rather than the default behavior of appending ann_dict_2 values.
- delimiter – Tuple of delimiters for (fields, values, value lists)
Returns: Modified ann_dict_1 dictonary of field/value pairs
Return type: OrderedDict
-
presto.Annotation.
mergeHeader
(header, fields, name, action=None, delete=False, delimiter=('|', '=', ', '))¶ Merges fields in a sequence header
Parameters: - header – an annotation dictionary returned by parseAnnotation.
- fields – a list of the field names to merge.
- name – the name of the new field.
- delete – if True delete the merged fields.
- actions – the list of collapse action take after the merge one of (max, min, sum, first, last, set, cat).
- delimiter – a tuple of delimiters for (fields, values, value lists)
Returns: modified header dictionary.
Return type:
-
presto.Annotation.
parseAnnotation
(record, fields=None, delimiter=('|', '=', ', '))¶ Extracts annotations from a FASTA/FASTQ sequence description
Parameters: - record – Description string to extract annotations from
- fields – List of fields to subset the return dictionary to; if None return all fields
- delimiter – a tuple of delimiters for (fields, values, value lists)
Returns: An OrderedDict of field/value pairs
Return type: OrderedDict
-
presto.Annotation.
parseLog
(record)¶ Parses an pRESTO log record
Parameters: record (str) – a string of lines representing a log record including newline characters. Returns: parsed log contain field and values pairs as a dictionary. Return type: collections.OrderedDict
-
presto.Annotation.
renameAnnotation
(ann_dict, old_field, new_field, delimiter=('|', '=', ', '))¶ Renames an annotation and merges annotations if the new name already exists
Parameters: - ann_dict – Dictionary of field/value pairs
- old_field – Old field name
- new_field – New field name
- delimiter – Tuple of delimiters for (fields, values, value lists)
Returns: Modified fields dictonary
Return type: OrderedDict
-
presto.Annotation.
renameHeader
(header, fields, names, actions=None, delimiter=('|', '=', ', '))¶ Renames fields in a sequence header
Parameters: - header – an annotation dictionary returned by parseAnnotation.
- fields – a list of the current field names.
- names – a list of the new field names.
- actions – the list of collapse action take after the rename; one of (max, min, sum, first, last, set, cat) for each field.
- delimiter – a tuple of delimiters for (fields, values, value lists).
Returns: modified header dictionary.
Return type:
presto.Applications¶
External application wrappers
-
presto.Applications.
makeBlastnDb
(ref_file, db_exec='makeblastdb')¶ Makes a ublast database file
Parameters: - ref_file – the path to the reference database file
- db_exec – the path to the makeblastdb executable
Returns: (name and location of the database, handle of the tempfile.TemporaryDirectory)
Return type:
-
presto.Applications.
makeUBlastDb
(ref_file, db_exec='usearch')¶ Makes a ublast database file
Parameters: - ref_file – path to the reference database file.
- db_exec – path to the usearch executable.
Returns: (location of the database, handle of the tempfile.NamedTemporaryFile)
Return type:
-
presto.Applications.
runBlastn
(seq, database, evalue=1e-05, max_hits=100, aligner_exec='blastn')¶ Aligns a sequence against a reference database using BLASTN
Parameters: - seq – a list of SeqRecord objects to align.
- database – the path and name of the blastn database.
- evalue – the E-value cut-off.
- maxhits – the maximum number of hits returned.
- aligner_exec – the path to the blastn executable.
Returns: Alignment results.
Return type: pandas.DataFrame
-
presto.Applications.
runCDHit
(seq_list, ident=0.9, length_ratio=0.0, seq_start=0, seq_end=None, max_memory=3000, threads=1, cluster_exec='cd-hit-est')¶ Cluster a set of sequences using CD-HIT
Parameters: - seq_list (list) – a list of SeqRecord objects to align.
- ident (float) – the sequence identity cutoff to be passed to cd-hit-est.
- length_ratio (float) – cd-hit-est parameter defining the minimum short/long length ratio allowed within a cluster.
- seq_start (int) – the start position to trim sequences at before clustering.
- seq_end (int) – the end position to trim sequences at before clustering.
- max_memory (int) – cd-hit-est max memory limit (Mb)
- threads (int) – number of threads for cd-hit-est.
- cluster_exec (str) – the path to the cd-hit-est executable.
Returns: {cluster id: list of sequence ids}.
Return type:
-
presto.Applications.
runMuscle
(seq_list, aligner_exec='muscle')¶ Multiple aligns a set of sequences using MUSCLE
Parameters: - seq_list – a list of SeqRecord objects to align
- aligner_exec – the MUSCLE executable
Returns: Multiple alignment results.
Return type: Bio.Align.MultipleSeqAlignment
-
presto.Applications.
runUBlast
(seq, database, evalue=1e-05, max_hits=100, aligner_exec='usearch')¶ Aligns a sequence against a reference database using the usearch_local algorithm of USEARCH
Parameters: - seq – a list of SeqRecord objects to align.
- database – the path to the ublast database or a fasta file.
- evalue – the E-value cut-off.
- maxhits – the maximum number of hits returned.
- aligner_exec – the path to the usearch executable.
Returns: Alignment results.
Return type: pandas.DataFrame
-
presto.Applications.
runUClust
(seq_list, ident=0.9, length_ratio=0.0, seq_start=0, seq_end=None, threads=1, cluster_exec='usearch')¶ Cluster a set of sequences using the UCLUST algorithm from USEARCH
Parameters: - seq_list (list) – a list of SeqRecord objects to align.
- ident (float) – the sequence identity cutoff to be passed to usearch.
- length_ratio (float) – usearch parameter defining the minimum short/long length ratio allowed within a cluster.
- seq_start (int) – the start position to trim sequences at before clustering.
- seq_end (int) – the end position to trim sequences at before clustering.
- threads (int) – number of threads for usearch.
- cluster_exec (str) – the path to the usearch executable.
Returns: {cluster id: list of sequence ids}.
Return type:
presto.Commandline¶
Commandline interface functions
-
class
presto.Commandline.
CommonHelpFormatter
(prog, indent_increment=2, max_help_position=24, width=None)¶ Bases:
argparse.RawDescriptionHelpFormatter
,argparse.ArgumentDefaultsHelpFormatter
Custom argparse.HelpFormatter
-
presto.Commandline.
checkArgs
(parser)¶ Checks that arguments have been provided and prints help if they have not.
Parameters: parser – An argparse.ArgumentParser defining the commandline arguments. Returns: True if arguments are present. Prints help and exits if not. Return type: boolean
-
presto.Commandline.
getCommonArgParser
(seq_in=True, seq_out=True, seq_paired=False, db_in=False, db_out=False, out_file=True, failed=True, log=True, annotation=True, multiproc=False, add_help=True)¶ Defines an ArgumentParser object with common pRESTO arguments
Parameters: - seq_in (bool) – if True include sequence input arguments.
- seq_out (bool) – if True include sequence output arguments.
- seq_paired (bool) – if True defined paired-end sequence input and output arguments.
- db_in (bool) – if True include tab-delimited database input arguments.
- db_out (bool) – if True include tab-delimited database output arguments.
- out_file (bool) – if True add explicit output file name arguments.
- failed (bool) – If True include arguments for output of failed results.
- log (bool) – If True include log arguments.
- annotation (bool) – If True include annotation arguments.
- multiproc (bool) – If True include multiprocessing arguments.
- add_help (bool) – If True add help and version arguments.
Returns: ArgumentParser object.
Return type:
-
presto.Commandline.
parseCommonArgs
(args, in_arg=None, in_types=None)¶ Checks common arguments from getCommonArgParser and transforms output options to a dictionary
Parameters: - args – Argument Namespace defined by ArgumentParser.parse_args
- in_arg – String defining a non-standard input file argument to verify; by default [‘db_files’, ‘seq_files’, ‘seq_files_1’, ‘seq_files_2’, ‘primer_file’] are supported in that order
- in_types – List of types (file extensions as strings) to allow for files in file_arg if None do not check type
Returns: Dictionary copy of args with output arguments embedded in the dictionary out_args
Return type:
presto.IO¶
File I/O and logging functions
-
presto.IO.
countSeqFile
(seq_file)¶ Counts the records in FASTA/FASTQ files
Parameters: seq_file – FASTA or FASTQ file containing sample sequences Returns: Count of records in the sequence file Return type: int
-
presto.IO.
countSeqSets
(seq_file, field='BARCODE', delimiter=('|', '=', ', '))¶ Identifies sets of sequences with the same ID field
Parameters: - seq_file – FASTA or FASTQ file containing sample sequences
- field – Annotation field containing set IDs
- delimiter – Tuple of delimiters for (fields, values, value lists)
Returns: Count of unit set IDs in the sequence file
Return type:
-
presto.IO.
getFileType
(filename)¶ Determines the type of a file by file extension
Parameters: filename – Filename Returns: String defining the sequence type for SeqIO operations Return type: str
-
presto.IO.
getOutputHandle
(in_file, out_label=None, out_dir=None, out_name=None, out_type=None)¶ Opens an output file handle
Parameters: - in_file – Input filename
- out_label – Text to be inserted before the file extension; if None do not add a label
- out_type – the file extension of the output file; if None use input file extension
- out_dir – the output directory; if None use directory of input file
- out_name – the short filename to use for the output file; if None use input file short name
Returns: File handle
Return type: file
-
presto.IO.
printCount
(current, step, start_time=None, task=None, end=False)¶ Prints a progress bar to standard out
Parameters: - current (int) – count of completed tasks.
- step (int) – an int defining the progress increment to print at.
- start_time (time.time) – task start time returned by time.time(); if None do not add run time to progress
- task (str) – name of task to display.
- end (bool) – if True print final log (add newline).
-
presto.IO.
printDebug
(message, debug=True)¶ Prints a debug message to standard error
Parameters:
-
presto.IO.
printError
(message, exit=True)¶ Prints an error to standard error and exits
Parameters:
-
presto.IO.
printLog
(record, handle=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>, inset=None)¶ Formats a dictionary into a log string
Parameters: - record – a dict or OrderedDict of field names mapping to values.
- handle – the file handle to write the log to; if None do not write to file.
- inset – minimum field name inset; if None automatically space field names.
Returns: Formatted multi-line string in the log format.
Return type:
-
presto.IO.
printMessage
(message, start_time=None, end=False, width=25)¶ Prints a progress message to standard out
Parameters: - message – Current task message
- start_time – task start time returned by time.time(); if None do not add run time to progress
- end – If True print final message (add newline)
- width – Maximum number of characters for messages
-
presto.IO.
printProgress
(current, total, step, start_time=None, task=None, end=False)¶ Prints a progress bar to standard out
Parameters: - current (int) – count of completed tasks.
- total (int) – total task count.
- step (float) – float defining the fractional progress increment to print at.
- start_time (float) – task start time returned by time.time(); if None do not add run time to progress
- task (str) – name of task to display.
- end (bool) – if True print final log (add newline).
-
presto.IO.
printWarning
(message)¶ Prints a warning to standard error
Parameters: message (str) – warning message.
-
presto.IO.
readPrimerFile
(primer_file)¶ Processes primer sequences from file
Parameters: primer_file (str) – name of the FASTA file containing primer sequences. Returns: Dictionary mapping primer ID to sequence. Return type: dict
-
presto.IO.
readReferenceFile
(ref_file)¶ Create a dictionary of cleaned and ungapped reference sequences.
Parameters: ref_file – reference sequences in fasta format. Returns: - cleaned and ungapped reference sequences;
- with the key as the sequence ID and value as a Bio.SeqRecord for each reference sequence.
Return type: dict
-
presto.IO.
readSeqFile
(seq_file, index=False, key_func=None)¶ Reads FASTA/FASTQ files
Parameters: - seq_file – FASTA or FASTQ file containing sample sequences
- index – If True return a dictionary from SeqIO.index(); if False return an iterator from SeqIO.parse()
- key_func – the key_function argument to pass to SeqIO.index if index=True
Returns: an interator of SeqRecords if index=False. A dict if True.
Return type: iter
presto.Multiprocessing¶
Multiprocessing functions
-
class
presto.Multiprocessing.
SeqData
(id, data)¶ Bases:
object
Class defining sequence data objects for worker processes
-
id
¶ unique identifier
-
data
¶ single object or a list of data objects.
-
valid
¶ if True data is suitable for processing.
-
-
class
presto.Multiprocessing.
SeqResult
(id, data)¶ Bases:
object
Class defining sequence result objects for collector processes
-
id
¶ unique identifier
-
data
¶ single unprocessed object or a list of unprocessed data objects.
-
results
¶ single processed object or a list of processed data objects.
-
valid
¶ if True processing was successful.
-
log
¶ dictionary containing the processing log.
-
-
presto.Multiprocessing.
collectPairQueue
(alive, result_queue, collect_queue, seq_file_1, seq_file_2, label, out_file=None, out_args={'delimiter': ('|', '=', ', '), 'failed': True, 'log_file': None, 'out_dir': None, 'out_name': None, 'out_type': None, 'separator': ', '})¶ Pulls from results queue, assembles results and manages log and file IO
Parameters: - alive – a multiprocessing.Value boolean controlling whether processing continues; when False function returns.
- result_queue – a multiprocessing.Queue holding worker results.
- collect_queue – a multiprocessing.Queue holding collector return values.
- seq_file_1 – the first sequence file name.
- seq_file_2 – the second sequence file name.
- label – task label used to tag the output files.
- out_file – output file name. Automatically generated from the input file if None.
- out_args – common output argument dictionary from parseCommonArgs.
Returns: adds a dictionary of {log: log object, out_files: output file names} to collect_queue.
Return type:
-
presto.Multiprocessing.
collectSeqQueue
(alive, result_queue, collect_queue, seq_file, label, index_field=None, out_file=None, out_args={'delimiter': ('|', '=', ', '), 'failed': True, 'log_file': None, 'out_dir': None, 'out_name': None, 'out_type': None, 'separator': ', '})¶ Pulls from results queue, assembles results and manages log and file IO
Parameters: - alive – a multiprocessing.Value boolean controlling whether processing continues; when False function returns.
- result_queue – Multiprocessing.Queue holding worker results.
- collect_queue – Multiprocessing.Queue to store collector return values.
- seq_file – sample sequence file name.
- label – task label used to tag the output files.
- out_file – output file name. Automatically generated from the input file if None.
- out_args – Common output argument dictionary from parseCommonArgs.
- index_field – Field defining set membership for sequence sets if None data queue contained individual records.
Returns: - Adds a dictionary with key value pairs to collect_queue containing
’log’ defining a log object, ‘out_files’ defining the output file names
Return type:
-
presto.Multiprocessing.
feedPairQueue
(alive, data_queue, seq_file_1, seq_file_2, coord_type='presto', delimiter=('|', '=', ', '))¶ Feeds the data queue with sequence pairs for processQueue processes
Parameters: - alive – a multiprocessing.Value boolean controlling whether processing continues; when False function returns
- data_queue – an multiprocessing.Queue to hold data for processing
- seq_file_1 – the name of sequence file 1
- seq_file_2 – the name of sequence file 2
- coord_type – the sequence header format
- delimiter – a tuple of delimiters for (fields, values, value lists)
Returns: None
-
presto.Multiprocessing.
feedSeqQueue
(alive, data_queue, seq_file, index_func=None, index_args={})¶ Feeds the data queue with SeqRecord objects
Parameters: - alive – multiprocessing.Value boolean controlling whether processing continues; when False function returns
- data_queue – multiprocessing.Queue to hold data for processing
- seq_file – Sequence file to read input from
- index_func – Function to use to define sequence sets if None do not index sets and feed individual records
- index_args – Dictionary of arguments to pass to index_func
Returns: None
-
presto.Multiprocessing.
manageProcesses
(feed_func, work_func, collect_func, feed_args={}, work_args={}, collect_args={}, nproc=None, queue_size=None)¶ Manages feeder, worker and collector processes
Parameters: - feed_func (function) – Data Queue feeder function.
- work_func (function) – Worker function.
- collect_func (function) – Result Queue collector function.
- feed_args (dict) – Dictionary of arguments to pass to feed_func.
- work_args (dict) – Dictionary of arguments to pass to work_func.
- collect_args (dict) – Dictionary of arguments to pass to collect_func.
- nproc (int) – Number of processQueue processes; if None defaults to the number of CPUs
- queue_size (int) – Maximum size of the argument queue; if None defaults to 2*nproc
Returns: Dictionary of collector results
Return type:
-
presto.Multiprocessing.
processSeqQueue
(alive, data_queue, result_queue, process_func, process_args={})¶ Pulls from data queue, performs calculations, and feeds results queue
Parameters: - alive – multiprocessing.Value boolean controlling whether processing continues; when False function returns
- data_queue – multiprocessing.Queue holding data to process
- result_queue – multiprocessing.Queue to hold processed results
- process_func – function to use for processing sequences
- process_args – Dictionary of arguments to pass to process_func
Returns: None
presto.Sequence¶
Sequence processing functions
-
class
presto.Sequence.
AssemblyRecord
(seq=None)¶ Bases:
object
A class defining a paired-end assembly result
-
overlap
¶
-
-
class
presto.Sequence.
AssemblyStats
(n)¶ Bases:
object
Class containing p-value and z-score matrices for scoring assemblies
-
class
presto.Sequence.
PrimerAlignment
(seq=None)¶ Bases:
object
A class defining a primer alignment result
-
presto.Sequence.
alignAssembly
(head_seq, tail_seq, alpha=1e-05, max_error=0.3, min_len=8, max_len=1000, scan_reverse=False, assembly_stats=None, score_dict={('-', '-'): 0, ('-', '.'): 0, ('-', 'A'): 0, ('-', 'B'): 0, ('-', 'C'): 0, ('-', 'D'): 0, ('-', 'G'): 0, ('-', 'H'): 0, ('-', 'K'): 0, ('-', 'M'): 0, ('-', 'N'): 0, ('-', 'R'): 0, ('-', 'S'): 0, ('-', 'T'): 0, ('-', 'V'): 0, ('-', 'W'): 0, ('-', 'Y'): 0, ('.', '-'): 0, ('.', '.'): 0, ('.', 'A'): 0, ('.', 'B'): 0, ('.', 'C'): 0, ('.', 'D'): 0, ('.', 'G'): 0, ('.', 'H'): 0, ('.', 'K'): 0, ('.', 'M'): 0, ('.', 'N'): 0, ('.', 'R'): 0, ('.', 'S'): 0, ('.', 'T'): 0, ('.', 'V'): 0, ('.', 'W'): 0, ('.', 'Y'): 0, ('A', '-'): 0, ('A', '.'): 0, ('A', 'A'): 1, ('A', 'B'): 0, ('A', 'C'): 0, ('A', 'D'): 1, ('A', 'G'): 0, ('A', 'H'): 1, ('A', 'K'): 0, ('A', 'M'): 1, ('A', 'N'): 1, ('A', 'R'): 1, ('A', 'S'): 0, ('A', 'T'): 0, ('A', 'V'): 1, ('A', 'W'): 1, ('A', 'Y'): 0, ('B', '-'): 0, ('B', '.'): 0, ('B', 'A'): 0, ('B', 'B'): 1, ('B', 'C'): 1, ('B', 'D'): 1, ('B', 'G'): 1, ('B', 'H'): 1, ('B', 'K'): 1, ('B', 'M'): 1, ('B', 'N'): 1, ('B', 'R'): 1, ('B', 'S'): 1, ('B', 'T'): 1, ('B', 'V'): 1, ('B', 'W'): 1, ('B', 'Y'): 1, ('C', '-'): 0, ('C', '.'): 0, ('C', 'A'): 0, ('C', 'B'): 1, ('C', 'C'): 1, ('C', 'D'): 0, ('C', 'G'): 0, ('C', 'H'): 1, ('C', 'K'): 0, ('C', 'M'): 1, ('C', 'N'): 1, ('C', 'R'): 0, ('C', 'S'): 1, ('C', 'T'): 0, ('C', 'V'): 1, ('C', 'W'): 0, ('C', 'Y'): 1, ('D', '-'): 0, ('D', '.'): 0, ('D', 'A'): 1, ('D', 'B'): 1, ('D', 'C'): 0, ('D', 'D'): 1, ('D', 'G'): 1, ('D', 'H'): 1, ('D', 'K'): 1, ('D', 'M'): 1, ('D', 'N'): 1, ('D', 'R'): 1, ('D', 'S'): 1, ('D', 'T'): 1, ('D', 'V'): 1, ('D', 'W'): 1, ('D', 'Y'): 1, ('G', '-'): 0, ('G', '.'): 0, ('G', 'A'): 0, ('G', 'B'): 1, ('G', 'C'): 0, ('G', 'D'): 1, ('G', 'G'): 1, ('G', 'H'): 0, ('G', 'K'): 1, ('G', 'M'): 0, ('G', 'N'): 1, ('G', 'R'): 1, ('G', 'S'): 1, ('G', 'T'): 0, ('G', 'V'): 1, ('G', 'W'): 0, ('G', 'Y'): 0, ('H', '-'): 0, ('H', '.'): 0, ('H', 'A'): 1, ('H', 'B'): 1, ('H', 'C'): 1, ('H', 'D'): 1, ('H', 'G'): 0, ('H', 'H'): 1, ('H', 'K'): 1, ('H', 'M'): 1, ('H', 'N'): 1, ('H', 'R'): 1, ('H', 'S'): 1, ('H', 'T'): 1, ('H', 'V'): 1, ('H', 'W'): 1, ('H', 'Y'): 1, ('K', '-'): 0, ('K', '.'): 0, ('K', 'A'): 0, ('K', 'B'): 1, ('K', 'C'): 0, ('K', 'D'): 1, ('K', 'G'): 1, ('K', 'H'): 1, ('K', 'K'): 1, ('K', 'M'): 0, ('K', 'N'): 1, ('K', 'R'): 1, ('K', 'S'): 1, ('K', 'T'): 1, ('K', 'V'): 1, ('K', 'W'): 1, ('K', 'Y'): 1, ('M', '-'): 0, ('M', '.'): 0, ('M', 'A'): 1, ('M', 'B'): 1, ('M', 'C'): 1, ('M', 'D'): 1, ('M', 'G'): 0, ('M', 'H'): 1, ('M', 'K'): 0, ('M', 'M'): 1, ('M', 'N'): 1, ('M', 'R'): 1, ('M', 'S'): 1, ('M', 'T'): 0, ('M', 'V'): 1, ('M', 'W'): 1, ('M', 'Y'): 1, ('N', '-'): 1, ('N', '.'): 1, ('N', 'A'): 1, ('N', 'B'): 1, ('N', 'C'): 1, ('N', 'D'): 1, ('N', 'G'): 1, ('N', 'H'): 1, ('N', 'K'): 1, ('N', 'M'): 1, ('N', 'N'): 1, ('N', 'R'): 1, ('N', 'S'): 1, ('N', 'T'): 1, ('N', 'V'): 1, ('N', 'W'): 1, ('N', 'Y'): 1, ('R', '-'): 0, ('R', '.'): 0, ('R', 'A'): 1, ('R', 'B'): 1, ('R', 'C'): 0, ('R', 'D'): 1, ('R', 'G'): 1, ('R', 'H'): 1, ('R', 'K'): 1, ('R', 'M'): 1, ('R', 'N'): 1, ('R', 'R'): 1, ('R', 'S'): 1, ('R', 'T'): 0, ('R', 'V'): 1, ('R', 'W'): 1, ('R', 'Y'): 0, ('S', '-'): 0, ('S', '.'): 0, ('S', 'A'): 0, ('S', 'B'): 1, ('S', 'C'): 1, ('S', 'D'): 1, ('S', 'G'): 1, ('S', 'H'): 1, ('S', 'K'): 1, ('S', 'M'): 1, ('S', 'N'): 1, ('S', 'R'): 1, ('S', 'S'): 1, ('S', 'T'): 0, ('S', 'V'): 1, ('S', 'W'): 0, ('S', 'Y'): 1, ('T', '-'): 0, ('T', '.'): 0, ('T', 'A'): 0, ('T', 'B'): 1, ('T', 'C'): 0, ('T', 'D'): 1, ('T', 'G'): 0, ('T', 'H'): 1, ('T', 'K'): 1, ('T', 'M'): 0, ('T', 'N'): 1, ('T', 'R'): 0, ('T', 'S'): 0, ('T', 'T'): 1, ('T', 'V'): 0, ('T', 'W'): 1, ('T', 'Y'): 1, ('V', '-'): 0, ('V', '.'): 0, ('V', 'A'): 1, ('V', 'B'): 1, ('V', 'C'): 1, ('V', 'D'): 1, ('V', 'G'): 1, ('V', 'H'): 1, ('V', 'K'): 1, ('V', 'M'): 1, ('V', 'N'): 1, ('V', 'R'): 1, ('V', 'S'): 1, ('V', 'T'): 0, ('V', 'V'): 1, ('V', 'W'): 1, ('V', 'Y'): 1, ('W', '-'): 0, ('W', '.'): 0, ('W', 'A'): 1, ('W', 'B'): 1, ('W', 'C'): 0, ('W', 'D'): 1, ('W', 'G'): 0, ('W', 'H'): 1, ('W', 'K'): 1, ('W', 'M'): 1, ('W', 'N'): 1, ('W', 'R'): 1, ('W', 'S'): 0, ('W', 'T'): 1, ('W', 'V'): 1, ('W', 'W'): 1, ('W', 'Y'): 1, ('Y', '-'): 0, ('Y', '.'): 0, ('Y', 'A'): 0, ('Y', 'B'): 1, ('Y', 'C'): 1, ('Y', 'D'): 1, ('Y', 'G'): 0, ('Y', 'H'): 1, ('Y', 'K'): 1, ('Y', 'M'): 1, ('Y', 'N'): 1, ('Y', 'R'): 0, ('Y', 'S'): 1, ('Y', 'T'): 1, ('Y', 'V'): 1, ('Y', 'W'): 1, ('Y', 'Y'): 1})¶ Stitches two sequences together by aligning the ends
Parameters: - head_seq – the head SeqRecord.
- head_seq – the tail SeqRecord.
- alpha – the minimum p-value for a valid assembly.
- max_error – the maximum error rate for a valid assembly.
- min_len – minimum length of overlap to test.
- max_len – maximum length of overlap to test.
- scan_reverse – if True allow the head sequence to overhang the end of the tail sequence if False end alignment scan at end of tail sequence or start of head sequence.
- assembly_stats – optional successes by trials numpy.array of p-values.
- score_dict – optional dictionary of character scores in the . form {(char1, char2): score}.
Returns: assembled sequence object.
Return type:
-
presto.Sequence.
calculateDiversity
(seq_list, score_dict=getDNAScoreDict())¶ Determine the average pairwise error rate for a list of sequences
Parameters: - seq_list – List of SeqRecord objects to score
- score_dict – Optional dictionary of alignment scores as {(char1, char2): score}
Returns: Average pairwise error rate for the list of sequences
Return type:
-
presto.Sequence.
calculateSetError
(seq_list, ref_seq, ignore_chars=['n', 'N'], score_dict=getDNAScoreDict())¶ Counts the occurrence of nucleotide mismatches from a reference in a set of sequences
Parameters: - seq_list – list of SeqRecord objects with aligned sequences.
- ref_seq – SeqRecord object containing the reference sequence to match against.
- ignore_chars – list of characters to exclude from mismatch counts.
- score_dict – optional dictionary of alignment scores as {(char1, char2): score}.
Returns: error rate for the set.
Return type:
-
presto.Sequence.
checkSeqEqual
(seq1, seq2, ignore_chars={'-', '.', 'N', 'n'})¶ Determine if two sequences are equal, excluding missing positions
Parameters: - seq1 – SeqRecord object
- seq2 – SeqRecord object
- ignore_chars – Set of characters to ignore
Returns: True if the sequences are equal
Return type:
-
presto.Sequence.
compilePrimers
(primers)¶ Translates IUPAC Ambiguous Nucleotide characters to regular expressions and compiles them
Parameters: key – Dictionary of sequences to translate Returns: Dictionary of compiled regular expressions Return type: dict
-
presto.Sequence.
consensusUnify
(data, field, delimiter=('|', '=', ', '))¶ Reassigns all annotations to the consensus annotation in group
Parameters: - data – SeqData object contain sequences to process.
- field – field containing annotations to collapse.
- delimiter – a tuple of delimiters for (annotations, field/values, value lists).
Returns: modified sequences.
Return type:
-
presto.Sequence.
deleteSeqPositions
(seq, positions)¶ Deletes a list of positions from a SeqRecord
Parameters: - seq – SeqRecord objects
- positions – Set of positions (indices) to delete
Returns: Modified SeqRecord with the specified positions removed
Return type: SeqRecord
-
presto.Sequence.
deletionUnify
(data, field, delimiter=('|', '=', ', '))¶ Removes all sequences with differing annotations in a group
Parameters: - data – SeqData object contain sequences to process.
- field – field containing annotations to collapse.
- delimiter – a tuple of delimiters for (annotations, field/values, value lists).
Returns: modified sequences.
Return type:
-
presto.Sequence.
extractAlignment
(seq_record, start, length, rev_primer=False)¶ Extracts a subsequence from sequence
Parameters: - data – SeqRecord to process.
- start – position where subsequence starts.
- length – the length of the subsequence to extract.
- rev_primer – if True extract relative to the tail end of the sequence.
Returns: extraction results as an alignment object
Return type:
-
presto.Sequence.
filterLength
(data, min_length=250, inner=True, missing_chars='N.n-')¶ Filters sequences by length
Parameters: Returns: SeqResult object.
Return type:
-
presto.Sequence.
filterMissing
(data, max_missing=10, inner=True, missing_chars='N.n-')¶ Filters sequences by number of missing nucleotides
Parameters: Returns: SeqResult object.
Return type:
-
presto.Sequence.
filterQuality
(data, min_qual=0, inner=True, missing_chars='N.n-')¶ Filters sequences by quality score
Parameters: Returns: SeqResult object.
Return type:
-
presto.Sequence.
filterRepeats
(data, max_repeat=15, include_missing=False, inner=True, missing_chars='N.n-')¶ Filters sequences by fraction of ambiguous nucleotides
Parameters: - data (SeqData) – a SeqData object with a single SeqRecord to process.
- max_repeat (int) – the maximum number of allowed repeating characters.
- include_missing (int) – if True count ambiguous character repeats; if False do not consider ambiguous character repeats.
- inner (int) – if True exclude outer missing characters from calculation.
- missing_chars (str) – a string of missing character values.
Returns: SeqResult object.
Return type:
-
presto.Sequence.
findGapPositions
(seq_list, max_gap, gap_chars={'-', '.'})¶ Finds positions in a set of aligned sequences with a high number of gap characters.
Parameters: - seq_list – List of SeqRecord objects with aligned sequences
- max_gap – Float of the maximum gap frequency to consider a position as non-gapped
- gap_chars – Set of characters to consider as gaps
Returns: Positions (indices) with gap frequency greater than max_gap
Return type:
-
presto.Sequence.
frequencyConsensus
(seq_list, min_freq=0.6, ignore_chars={'-', '.', 'N', 'n'})¶ Builds a consensus sequence from a set of sequences
Parameters: - set_seq – List of SeqRecord objects
- min_freq – Frequency cutoff to assign a base
- ignore_chars – Set of characters to exclude when building a consensus sequence
Returns: Consensus SeqRecord object
Return type: SeqRecord
-
presto.Sequence.
getAAScoreDict
(mask_score=None, gap_score=None)¶ Generates a score dictionary
Parameters: - mask_score – Tuple of length two defining scores for all matches against an X character for (a, b), with the score for character (a) taking precedence; if None score symmetrically according to IUPAC character identity
- gap_score – Tuple of length two defining score for all matches against a [-, .] character for (a, b), with the score for character (a) taking precedence; if None score symmetrically according to IUPAC character identity
Returns: Score dictionary with keys (char1, char2) mapping to scores
Return type:
-
presto.Sequence.
getDNAScoreDict
(mask_score=None, gap_score=None)¶ Generates a score dictionary
Parameters: - mask_score – Tuple of length two defining scores for all matches against an N character for (a, b), with the score for character (a) taking precedence; if None score symmetrically according to IUPAC character identity
- gap_score – Tuple of length two defining score for all matches against a [-, .] character for (a, b), with the score for character (a) taking precedence; if None score symmetrically according to IUPAC character identity
Returns: Score dictionary with keys (char1, char2) mapping to scores
Return type:
-
presto.Sequence.
indexSeqSets
(seq_dict, field='BARCODE', delimiter=('|', '=', ', '))¶ Identifies sets of sequences with the same ID field
Parameters: - seq_dict – a dictionary index of sequences returned from SeqIO.index()
- field – the annotation field containing set IDs
- delimiter – a tuple of delimiters for (fields, values, value lists)
Returns: Dictionary mapping set name to a list of record names
Return type:
-
presto.Sequence.
joinAssembly
(head_seq, tail_seq, gap=0, insert_seq=None)¶ Concatenates two sequences
Parameters: - head_seq – the head SeqRecord.
- tail_seq – the tail SeqRecord.
- gap – number of gap characters to insert between head and tail ignored if insert_seq is not None.
- insert_seq – a string or Bio.Seq.Seq object, to insert between the head and tail; if None insert with N characters.
Returns: assembled sequence object.
Return type:
-
presto.Sequence.
localAlignment
(seq_record, primers, primers_regex=None, max_error=0.3, max_len=1000, rev_primer=False, skip_rc=False, gap_penalty=(1, 1), score_dict={('-', '-'): 0, ('-', '.'): 0, ('-', 'A'): 0, ('-', 'B'): 0, ('-', 'C'): 0, ('-', 'D'): 0, ('-', 'G'): 0, ('-', 'H'): 0, ('-', 'K'): 0, ('-', 'M'): 0, ('-', 'N'): 0, ('-', 'R'): 0, ('-', 'S'): 0, ('-', 'T'): 0, ('-', 'V'): 0, ('-', 'W'): 0, ('-', 'Y'): 0, ('.', '-'): 0, ('.', '.'): 0, ('.', 'A'): 0, ('.', 'B'): 0, ('.', 'C'): 0, ('.', 'D'): 0, ('.', 'G'): 0, ('.', 'H'): 0, ('.', 'K'): 0, ('.', 'M'): 0, ('.', 'N'): 0, ('.', 'R'): 0, ('.', 'S'): 0, ('.', 'T'): 0, ('.', 'V'): 0, ('.', 'W'): 0, ('.', 'Y'): 0, ('A', '-'): 0, ('A', '.'): 0, ('A', 'A'): 1, ('A', 'B'): 0, ('A', 'C'): 0, ('A', 'D'): 1, ('A', 'G'): 0, ('A', 'H'): 1, ('A', 'K'): 0, ('A', 'M'): 1, ('A', 'N'): 1, ('A', 'R'): 1, ('A', 'S'): 0, ('A', 'T'): 0, ('A', 'V'): 1, ('A', 'W'): 1, ('A', 'Y'): 0, ('B', '-'): 0, ('B', '.'): 0, ('B', 'A'): 0, ('B', 'B'): 1, ('B', 'C'): 1, ('B', 'D'): 1, ('B', 'G'): 1, ('B', 'H'): 1, ('B', 'K'): 1, ('B', 'M'): 1, ('B', 'N'): 1, ('B', 'R'): 1, ('B', 'S'): 1, ('B', 'T'): 1, ('B', 'V'): 1, ('B', 'W'): 1, ('B', 'Y'): 1, ('C', '-'): 0, ('C', '.'): 0, ('C', 'A'): 0, ('C', 'B'): 1, ('C', 'C'): 1, ('C', 'D'): 0, ('C', 'G'): 0, ('C', 'H'): 1, ('C', 'K'): 0, ('C', 'M'): 1, ('C', 'N'): 1, ('C', 'R'): 0, ('C', 'S'): 1, ('C', 'T'): 0, ('C', 'V'): 1, ('C', 'W'): 0, ('C', 'Y'): 1, ('D', '-'): 0, ('D', '.'): 0, ('D', 'A'): 1, ('D', 'B'): 1, ('D', 'C'): 0, ('D', 'D'): 1, ('D', 'G'): 1, ('D', 'H'): 1, ('D', 'K'): 1, ('D', 'M'): 1, ('D', 'N'): 1, ('D', 'R'): 1, ('D', 'S'): 1, ('D', 'T'): 1, ('D', 'V'): 1, ('D', 'W'): 1, ('D', 'Y'): 1, ('G', '-'): 0, ('G', '.'): 0, ('G', 'A'): 0, ('G', 'B'): 1, ('G', 'C'): 0, ('G', 'D'): 1, ('G', 'G'): 1, ('G', 'H'): 0, ('G', 'K'): 1, ('G', 'M'): 0, ('G', 'N'): 1, ('G', 'R'): 1, ('G', 'S'): 1, ('G', 'T'): 0, ('G', 'V'): 1, ('G', 'W'): 0, ('G', 'Y'): 0, ('H', '-'): 0, ('H', '.'): 0, ('H', 'A'): 1, ('H', 'B'): 1, ('H', 'C'): 1, ('H', 'D'): 1, ('H', 'G'): 0, ('H', 'H'): 1, ('H', 'K'): 1, ('H', 'M'): 1, ('H', 'N'): 1, ('H', 'R'): 1, ('H', 'S'): 1, ('H', 'T'): 1, ('H', 'V'): 1, ('H', 'W'): 1, ('H', 'Y'): 1, ('K', '-'): 0, ('K', '.'): 0, ('K', 'A'): 0, ('K', 'B'): 1, ('K', 'C'): 0, ('K', 'D'): 1, ('K', 'G'): 1, ('K', 'H'): 1, ('K', 'K'): 1, ('K', 'M'): 0, ('K', 'N'): 1, ('K', 'R'): 1, ('K', 'S'): 1, ('K', 'T'): 1, ('K', 'V'): 1, ('K', 'W'): 1, ('K', 'Y'): 1, ('M', '-'): 0, ('M', '.'): 0, ('M', 'A'): 1, ('M', 'B'): 1, ('M', 'C'): 1, ('M', 'D'): 1, ('M', 'G'): 0, ('M', 'H'): 1, ('M', 'K'): 0, ('M', 'M'): 1, ('M', 'N'): 1, ('M', 'R'): 1, ('M', 'S'): 1, ('M', 'T'): 0, ('M', 'V'): 1, ('M', 'W'): 1, ('M', 'Y'): 1, ('N', '-'): 0, ('N', '.'): 0, ('N', 'A'): 0, ('N', 'B'): 0, ('N', 'C'): 0, ('N', 'D'): 0, ('N', 'G'): 0, ('N', 'H'): 0, ('N', 'K'): 0, ('N', 'M'): 0, ('N', 'N'): 0, ('N', 'R'): 0, ('N', 'S'): 0, ('N', 'T'): 0, ('N', 'V'): 0, ('N', 'W'): 0, ('N', 'Y'): 0, ('R', '-'): 0, ('R', '.'): 0, ('R', 'A'): 1, ('R', 'B'): 1, ('R', 'C'): 0, ('R', 'D'): 1, ('R', 'G'): 1, ('R', 'H'): 1, ('R', 'K'): 1, ('R', 'M'): 1, ('R', 'N'): 1, ('R', 'R'): 1, ('R', 'S'): 1, ('R', 'T'): 0, ('R', 'V'): 1, ('R', 'W'): 1, ('R', 'Y'): 0, ('S', '-'): 0, ('S', '.'): 0, ('S', 'A'): 0, ('S', 'B'): 1, ('S', 'C'): 1, ('S', 'D'): 1, ('S', 'G'): 1, ('S', 'H'): 1, ('S', 'K'): 1, ('S', 'M'): 1, ('S', 'N'): 1, ('S', 'R'): 1, ('S', 'S'): 1, ('S', 'T'): 0, ('S', 'V'): 1, ('S', 'W'): 0, ('S', 'Y'): 1, ('T', '-'): 0, ('T', '.'): 0, ('T', 'A'): 0, ('T', 'B'): 1, ('T', 'C'): 0, ('T', 'D'): 1, ('T', 'G'): 0, ('T', 'H'): 1, ('T', 'K'): 1, ('T', 'M'): 0, ('T', 'N'): 1, ('T', 'R'): 0, ('T', 'S'): 0, ('T', 'T'): 1, ('T', 'V'): 0, ('T', 'W'): 1, ('T', 'Y'): 1, ('V', '-'): 0, ('V', '.'): 0, ('V', 'A'): 1, ('V', 'B'): 1, ('V', 'C'): 1, ('V', 'D'): 1, ('V', 'G'): 1, ('V', 'H'): 1, ('V', 'K'): 1, ('V', 'M'): 1, ('V', 'N'): 1, ('V', 'R'): 1, ('V', 'S'): 1, ('V', 'T'): 0, ('V', 'V'): 1, ('V', 'W'): 1, ('V', 'Y'): 1, ('W', '-'): 0, ('W', '.'): 0, ('W', 'A'): 1, ('W', 'B'): 1, ('W', 'C'): 0, ('W', 'D'): 1, ('W', 'G'): 0, ('W', 'H'): 1, ('W', 'K'): 1, ('W', 'M'): 1, ('W', 'N'): 1, ('W', 'R'): 1, ('W', 'S'): 0, ('W', 'T'): 1, ('W', 'V'): 1, ('W', 'W'): 1, ('W', 'Y'): 1, ('Y', '-'): 0, ('Y', '.'): 0, ('Y', 'A'): 0, ('Y', 'B'): 1, ('Y', 'C'): 1, ('Y', 'D'): 1, ('Y', 'G'): 0, ('Y', 'H'): 1, ('Y', 'K'): 1, ('Y', 'M'): 1, ('Y', 'N'): 1, ('Y', 'R'): 0, ('Y', 'S'): 1, ('Y', 'T'): 1, ('Y', 'V'): 1, ('Y', 'W'): 1, ('Y', 'Y'): 1})¶ Performs pairwise local alignment of a list of short sequences against a long sequence
Parameters: - seq_record – a SeqRecord object to align primers against
- primers – dictionary of {names: short IUPAC ambiguous sequence strings}
- primers_regex – optional dictionary of {names: compiled primer regular expressions}
- max_error – maximum acceptable error rate before aligning reverse complement
- max_len – maximum length of sample sequence to align
- rev_primer – if True align with the tail end of the sequence
- skip_rc – if True do not check reverse complement sequences
- gap_penalty – a tuple of positive (gap open, gap extend) penalties
- score_dict – optional dictionary of alignment scores as {(char1, char2): score}
Returns: primer alignment result object
Return type:
-
presto.Sequence.
maskQuality
(data, min_qual=0)¶ Masks characters by in sequence by quality score
Parameters: Returns: SeqResult object.
Return type:
-
presto.Sequence.
maskSeq
(align, mode='mask', barcode=False, barcode_field='BARCODE', primer_field='PRIMER', delimiter=('|', '=', ', '))¶ Create an output sequence with primers masked or cut
Parameters: - align – a PrimerAlignment object.
- mode – defines the action taken; one of ‘cut’, ‘mask’, ‘tag’ or ‘trim’.
- barcode – if True add sequence preceding primer to description.
- barcode_field – name of the output barcode annotation.
- primer_field – name of the output primer annotation.
- delimiter – a tuple of delimiters for (annotations, field/values, value lists).
Returns: masked sequence.
Return type: Bio.SeqRecord.SeqRecord
-
presto.Sequence.
overlapConsensus
(head_seq, tail_seq, ignore_chars={'-', '.', 'N', 'n'})¶ Creates a consensus overlap sequences from two segments
Parameters: - head_seq – the overlap head SeqRecord.
- tail_seq – the overlap tail SeqRecord.
- ignore_chars – list of characters which do not contribute to consensus.
Returns: A SeqRecord object with consensus characters and quality scores.
Return type: SeqRecord
-
presto.Sequence.
qualityConsensus
(seq_list, min_qual=0, min_freq=0.6, dependent=False, ignore_chars={'-', '.', 'N', 'n'})¶ Builds a consensus sequence from a set of sequences
Parameters: - seq_list – List of SeqRecord objects
- min_qual – Quality cutoff to assign a base
- min_freq – Frequency cutoff to assign a base
- dependent – If False assume sequences are independent for quality calculation
- ignore_chars – Set of characters to exclude when building a consensus sequence
Returns: Consensus SeqRecord object
Return type: SeqRecord
-
presto.Sequence.
referenceAssembly
(head_seq, tail_seq, ref_dict, ref_db, min_ident=0.5, evalue=1e-05, max_hits=100, fill=False, aligner='usearch', aligner_exec='usearch', score_dict={('-', '-'): 0, ('-', '.'): 0, ('-', 'A'): 0, ('-', 'B'): 0, ('-', 'C'): 0, ('-', 'D'): 0, ('-', 'G'): 0, ('-', 'H'): 0, ('-', 'K'): 0, ('-', 'M'): 0, ('-', 'N'): 0, ('-', 'R'): 0, ('-', 'S'): 0, ('-', 'T'): 0, ('-', 'V'): 0, ('-', 'W'): 0, ('-', 'Y'): 0, ('.', '-'): 0, ('.', '.'): 0, ('.', 'A'): 0, ('.', 'B'): 0, ('.', 'C'): 0, ('.', 'D'): 0, ('.', 'G'): 0, ('.', 'H'): 0, ('.', 'K'): 0, ('.', 'M'): 0, ('.', 'N'): 0, ('.', 'R'): 0, ('.', 'S'): 0, ('.', 'T'): 0, ('.', 'V'): 0, ('.', 'W'): 0, ('.', 'Y'): 0, ('A', '-'): 0, ('A', '.'): 0, ('A', 'A'): 1, ('A', 'B'): 0, ('A', 'C'): 0, ('A', 'D'): 1, ('A', 'G'): 0, ('A', 'H'): 1, ('A', 'K'): 0, ('A', 'M'): 1, ('A', 'N'): 1, ('A', 'R'): 1, ('A', 'S'): 0, ('A', 'T'): 0, ('A', 'V'): 1, ('A', 'W'): 1, ('A', 'Y'): 0, ('B', '-'): 0, ('B', '.'): 0, ('B', 'A'): 0, ('B', 'B'): 1, ('B', 'C'): 1, ('B', 'D'): 1, ('B', 'G'): 1, ('B', 'H'): 1, ('B', 'K'): 1, ('B', 'M'): 1, ('B', 'N'): 1, ('B', 'R'): 1, ('B', 'S'): 1, ('B', 'T'): 1, ('B', 'V'): 1, ('B', 'W'): 1, ('B', 'Y'): 1, ('C', '-'): 0, ('C', '.'): 0, ('C', 'A'): 0, ('C', 'B'): 1, ('C', 'C'): 1, ('C', 'D'): 0, ('C', 'G'): 0, ('C', 'H'): 1, ('C', 'K'): 0, ('C', 'M'): 1, ('C', 'N'): 1, ('C', 'R'): 0, ('C', 'S'): 1, ('C', 'T'): 0, ('C', 'V'): 1, ('C', 'W'): 0, ('C', 'Y'): 1, ('D', '-'): 0, ('D', '.'): 0, ('D', 'A'): 1, ('D', 'B'): 1, ('D', 'C'): 0, ('D', 'D'): 1, ('D', 'G'): 1, ('D', 'H'): 1, ('D', 'K'): 1, ('D', 'M'): 1, ('D', 'N'): 1, ('D', 'R'): 1, ('D', 'S'): 1, ('D', 'T'): 1, ('D', 'V'): 1, ('D', 'W'): 1, ('D', 'Y'): 1, ('G', '-'): 0, ('G', '.'): 0, ('G', 'A'): 0, ('G', 'B'): 1, ('G', 'C'): 0, ('G', 'D'): 1, ('G', 'G'): 1, ('G', 'H'): 0, ('G', 'K'): 1, ('G', 'M'): 0, ('G', 'N'): 1, ('G', 'R'): 1, ('G', 'S'): 1, ('G', 'T'): 0, ('G', 'V'): 1, ('G', 'W'): 0, ('G', 'Y'): 0, ('H', '-'): 0, ('H', '.'): 0, ('H', 'A'): 1, ('H', 'B'): 1, ('H', 'C'): 1, ('H', 'D'): 1, ('H', 'G'): 0, ('H', 'H'): 1, ('H', 'K'): 1, ('H', 'M'): 1, ('H', 'N'): 1, ('H', 'R'): 1, ('H', 'S'): 1, ('H', 'T'): 1, ('H', 'V'): 1, ('H', 'W'): 1, ('H', 'Y'): 1, ('K', '-'): 0, ('K', '.'): 0, ('K', 'A'): 0, ('K', 'B'): 1, ('K', 'C'): 0, ('K', 'D'): 1, ('K', 'G'): 1, ('K', 'H'): 1, ('K', 'K'): 1, ('K', 'M'): 0, ('K', 'N'): 1, ('K', 'R'): 1, ('K', 'S'): 1, ('K', 'T'): 1, ('K', 'V'): 1, ('K', 'W'): 1, ('K', 'Y'): 1, ('M', '-'): 0, ('M', '.'): 0, ('M', 'A'): 1, ('M', 'B'): 1, ('M', 'C'): 1, ('M', 'D'): 1, ('M', 'G'): 0, ('M', 'H'): 1, ('M', 'K'): 0, ('M', 'M'): 1, ('M', 'N'): 1, ('M', 'R'): 1, ('M', 'S'): 1, ('M', 'T'): 0, ('M', 'V'): 1, ('M', 'W'): 1, ('M', 'Y'): 1, ('N', '-'): 1, ('N', '.'): 1, ('N', 'A'): 1, ('N', 'B'): 1, ('N', 'C'): 1, ('N', 'D'): 1, ('N', 'G'): 1, ('N', 'H'): 1, ('N', 'K'): 1, ('N', 'M'): 1, ('N', 'N'): 1, ('N', 'R'): 1, ('N', 'S'): 1, ('N', 'T'): 1, ('N', 'V'): 1, ('N', 'W'): 1, ('N', 'Y'): 1, ('R', '-'): 0, ('R', '.'): 0, ('R', 'A'): 1, ('R', 'B'): 1, ('R', 'C'): 0, ('R', 'D'): 1, ('R', 'G'): 1, ('R', 'H'): 1, ('R', 'K'): 1, ('R', 'M'): 1, ('R', 'N'): 1, ('R', 'R'): 1, ('R', 'S'): 1, ('R', 'T'): 0, ('R', 'V'): 1, ('R', 'W'): 1, ('R', 'Y'): 0, ('S', '-'): 0, ('S', '.'): 0, ('S', 'A'): 0, ('S', 'B'): 1, ('S', 'C'): 1, ('S', 'D'): 1, ('S', 'G'): 1, ('S', 'H'): 1, ('S', 'K'): 1, ('S', 'M'): 1, ('S', 'N'): 1, ('S', 'R'): 1, ('S', 'S'): 1, ('S', 'T'): 0, ('S', 'V'): 1, ('S', 'W'): 0, ('S', 'Y'): 1, ('T', '-'): 0, ('T', '.'): 0, ('T', 'A'): 0, ('T', 'B'): 1, ('T', 'C'): 0, ('T', 'D'): 1, ('T', 'G'): 0, ('T', 'H'): 1, ('T', 'K'): 1, ('T', 'M'): 0, ('T', 'N'): 1, ('T', 'R'): 0, ('T', 'S'): 0, ('T', 'T'): 1, ('T', 'V'): 0, ('T', 'W'): 1, ('T', 'Y'): 1, ('V', '-'): 0, ('V', '.'): 0, ('V', 'A'): 1, ('V', 'B'): 1, ('V', 'C'): 1, ('V', 'D'): 1, ('V', 'G'): 1, ('V', 'H'): 1, ('V', 'K'): 1, ('V', 'M'): 1, ('V', 'N'): 1, ('V', 'R'): 1, ('V', 'S'): 1, ('V', 'T'): 0, ('V', 'V'): 1, ('V', 'W'): 1, ('V', 'Y'): 1, ('W', '-'): 0, ('W', '.'): 0, ('W', 'A'): 1, ('W', 'B'): 1, ('W', 'C'): 0, ('W', 'D'): 1, ('W', 'G'): 0, ('W', 'H'): 1, ('W', 'K'): 1, ('W', 'M'): 1, ('W', 'N'): 1, ('W', 'R'): 1, ('W', 'S'): 0, ('W', 'T'): 1, ('W', 'V'): 1, ('W', 'W'): 1, ('W', 'Y'): 1, ('Y', '-'): 0, ('Y', '.'): 0, ('Y', 'A'): 0, ('Y', 'B'): 1, ('Y', 'C'): 1, ('Y', 'D'): 1, ('Y', 'G'): 0, ('Y', 'H'): 1, ('Y', 'K'): 1, ('Y', 'M'): 1, ('Y', 'N'): 1, ('Y', 'R'): 0, ('Y', 'S'): 1, ('Y', 'T'): 1, ('Y', 'V'): 1, ('Y', 'W'): 1, ('Y', 'Y'): 1})¶ Stitches two sequences together by aligning against a reference database
Parameters: - head_seq – the head SeqRecord.
- head_seq – the tail SeqRecord.
- ref_dict – a dictionary of reference SeqRecord objects.
- ref_db – the path and name of the reference database.
- min_ident – the minimum identity for a valid assembly.
- evalue – the E-value cut-off for ublast.
- max_hits – the maxhits output limit for ublast.
- fill – if False non-overlapping regions will be assigned Ns; if True non-overlapping regions will be filled with the reference sequence.
- aligner – the alignment tool; one of ‘blastn’ or ‘usearch’.
- aligner_exec – the path to the alignment tool executable.
- score_dict – optional dictionary of character scores in the form {(char1, char2): score}.
Returns: assembled sequence object.
Return type:
-
presto.Sequence.
reverseComplement
(seq)¶ Takes the reverse complement of a sequence
Parameters: seq – a SeqRecord object, Seq object or string to reverse complement Returns: Object of the same type as the input with the reverse complement sequence Return type: Seq
-
presto.Sequence.
scoreAA
(a, b, mask_score=None, gap_score=None)¶ Returns the score for a pair of IUPAC Extended Protein characters
Parameters: - a – First character
- b – Second character
- mask_score – Tuple of length two defining scores for all matches against an X character for (a, b), with the score for character (a) taking precedence; if None score symmetrically according to IUPAC character identity
- gap_score – Tuple of length two defining score for all matches against a gap (-, .) character for (a, b), with the score for character (a) taking precedence; if None score symmetrically according to IUPAC character identity
Returns: Score for the character pair
Return type:
-
presto.Sequence.
scoreAlignment
(seq_record, primers, start=0, rev_primer=False, score_dict={('-', '-'): 0, ('-', '.'): 0, ('-', 'A'): 0, ('-', 'B'): 0, ('-', 'C'): 0, ('-', 'D'): 0, ('-', 'G'): 0, ('-', 'H'): 0, ('-', 'K'): 0, ('-', 'M'): 0, ('-', 'N'): 0, ('-', 'R'): 0, ('-', 'S'): 0, ('-', 'T'): 0, ('-', 'V'): 0, ('-', 'W'): 0, ('-', 'Y'): 0, ('.', '-'): 0, ('.', '.'): 0, ('.', 'A'): 0, ('.', 'B'): 0, ('.', 'C'): 0, ('.', 'D'): 0, ('.', 'G'): 0, ('.', 'H'): 0, ('.', 'K'): 0, ('.', 'M'): 0, ('.', 'N'): 0, ('.', 'R'): 0, ('.', 'S'): 0, ('.', 'T'): 0, ('.', 'V'): 0, ('.', 'W'): 0, ('.', 'Y'): 0, ('A', '-'): 0, ('A', '.'): 0, ('A', 'A'): 1, ('A', 'B'): 0, ('A', 'C'): 0, ('A', 'D'): 1, ('A', 'G'): 0, ('A', 'H'): 1, ('A', 'K'): 0, ('A', 'M'): 1, ('A', 'N'): 1, ('A', 'R'): 1, ('A', 'S'): 0, ('A', 'T'): 0, ('A', 'V'): 1, ('A', 'W'): 1, ('A', 'Y'): 0, ('B', '-'): 0, ('B', '.'): 0, ('B', 'A'): 0, ('B', 'B'): 1, ('B', 'C'): 1, ('B', 'D'): 1, ('B', 'G'): 1, ('B', 'H'): 1, ('B', 'K'): 1, ('B', 'M'): 1, ('B', 'N'): 1, ('B', 'R'): 1, ('B', 'S'): 1, ('B', 'T'): 1, ('B', 'V'): 1, ('B', 'W'): 1, ('B', 'Y'): 1, ('C', '-'): 0, ('C', '.'): 0, ('C', 'A'): 0, ('C', 'B'): 1, ('C', 'C'): 1, ('C', 'D'): 0, ('C', 'G'): 0, ('C', 'H'): 1, ('C', 'K'): 0, ('C', 'M'): 1, ('C', 'N'): 1, ('C', 'R'): 0, ('C', 'S'): 1, ('C', 'T'): 0, ('C', 'V'): 1, ('C', 'W'): 0, ('C', 'Y'): 1, ('D', '-'): 0, ('D', '.'): 0, ('D', 'A'): 1, ('D', 'B'): 1, ('D', 'C'): 0, ('D', 'D'): 1, ('D', 'G'): 1, ('D', 'H'): 1, ('D', 'K'): 1, ('D', 'M'): 1, ('D', 'N'): 1, ('D', 'R'): 1, ('D', 'S'): 1, ('D', 'T'): 1, ('D', 'V'): 1, ('D', 'W'): 1, ('D', 'Y'): 1, ('G', '-'): 0, ('G', '.'): 0, ('G', 'A'): 0, ('G', 'B'): 1, ('G', 'C'): 0, ('G', 'D'): 1, ('G', 'G'): 1, ('G', 'H'): 0, ('G', 'K'): 1, ('G', 'M'): 0, ('G', 'N'): 1, ('G', 'R'): 1, ('G', 'S'): 1, ('G', 'T'): 0, ('G', 'V'): 1, ('G', 'W'): 0, ('G', 'Y'): 0, ('H', '-'): 0, ('H', '.'): 0, ('H', 'A'): 1, ('H', 'B'): 1, ('H', 'C'): 1, ('H', 'D'): 1, ('H', 'G'): 0, ('H', 'H'): 1, ('H', 'K'): 1, ('H', 'M'): 1, ('H', 'N'): 1, ('H', 'R'): 1, ('H', 'S'): 1, ('H', 'T'): 1, ('H', 'V'): 1, ('H', 'W'): 1, ('H', 'Y'): 1, ('K', '-'): 0, ('K', '.'): 0, ('K', 'A'): 0, ('K', 'B'): 1, ('K', 'C'): 0, ('K', 'D'): 1, ('K', 'G'): 1, ('K', 'H'): 1, ('K', 'K'): 1, ('K', 'M'): 0, ('K', 'N'): 1, ('K', 'R'): 1, ('K', 'S'): 1, ('K', 'T'): 1, ('K', 'V'): 1, ('K', 'W'): 1, ('K', 'Y'): 1, ('M', '-'): 0, ('M', '.'): 0, ('M', 'A'): 1, ('M', 'B'): 1, ('M', 'C'): 1, ('M', 'D'): 1, ('M', 'G'): 0, ('M', 'H'): 1, ('M', 'K'): 0, ('M', 'M'): 1, ('M', 'N'): 1, ('M', 'R'): 1, ('M', 'S'): 1, ('M', 'T'): 0, ('M', 'V'): 1, ('M', 'W'): 1, ('M', 'Y'): 1, ('N', '-'): 0, ('N', '.'): 0, ('N', 'A'): 0, ('N', 'B'): 0, ('N', 'C'): 0, ('N', 'D'): 0, ('N', 'G'): 0, ('N', 'H'): 0, ('N', 'K'): 0, ('N', 'M'): 0, ('N', 'N'): 0, ('N', 'R'): 0, ('N', 'S'): 0, ('N', 'T'): 0, ('N', 'V'): 0, ('N', 'W'): 0, ('N', 'Y'): 0, ('R', '-'): 0, ('R', '.'): 0, ('R', 'A'): 1, ('R', 'B'): 1, ('R', 'C'): 0, ('R', 'D'): 1, ('R', 'G'): 1, ('R', 'H'): 1, ('R', 'K'): 1, ('R', 'M'): 1, ('R', 'N'): 1, ('R', 'R'): 1, ('R', 'S'): 1, ('R', 'T'): 0, ('R', 'V'): 1, ('R', 'W'): 1, ('R', 'Y'): 0, ('S', '-'): 0, ('S', '.'): 0, ('S', 'A'): 0, ('S', 'B'): 1, ('S', 'C'): 1, ('S', 'D'): 1, ('S', 'G'): 1, ('S', 'H'): 1, ('S', 'K'): 1, ('S', 'M'): 1, ('S', 'N'): 1, ('S', 'R'): 1, ('S', 'S'): 1, ('S', 'T'): 0, ('S', 'V'): 1, ('S', 'W'): 0, ('S', 'Y'): 1, ('T', '-'): 0, ('T', '.'): 0, ('T', 'A'): 0, ('T', 'B'): 1, ('T', 'C'): 0, ('T', 'D'): 1, ('T', 'G'): 0, ('T', 'H'): 1, ('T', 'K'): 1, ('T', 'M'): 0, ('T', 'N'): 1, ('T', 'R'): 0, ('T', 'S'): 0, ('T', 'T'): 1, ('T', 'V'): 0, ('T', 'W'): 1, ('T', 'Y'): 1, ('V', '-'): 0, ('V', '.'): 0, ('V', 'A'): 1, ('V', 'B'): 1, ('V', 'C'): 1, ('V', 'D'): 1, ('V', 'G'): 1, ('V', 'H'): 1, ('V', 'K'): 1, ('V', 'M'): 1, ('V', 'N'): 1, ('V', 'R'): 1, ('V', 'S'): 1, ('V', 'T'): 0, ('V', 'V'): 1, ('V', 'W'): 1, ('V', 'Y'): 1, ('W', '-'): 0, ('W', '.'): 0, ('W', 'A'): 1, ('W', 'B'): 1, ('W', 'C'): 0, ('W', 'D'): 1, ('W', 'G'): 0, ('W', 'H'): 1, ('W', 'K'): 1, ('W', 'M'): 1, ('W', 'N'): 1, ('W', 'R'): 1, ('W', 'S'): 0, ('W', 'T'): 1, ('W', 'V'): 1, ('W', 'W'): 1, ('W', 'Y'): 1, ('Y', '-'): 0, ('Y', '.'): 0, ('Y', 'A'): 0, ('Y', 'B'): 1, ('Y', 'C'): 1, ('Y', 'D'): 1, ('Y', 'G'): 0, ('Y', 'H'): 1, ('Y', 'K'): 1, ('Y', 'M'): 1, ('Y', 'N'): 1, ('Y', 'R'): 0, ('Y', 'S'): 1, ('Y', 'T'): 1, ('Y', 'V'): 1, ('Y', 'W'): 1, ('Y', 'Y'): 1})¶ Performs a simple fixed position alignment of primers
Parameters: - seq_record – a SeqRecord object to align primers against
- primers – dictionary of {names: short IUPAC ambiguous sequence strings}
- start – position where primer alignment starts
- rev_primer – if True align with the tail end of the sequence
- score_dict – optional dictionary of {(char1, char2): score} alignment scores
Returns: primer alignment result object
Return type:
-
presto.Sequence.
scoreDNA
(a, b, mask_score=None, gap_score=None)¶ Returns the score for a pair of IUPAC Ambiguous Nucleotide characters
Parameters: - a – First characters
- b – Second character
- n_score – Tuple of length two defining scores for all matches against an N character for (a, b), with the score for character (a) taking precedence; if None score symmetrically according to IUPAC character identity
- gap_score – Tuple of length two defining score for all matches against a gap (-, .) character for (a, b), with the score for character (a) taking precedence; if None score symmetrically according to IUPAC character identity
Returns: Score for the character pair
Return type:
-
presto.Sequence.
scoreSeqPair
(seq1, seq2, ignore_chars={}, score_dict={('-', '-'): 1, ('-', '.'): 1, ('-', 'A'): 0, ('-', 'B'): 0, ('-', 'C'): 0, ('-', 'D'): 0, ('-', 'G'): 0, ('-', 'H'): 0, ('-', 'K'): 0, ('-', 'M'): 0, ('-', 'N'): 0, ('-', 'R'): 0, ('-', 'S'): 0, ('-', 'T'): 0, ('-', 'V'): 0, ('-', 'W'): 0, ('-', 'Y'): 0, ('.', '-'): 1, ('.', '.'): 1, ('.', 'A'): 0, ('.', 'B'): 0, ('.', 'C'): 0, ('.', 'D'): 0, ('.', 'G'): 0, ('.', 'H'): 0, ('.', 'K'): 0, ('.', 'M'): 0, ('.', 'N'): 0, ('.', 'R'): 0, ('.', 'S'): 0, ('.', 'T'): 0, ('.', 'V'): 0, ('.', 'W'): 0, ('.', 'Y'): 0, ('A', '-'): 0, ('A', '.'): 0, ('A', 'A'): 1, ('A', 'B'): 0, ('A', 'C'): 0, ('A', 'D'): 1, ('A', 'G'): 0, ('A', 'H'): 1, ('A', 'K'): 0, ('A', 'M'): 1, ('A', 'N'): 1, ('A', 'R'): 1, ('A', 'S'): 0, ('A', 'T'): 0, ('A', 'V'): 1, ('A', 'W'): 1, ('A', 'Y'): 0, ('B', '-'): 0, ('B', '.'): 0, ('B', 'A'): 0, ('B', 'B'): 1, ('B', 'C'): 1, ('B', 'D'): 1, ('B', 'G'): 1, ('B', 'H'): 1, ('B', 'K'): 1, ('B', 'M'): 1, ('B', 'N'): 1, ('B', 'R'): 1, ('B', 'S'): 1, ('B', 'T'): 1, ('B', 'V'): 1, ('B', 'W'): 1, ('B', 'Y'): 1, ('C', '-'): 0, ('C', '.'): 0, ('C', 'A'): 0, ('C', 'B'): 1, ('C', 'C'): 1, ('C', 'D'): 0, ('C', 'G'): 0, ('C', 'H'): 1, ('C', 'K'): 0, ('C', 'M'): 1, ('C', 'N'): 1, ('C', 'R'): 0, ('C', 'S'): 1, ('C', 'T'): 0, ('C', 'V'): 1, ('C', 'W'): 0, ('C', 'Y'): 1, ('D', '-'): 0, ('D', '.'): 0, ('D', 'A'): 1, ('D', 'B'): 1, ('D', 'C'): 0, ('D', 'D'): 1, ('D', 'G'): 1, ('D', 'H'): 1, ('D', 'K'): 1, ('D', 'M'): 1, ('D', 'N'): 1, ('D', 'R'): 1, ('D', 'S'): 1, ('D', 'T'): 1, ('D', 'V'): 1, ('D', 'W'): 1, ('D', 'Y'): 1, ('G', '-'): 0, ('G', '.'): 0, ('G', 'A'): 0, ('G', 'B'): 1, ('G', 'C'): 0, ('G', 'D'): 1, ('G', 'G'): 1, ('G', 'H'): 0, ('G', 'K'): 1, ('G', 'M'): 0, ('G', 'N'): 1, ('G', 'R'): 1, ('G', 'S'): 1, ('G', 'T'): 0, ('G', 'V'): 1, ('G', 'W'): 0, ('G', 'Y'): 0, ('H', '-'): 0, ('H', '.'): 0, ('H', 'A'): 1, ('H', 'B'): 1, ('H', 'C'): 1, ('H', 'D'): 1, ('H', 'G'): 0, ('H', 'H'): 1, ('H', 'K'): 1, ('H', 'M'): 1, ('H', 'N'): 1, ('H', 'R'): 1, ('H', 'S'): 1, ('H', 'T'): 1, ('H', 'V'): 1, ('H', 'W'): 1, ('H', 'Y'): 1, ('K', '-'): 0, ('K', '.'): 0, ('K', 'A'): 0, ('K', 'B'): 1, ('K', 'C'): 0, ('K', 'D'): 1, ('K', 'G'): 1, ('K', 'H'): 1, ('K', 'K'): 1, ('K', 'M'): 0, ('K', 'N'): 1, ('K', 'R'): 1, ('K', 'S'): 1, ('K', 'T'): 1, ('K', 'V'): 1, ('K', 'W'): 1, ('K', 'Y'): 1, ('M', '-'): 0, ('M', '.'): 0, ('M', 'A'): 1, ('M', 'B'): 1, ('M', 'C'): 1, ('M', 'D'): 1, ('M', 'G'): 0, ('M', 'H'): 1, ('M', 'K'): 0, ('M', 'M'): 1, ('M', 'N'): 1, ('M', 'R'): 1, ('M', 'S'): 1, ('M', 'T'): 0, ('M', 'V'): 1, ('M', 'W'): 1, ('M', 'Y'): 1, ('N', '-'): 0, ('N', '.'): 0, ('N', 'A'): 1, ('N', 'B'): 1, ('N', 'C'): 1, ('N', 'D'): 1, ('N', 'G'): 1, ('N', 'H'): 1, ('N', 'K'): 1, ('N', 'M'): 1, ('N', 'N'): 1, ('N', 'R'): 1, ('N', 'S'): 1, ('N', 'T'): 1, ('N', 'V'): 1, ('N', 'W'): 1, ('N', 'Y'): 1, ('R', '-'): 0, ('R', '.'): 0, ('R', 'A'): 1, ('R', 'B'): 1, ('R', 'C'): 0, ('R', 'D'): 1, ('R', 'G'): 1, ('R', 'H'): 1, ('R', 'K'): 1, ('R', 'M'): 1, ('R', 'N'): 1, ('R', 'R'): 1, ('R', 'S'): 1, ('R', 'T'): 0, ('R', 'V'): 1, ('R', 'W'): 1, ('R', 'Y'): 0, ('S', '-'): 0, ('S', '.'): 0, ('S', 'A'): 0, ('S', 'B'): 1, ('S', 'C'): 1, ('S', 'D'): 1, ('S', 'G'): 1, ('S', 'H'): 1, ('S', 'K'): 1, ('S', 'M'): 1, ('S', 'N'): 1, ('S', 'R'): 1, ('S', 'S'): 1, ('S', 'T'): 0, ('S', 'V'): 1, ('S', 'W'): 0, ('S', 'Y'): 1, ('T', '-'): 0, ('T', '.'): 0, ('T', 'A'): 0, ('T', 'B'): 1, ('T', 'C'): 0, ('T', 'D'): 1, ('T', 'G'): 0, ('T', 'H'): 1, ('T', 'K'): 1, ('T', 'M'): 0, ('T', 'N'): 1, ('T', 'R'): 0, ('T', 'S'): 0, ('T', 'T'): 1, ('T', 'V'): 0, ('T', 'W'): 1, ('T', 'Y'): 1, ('V', '-'): 0, ('V', '.'): 0, ('V', 'A'): 1, ('V', 'B'): 1, ('V', 'C'): 1, ('V', 'D'): 1, ('V', 'G'): 1, ('V', 'H'): 1, ('V', 'K'): 1, ('V', 'M'): 1, ('V', 'N'): 1, ('V', 'R'): 1, ('V', 'S'): 1, ('V', 'T'): 0, ('V', 'V'): 1, ('V', 'W'): 1, ('V', 'Y'): 1, ('W', '-'): 0, ('W', '.'): 0, ('W', 'A'): 1, ('W', 'B'): 1, ('W', 'C'): 0, ('W', 'D'): 1, ('W', 'G'): 0, ('W', 'H'): 1, ('W', 'K'): 1, ('W', 'M'): 1, ('W', 'N'): 1, ('W', 'R'): 1, ('W', 'S'): 0, ('W', 'T'): 1, ('W', 'V'): 1, ('W', 'W'): 1, ('W', 'Y'): 1, ('Y', '-'): 0, ('Y', '.'): 0, ('Y', 'A'): 0, ('Y', 'B'): 1, ('Y', 'C'): 1, ('Y', 'D'): 1, ('Y', 'G'): 0, ('Y', 'H'): 1, ('Y', 'K'): 1, ('Y', 'M'): 1, ('Y', 'N'): 1, ('Y', 'R'): 0, ('Y', 'S'): 1, ('Y', 'T'): 1, ('Y', 'V'): 1, ('Y', 'W'): 1, ('Y', 'Y'): 1})¶ Determine the error rate for a pair of sequences
Parameters: - seq1 – SeqRecord object
- seq2 – SeqRecord object
- ignore_chars – Set of characters to ignore when scoring and counting the weight
- score_dict – Optional dictionary of alignment scores
Returns: Tuple of the (score, minimum weight, error rate) for the pair of sequences
Return type: Tuple
-
presto.Sequence.
sequentialAssembly
(head_seq, tail_seq, ref_dict, ref_db, alpha=1e-05, max_error=0.3, min_len=8, max_len=1000, scan_reverse=False, min_ident=0.5, evalue=1e-05, max_hits=100, fill=False, aligner='usearch', aligner_exec='usearch', assembly_stats=None, score_dict={('-', '-'): 0, ('-', '.'): 0, ('-', 'A'): 0, ('-', 'B'): 0, ('-', 'C'): 0, ('-', 'D'): 0, ('-', 'G'): 0, ('-', 'H'): 0, ('-', 'K'): 0, ('-', 'M'): 0, ('-', 'N'): 0, ('-', 'R'): 0, ('-', 'S'): 0, ('-', 'T'): 0, ('-', 'V'): 0, ('-', 'W'): 0, ('-', 'Y'): 0, ('.', '-'): 0, ('.', '.'): 0, ('.', 'A'): 0, ('.', 'B'): 0, ('.', 'C'): 0, ('.', 'D'): 0, ('.', 'G'): 0, ('.', 'H'): 0, ('.', 'K'): 0, ('.', 'M'): 0, ('.', 'N'): 0, ('.', 'R'): 0, ('.', 'S'): 0, ('.', 'T'): 0, ('.', 'V'): 0, ('.', 'W'): 0, ('.', 'Y'): 0, ('A', '-'): 0, ('A', '.'): 0, ('A', 'A'): 1, ('A', 'B'): 0, ('A', 'C'): 0, ('A', 'D'): 1, ('A', 'G'): 0, ('A', 'H'): 1, ('A', 'K'): 0, ('A', 'M'): 1, ('A', 'N'): 1, ('A', 'R'): 1, ('A', 'S'): 0, ('A', 'T'): 0, ('A', 'V'): 1, ('A', 'W'): 1, ('A', 'Y'): 0, ('B', '-'): 0, ('B', '.'): 0, ('B', 'A'): 0, ('B', 'B'): 1, ('B', 'C'): 1, ('B', 'D'): 1, ('B', 'G'): 1, ('B', 'H'): 1, ('B', 'K'): 1, ('B', 'M'): 1, ('B', 'N'): 1, ('B', 'R'): 1, ('B', 'S'): 1, ('B', 'T'): 1, ('B', 'V'): 1, ('B', 'W'): 1, ('B', 'Y'): 1, ('C', '-'): 0, ('C', '.'): 0, ('C', 'A'): 0, ('C', 'B'): 1, ('C', 'C'): 1, ('C', 'D'): 0, ('C', 'G'): 0, ('C', 'H'): 1, ('C', 'K'): 0, ('C', 'M'): 1, ('C', 'N'): 1, ('C', 'R'): 0, ('C', 'S'): 1, ('C', 'T'): 0, ('C', 'V'): 1, ('C', 'W'): 0, ('C', 'Y'): 1, ('D', '-'): 0, ('D', '.'): 0, ('D', 'A'): 1, ('D', 'B'): 1, ('D', 'C'): 0, ('D', 'D'): 1, ('D', 'G'): 1, ('D', 'H'): 1, ('D', 'K'): 1, ('D', 'M'): 1, ('D', 'N'): 1, ('D', 'R'): 1, ('D', 'S'): 1, ('D', 'T'): 1, ('D', 'V'): 1, ('D', 'W'): 1, ('D', 'Y'): 1, ('G', '-'): 0, ('G', '.'): 0, ('G', 'A'): 0, ('G', 'B'): 1, ('G', 'C'): 0, ('G', 'D'): 1, ('G', 'G'): 1, ('G', 'H'): 0, ('G', 'K'): 1, ('G', 'M'): 0, ('G', 'N'): 1, ('G', 'R'): 1, ('G', 'S'): 1, ('G', 'T'): 0, ('G', 'V'): 1, ('G', 'W'): 0, ('G', 'Y'): 0, ('H', '-'): 0, ('H', '.'): 0, ('H', 'A'): 1, ('H', 'B'): 1, ('H', 'C'): 1, ('H', 'D'): 1, ('H', 'G'): 0, ('H', 'H'): 1, ('H', 'K'): 1, ('H', 'M'): 1, ('H', 'N'): 1, ('H', 'R'): 1, ('H', 'S'): 1, ('H', 'T'): 1, ('H', 'V'): 1, ('H', 'W'): 1, ('H', 'Y'): 1, ('K', '-'): 0, ('K', '.'): 0, ('K', 'A'): 0, ('K', 'B'): 1, ('K', 'C'): 0, ('K', 'D'): 1, ('K', 'G'): 1, ('K', 'H'): 1, ('K', 'K'): 1, ('K', 'M'): 0, ('K', 'N'): 1, ('K', 'R'): 1, ('K', 'S'): 1, ('K', 'T'): 1, ('K', 'V'): 1, ('K', 'W'): 1, ('K', 'Y'): 1, ('M', '-'): 0, ('M', '.'): 0, ('M', 'A'): 1, ('M', 'B'): 1, ('M', 'C'): 1, ('M', 'D'): 1, ('M', 'G'): 0, ('M', 'H'): 1, ('M', 'K'): 0, ('M', 'M'): 1, ('M', 'N'): 1, ('M', 'R'): 1, ('M', 'S'): 1, ('M', 'T'): 0, ('M', 'V'): 1, ('M', 'W'): 1, ('M', 'Y'): 1, ('N', '-'): 1, ('N', '.'): 1, ('N', 'A'): 1, ('N', 'B'): 1, ('N', 'C'): 1, ('N', 'D'): 1, ('N', 'G'): 1, ('N', 'H'): 1, ('N', 'K'): 1, ('N', 'M'): 1, ('N', 'N'): 1, ('N', 'R'): 1, ('N', 'S'): 1, ('N', 'T'): 1, ('N', 'V'): 1, ('N', 'W'): 1, ('N', 'Y'): 1, ('R', '-'): 0, ('R', '.'): 0, ('R', 'A'): 1, ('R', 'B'): 1, ('R', 'C'): 0, ('R', 'D'): 1, ('R', 'G'): 1, ('R', 'H'): 1, ('R', 'K'): 1, ('R', 'M'): 1, ('R', 'N'): 1, ('R', 'R'): 1, ('R', 'S'): 1, ('R', 'T'): 0, ('R', 'V'): 1, ('R', 'W'): 1, ('R', 'Y'): 0, ('S', '-'): 0, ('S', '.'): 0, ('S', 'A'): 0, ('S', 'B'): 1, ('S', 'C'): 1, ('S', 'D'): 1, ('S', 'G'): 1, ('S', 'H'): 1, ('S', 'K'): 1, ('S', 'M'): 1, ('S', 'N'): 1, ('S', 'R'): 1, ('S', 'S'): 1, ('S', 'T'): 0, ('S', 'V'): 1, ('S', 'W'): 0, ('S', 'Y'): 1, ('T', '-'): 0, ('T', '.'): 0, ('T', 'A'): 0, ('T', 'B'): 1, ('T', 'C'): 0, ('T', 'D'): 1, ('T', 'G'): 0, ('T', 'H'): 1, ('T', 'K'): 1, ('T', 'M'): 0, ('T', 'N'): 1, ('T', 'R'): 0, ('T', 'S'): 0, ('T', 'T'): 1, ('T', 'V'): 0, ('T', 'W'): 1, ('T', 'Y'): 1, ('V', '-'): 0, ('V', '.'): 0, ('V', 'A'): 1, ('V', 'B'): 1, ('V', 'C'): 1, ('V', 'D'): 1, ('V', 'G'): 1, ('V', 'H'): 1, ('V', 'K'): 1, ('V', 'M'): 1, ('V', 'N'): 1, ('V', 'R'): 1, ('V', 'S'): 1, ('V', 'T'): 0, ('V', 'V'): 1, ('V', 'W'): 1, ('V', 'Y'): 1, ('W', '-'): 0, ('W', '.'): 0, ('W', 'A'): 1, ('W', 'B'): 1, ('W', 'C'): 0, ('W', 'D'): 1, ('W', 'G'): 0, ('W', 'H'): 1, ('W', 'K'): 1, ('W', 'M'): 1, ('W', 'N'): 1, ('W', 'R'): 1, ('W', 'S'): 0, ('W', 'T'): 1, ('W', 'V'): 1, ('W', 'W'): 1, ('W', 'Y'): 1, ('Y', '-'): 0, ('Y', '.'): 0, ('Y', 'A'): 0, ('Y', 'B'): 1, ('Y', 'C'): 1, ('Y', 'D'): 1, ('Y', 'G'): 0, ('Y', 'H'): 1, ('Y', 'K'): 1, ('Y', 'M'): 1, ('Y', 'N'): 1, ('Y', 'R'): 0, ('Y', 'S'): 1, ('Y', 'T'): 1, ('Y', 'V'): 1, ('Y', 'W'): 1, ('Y', 'Y'): 1})¶ Stitches sequences together by first attempting de novo assembly then falling back to reference guided assembly
Parameters: - head_seq – the head SeqRecord
- head_seq – the tail SeqRecord
- ref_dict – a dictionary of reference SeqRecord objects
- ref_db – the path and name of the reference database
- alpha – the minimum p-value for a valid de novo assembly
- max_error – the maximum error rate for a valid de novo assembly
- min_len – minimum length of overlap to test for de novo assembly
- max_len – maximum length of overlap to test for de novo assembly
- scan_reverse – if True allow the head sequence to overhang the end of the tail sequence in de novo assembly if False end alignment scan at end of tail sequence or start of head sequence
- min_ident – the minimum identity for a valid reference guided assembly
- evalue – the E-value cut-off for reference guided assembly
- max_hits – the maxhits output limit for reference guided assembly
- fill – if False non-overlapping regions will be assigned Ns in reference guided assembly; if True non-overlapping regions will be filled with the reference sequence.
- aligner – the alignment tool; one of ‘blastn’ or ‘usearch’
- aligner_exec – the path to the alignment tool executable
- assembly_stats – optional successes by trials numpy.array of p-values
- score_dict – optional dictionary of character scores in the form {(char1, char2): score}.
Returns: assembled sequence object.
Return type:
-
presto.Sequence.
subsetSeqIndex
(seq_dict, field, values, delimiter=('|', '=', ', '))¶ Subsets a sequence set by annotation value
Parameters: - seq_dict – Dictionary index of sequences returned from SeqIO.index()
- field – Annotation field to select keys by
- values – List of annotation values that define the retained keys
- delimiter – Tuple of delimiters for (annotations, field/values, value lists)
Returns: List of keys
Return type:
-
presto.Sequence.
subsetSeqSet
(seq_iter, field, values, delimiter=('|', '=', ', '))¶ Subsets a sequence set by annotation value
Parameters: - seq_iter – Iterator or list of SeqRecord objects
- field – Annotation field to select by
- values – List of annotation values that define the retained sequences
- delimiter – Tuple of delimiters for (annotations, field/values, value lists)
Returns: Modified list of SeqRecord objects
Return type:
-
presto.Sequence.
translateAmbigDNA
(key)¶ Translates IUPAC Ambiguous Nucleotide characters to or from character sets
Parameters: key – String or re.search object containing the character set to translate Returns: Character translation Return type: str
-
presto.Sequence.
trimQuality
(data, min_qual=0, window=10, reverse=False)¶ Cuts sequences using a moving mean quality score
Parameters: Returns: SeqResult object.
Return type:
Release Notes¶
Version 0.5.13: August 29, 2019¶
- Fixed
.ix
pandas deprecation warning.
Version 0.5.12: August 5, 2019¶
- Increased pandas version requirement to v0.24+ with compatibility fixes for pandas v0.24.0.
Version 0.5.11: January 30, 2019¶
- Slightly changed version number display in commandline help.
ClusterSets:
- Removed the
--log
and--failed
arguments from theall
andbarcode
subcommands because they do nothing. - Added the
--length
argument to all subcommands which defines the ratio of minimum sequence lengths allowed in a cluster. - Added error handling for when
--ident
is less than recommended value. - Increased maximum memory allocation for cd-hit-est to 3GB.
Version 0.5.10: October 19, 2018¶
- Documentation added for UMI error correction.
- Added
IO.printDebug
to the API.
EstimateError:
- Added the subcommands
set
andbarcode
whereset
is the previous error estimation method (using UMI read groups). The newbarcode
method generates pairwise Hamming distance distributions for sequences containied in header annotations (typically UMI barcode sequences). - Added
distance
andthreshold
output files containing pairwise distance histograms and clustering thresholds for use with the corresponding subcommands in ClusterSets. - Increased default minimum sequence count in the
set
subcommand to 20.
Version 0.5.9: September 2, 2018¶
- Add the
-o
argument to most tools, which allows explicit declaration of the output file name. - Added
IO.printWarning
andIO.printError
API functions for handling standard error messaging. - Split
IO.printProgress
API function intoIO.printProgress
(percentage) andIO.printCount
(raw counts). - Moved a significant number of functions and classes from the executable scripts into the API.
MaskPrimers:
- Removed support for the regex primer file format.
AssemblePairs:
- Changed default of the
--rc
argument totail
and addednone
option for previous default.
Version 0.5.8: July 13, 2018¶
- Fixed installation incompatibility with pip 10.
ClusterSets:
- Added support for CD-HIT.
EstimateError:
- Significant performance improvements.
Version 0.5.7: March 19, 2018¶
BuildConsensus:
- Fixed an error wherein the program would exit if all sequences in an UMI read group had a Phred quality score of 0 in a given position.
ConvertHeaders:
- Added support for EMBL-EBI ENA header format to the sra subcommand.
MaskPrimers:
- Added extract subcommand which will remove/annotation subsequences in fixed position without requiring a primer sequence match.
- Added
--pf
and--bf
arguments to all subcommands allowing renaming of outputPRIMER
andBARCODE
fields, respectively. - Removed
SEQORIENT
output field from score subcommand as the mode does not check the reverse complement.
PairSeq:
- Added the
--act
argument to provide a mechanism for collapsing values of duplicate fields copied across files.
ParseHeaders:
- Added merge subcommand to combined separate annotations into a single entry.
Version 0.5.6: January 17, 2018¶
CollapseSeq:
- Fixed a bug causing copy fields (
--cf
argument) to be processed incorrectly.
UnifyHeaders:
- Improved reporting in log file.
Version 0.5.5: December 26, 2017¶
AssemblePairs:
- Fixed a bug that caused the
align
subcommand to error if input sequences where shorter than the minimum specified by the –minlen argument. It will now simply fail such sequences.
ClusterSets:
- Moved functionality of previous ClusterSets command into the
set
subcommand. - Added the
all
subcommand to cluster all sequences without considering annotation groups. - Added the
barcode
subcommand which allows for clustering of reads based on a barcode sequence instead of the read data. - Renamed
-id
argument to--ident
for consistency with AssemblePairs.
CollapseSeq:
- Fixed a bug wherein CollapseSeq would match partial sequences against longer sequences that were otherwise identical up until the missing end characters.
- Added detailed log output.
EstimateError:
- Fixed a division by zero warning in the log output when there were no observed mismatches.
UnifyHeaders:
- New tool to generate consensus annotations or filter reads based on annotation groupings.
Version 0.5.4: July 1, 2017¶
- All tools will now print detailed help if no arguments are provided.
AlignSets:
- Fixed a typo in the console log of the muscle subcommand.
ConvertHeaders:
- Added the
migec
subcommand to convert headers from the MIGEC tool.
EstimateError:
- Fixed a division by zero error when there were no observed mismatches.
- Bounded error rate to a minimum of 10^-9 (Q=90).
Version 0.5.3: February 14, 2017¶
License changed to Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).
AssemblePairs:
- Changed the behavior of the
--failed
argument so that failed output are in the same orientation as the input sequences. Meaning, the--rc
argument is ignored for failed output. - Added the sequential subcommand which will first attempt de novo assembly (align subcommand) following by reference guided assembly (reference subcommand) if de novo assembly fails.
- Added blastn compatibility to reference subcommand.
- Added the option
--aligner
to the reference subcommand to allow use of either blastn or usearch for performing the local alignment. Defaults to the usearch algorithm used in previous releases. - Added the option
--dbexec
to the reference subcommand to allow specification of the reference database build tool (eg, makeblastdb). - Changed masking behavior to
none
and word length to9
in reference subcommand when using usearch as the aligner. - Internal modifications to the reference subcommand to rebuild the database before alignments for performance reasons.
- Fixed a deprecation warning appearing with newer versions of numpy.
BuildConsensus:
- Fixed a bug in the read group error rate calculation wherein either a consensus sequence or read group that was completely N characters would cause the program to exit with a division by zero error. Now, such non-informative read groups will be assigned an error rate of 1.0.
ClusterSets:
- Added vsearch compatibility.
- Fixed a bug wherein sets containing empty sequences were being fed to usearch, rather than automatically failed, which would cause usearch v8 to hang indefinitely.
- Fixed an incompatibility with usearch v9 due to changes in the way usearch outputs sequence labels.
- Changed masking behavior of usearch to
none
. - Changed how gaps are handling before passing sequences to usearch. Gaps are now masked (with Ns) for clustering, instead of removed.
EstimateError:
- Fixed a fatal error with newer versions of pandas.
SplitSeq:
- Added the select subcommand, which allows filtering of sequences based on annotation value matches or mismatches.
- Altered the behavior of the
-u
argument for both the sample and samplepair subcommands. If-u
is specified, sampling is performed as in previous versions wherein samples will be drawn from only fields with the specified annotation values up ton
total reads. However, if-u
is not specified with-f
repeated sampling will now be performed for each unique annotation value in the specified field, generating output with up ton
reads per unique annotation value.
Version 0.5.2: March 8, 2016¶
Fixed a bug with installation on Windows due to old file paths lingering in presto.egg-info/SOURCES.txt.
Improvements to commandline usage help messages.
Updated license from CC BY-NC-SA 3.0 to CC BY-NC-SA 4.0.
AssemblePairs:
- Added the flag
--fill
to the reference subcommand to allow insertion of the reference sequence into the non-overlapping region of assembled sequences. Use caution when using this flag, as this may lead to chimeric sequences. - Changed default
--minlen
to 8 in align subcommand.
Version 0.5.0: September 7, 2015¶
Conversion to a proper Python package which uses pip and setuptools for installation.
The package now requires Python 3.4. Python 2.7 is not longer supported.
The required dependency versions have been bumped to numpy 1.8, scipy 0.14, pandas 0.15, and biopython 1.65.
IgCore:
- Divided IgCore functionality into the separate modules: Annotation, Commandline, Defaults, IO, Multiprocessing and Sequence.
Version 0.4.8: September 7, 2015¶
Added support for additional input FASTA (.fna, .fa), FASTQ (.fq) and tab-delimited (.tsv) file extensions.
ParseHeaders:
- Fixed a bug in the rename subcommand wherein renaming to an existing field deleted the old annotation, but did not merge the renamed annotation into the existing field.
- Added the copy subcommand which will copy annotations into new field names or merge the annotations of existing fields.
- Added the
--act
argument to the copy and rename subcommands allowing collapse following the copy or rename operation. - Added a commandline check to ensure that the
-f
,-k
and--act
arguments contain the same number of fields for both the rename and copy subcommands.
Version 0.4.7: June 5, 2015¶
IgCore:
- Modified scoring functions to permit asymmetrical scores for N and gap characters.
AssemblePairs:
- Added support for SRA style coordinate information where the where the read number has been appended to the spot number.
- Altered scoring so gap characters are counted as mismatches in the error rate and identity calculations.
BuildConsensus:
- Altered scoring so gap characters are counted as mismatches in the diversity and error rate calculations.
ConvertHeaders:
- Added support for SRA style sequence headers where the read number has been
appended to the spot number; eg, output from
fastq-dump -I --split-files file.sra
.
ClusterSets:
- Added missing OUTPUT console log field.
- Changed
--bf
and--cf
arguments to-f
and-k
, respectively.
MaskPrimers:
- Altering scoring behavior for N characters such that Ns in the input sequence are always counted as a mismatch, while Ns in the primer sequence are counted as a match, with priority given to the input sequence score.
- Added
--gap
argument to the align subcommand which allows users to specify the gap open and gap extension penalties for aligning primers. Note: gap penalties reduce the match count for purposes of calculating ERROR.
PairSeq:
- Added support for SRA style coordinate information where the where the read number has been appended to the spot number.
Version 0.4.6: May 13, 2015¶
BuildConsensus:
- Changed
--maxmiss
argument to--maxgap
and altered the behavior to only perform deletion of positions based on gap characters (only “-” or “.” and not “N” characters). - Added an error rate (
--maxerror
) calculation based on mismatches from consensus. The--maxerror
argument is mutually exclusive with the--maxdiv
argument and provides similar functionality. However, the calculations are not equivalent, and--maxerror
should be considerably faster than--maxdiv
. - Added exclusion of positions from the error rate calculation that are deleted
due to exceeding the
--maxgap
threshold . - Fixed misalignment of consensus sequence against input sequences when
positions are deleted due to exceeding the
--maxgap
threshold.
ClusterSets:
- New script to cluster read groups by barcode field (eg, UID barcodes) into clustering within the read group.
ConvertHeaders:
- New script to handle conversion of different sequence description formats to the pRESTO format.
FilterSeq:
- Added count of masked characters to log output of maskqual subcommand.
- Changed repeats subcommand log field REPEAT to REPEATS.
PairSeq:
- Changed
-f
argument to--1f
argument. - Added
--2f
argument to copy file 2 annotations to file 1.
ParseHeaders:
- Moved convert subcommand to the generic subcommand of the new ConvertHeaders script and modified the conversion behavior.
Version 0.4.5: March 20, 2015¶
Added details to the usage documentation for each tool which describes both the output files and annotation fields.
Renamed --clean
argument to --failed
argument with opposite behavior,
such that the default behavior of all scripts is now clean output.
IgCore:
- Features added for Change-O compatibility.
- Features added for PairSeq performance improvements.
- Added custom help formatter.
- Modifications to internals of multiprocessing code.
- Fixed a few typos in error messages.
AssemblePairs:
- Added reference subcommand which uses V-region germline alignments from ublast to assemble paired-ends.
- Removed mate-pair matching operation to increase performance. Now requires both input files to contain matched and uniformly ordered reads. If files are not synchronized then PairSeq must be run first. AssemblePairs will check that coordinate info matches and error if the files are not synchronized. Unpaired reads are no longer output.
- Added support for cases where one mate pair is the subsequence of the other.
- Added
--scanrev
argument to allow for head sequence to overhang end of tail. - Removed truncated (quick) error calculation in align subcommand.
- Changed default values of the
--maxerror
and--alpha
arguments of the align subcommand to better tuned parameters. - Changed internal selection of top scoring alignment to use Z-score approximation rather than a combination of error rate and binomial mid-p value.
- Internal changes to multiprocessing structure.
- Changed inserted gap character from - to N in join subcommand for better compatibility with the behavior of IMGT/HighV-QUEST.
- Changed PVAL log field to PVALUE.
- Changed HEADSEQ and TAILSEQ log fields to SEQ1 and SEQ2.
- Changed HEADFIELDS and TAILFIELDS log fields to FIELDS1 and FIELDS2.
- Changed precision of ERROR and PVALUE log fields.
- Added more verbose logging.
BuildConsensus:
- Fixed bug where low quality positions where not being masked in single sequence barcode groups.
- Added copy field (
--cf
) and copy action (--act
) arguments to generate consensus annotations for barcode read groups. - Changed maximum consensus quality score from 93 to 90.
CollapseSeq:
- Added
--keep
argument to allow retention of sequences with high missing character counts in unique sequence output file. - Removed case insensitivity for performance reasons. Now requires all sequences to have matching case.
- Removed
first
andlast
from--act
choices to avoid unexpected behavior.
MaskPrimers:
- Changed behavior of N characters in primer identification. Ns now count as a match against any character, rather than a mismatch.
- Changed behavior of mask mode such that positions masked with Ns are now assigned quality scores of 0, rather than retaining their previous scores.
- Fixed a bug with the align subcommand where deletions within the input sequence (gaps in the alignment) were causing an incorrect barcode start position.
PairSeq:
- Performance improvements. The tool should now be considerably faster on very large files.
- Specifying the
--failed
argument to request output of sequences which do not have a mate pair will increase run time and memory usage.
ParseHeaders:
- Add ‘cat’ action to collapse subcommand which concatenates strings into a single annotation.
SplitSeq:
- Removed
--clean
(and--failed
) flag from all subcommands. - Added progress updates to sample and samplepair subcommands.
- Performance improvements to samplepair subcommand.
Version 0.4.4: June 10, 2014¶
SplitSeq:
- Removed a linux-specific dependency, allowing SplitSeq to work on Windows.
Version 0.4.3: April 7, 2014¶
CollapseSeq:
- Fixed bug that occurs with Python 2.7.5 on OS X.
SplitSeq:
- Fixed bug in samplepairs subcommand that occurs with Python 2.7.5 on OS X.
Version 0.4.2: March 20, 2014¶
Increased verbosity of exception reporting.
IgCore:
- Updates to consensus functions to support changes to BuildConsensus.
AssemblePairs:
- Set default alpha to 0.01.
BuildConsensus:
- Added support for
--freq value
parameter to quality consensus method and set default value to 0.6. - Fixed a bug in the frequency consensus method where missing values were contributing to the total character count at each position.
- Added the parameter
--maxmiss value
which provides a cut-off for removal of positions with too many N or gap characters .
MaskPrimers:
- Renamed the
--reverse
parameter to--revpr
.
SplitSeq:
- Removed convert subcommand.
Version 0.4.1: January 27, 2014¶
Changes to the internals of multiple tools to provide support for multiprocessing in Windows environments.
Changes to the internals of multiple tools to provide clean exit of child processes upon kill signal or exception in sibling process.
Fixed unexpected behavior of --outname
and --log
arguments with
multiple input files.
IgCore:
- Added reporting of unknown exceptions when reading sequence files
- Fixed scoring of lowercase sequences.
AlignSets:
- Fixed a typo in the log output.
BuildConsensus:
- Fixed a typo in the log output.
EstimateError:
- Fixed bug where tool would improperly exit if no sets passed threshold criteria.
- Fixed typo in console output.
MaskPrimers:
- Added
trim
mode which will cut the region before primer alignment, but leave primer region unmodified. - Fixed a bug with lowercase sequence data.
- Fixed bug in the console and log output.
- Added support for primer matching when setting
--maxerr 1.0
.
ParseHeaders:
- Added count of sequences without any valid fields (FAIL) to console output.
ParseLog:
- Added count of records without any valid fields (FAIL) to console output.
SplitSeq:
- Fixed typo in console output of samplepair subcommand.
- Added increase of the open file limit to the group subcommand to allow for a large number of groups.
Version 0.4.0: September 30, 2013¶
Minor name changes were made to multiple scripts, functions, parameters, and output files.
AlignSets, AssemblePairs, BuildConsensus, EstimateError, FilterSeq, and
MaskPrimers are now multithreaded. The number of simultaneous processes
may be specified using --nproc value
. Note this means file ordering
is no longer preserved between the input and output sequence files.
Performance improvements were made to several tools.
The universal --verbose
parameter was replaced with --log file_name
which specifies a log file for verbose output, and disables verbose logging
if not specified.
The report of input parameters and sequence counts is now separate from the log and is always printed to standard output.
Added a progress bar to the standard output of most tools.
Added a universal --outname file_prefix
parameter which changes the leading
portion of the output file name. If not specified, the current file name
is used (excluding the file extension, as per the previous behavior).
Added a universal --clean
parameter which if specified forces the tool
not to create an output file of sequences which failed processing.
IgCore:
- Changes to parameters and internals of multiple functions.
- Added functions to support multithreading for single-end reads, paired-end reads, and barcode sets.
- Added safe annotation field renaming.
- Added progress bar, logging and output file name conversion support.
- Moved reusable AssemblePairs, BuildConsensus, PairSeq, and SplitSeq. operations into IgCore.
AssemblePairs:
- Coordinate information is now specified by a coordinate type, rather than a
delimiter, using the
--coord header_type
parameter, where the header type may be one ofillumina
,solexa
,sra
,454
,presto
.
CollapseSeq:
- Sequences with a missing character count exceeding the user limit defined
by
-n maximum_missing_count
are now exported to a separatecollapse-undetermined
output file, rather than included in thecollapse-unique
sequence output.
EstimateError:
- Now outputs error estimations for positions, quality scores, nucleotide pairs, and annotation sets.
- Machine reported quality scores and empirical quality scores have been added to all output tables.
FilterSeq:
- Added
length
subcommand to filter sequences by minimum length.
PairSeq:
- Coordinate information has been redefined as per AssemblePairs.
ParseHeaders:
- Added new subcommand
convert
which attempts to reformat sequence headers into the pRESTO format. - The
rename
subcommand will now append entries if the new field name already exists in the sequence header, rather than replace the entry.
Version 0.3 (prerelease 6): August 13, 2013¶
Toolkit is now dependent upon pandas 0.12 for the estimateError tool.
alignSets:
- Changed MUSCLE execution to faster settings (
-diags
,-maxiters 2
).
filterQuality:
- Added
repeat
subcommand to filter sequences with-n (value)
repetitions of a single character and. - Changed
-n
parameter ofambig
subcommand from fractional value to a raw count.
estimateError:
- New tool which estimates error of sequence sets by comparison to a consensus.
maskPrimers:
- Bug fixes to alignment position calculation of
align
subcommand when primer alignment begins before start of sequence. - Removed
--ann
parameter.
Version 0.3 (prerelease 5): August 7, 2013¶
License changed to Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
IgPipeline Core:
- Bug fixes to diversity calculation.
- Added support for files where all sequences do not share the same annotation fields.
- Added support for alternate scoring of gap and N-valued nucleotides.
alignSets:
- Added
--mode
parameter with options ofpad
andcut
to specify whether to extend or trim read groups to the same start position. - Fixed intermittent ‘muscle’ subcommand stdout pipe deadlock when executing MUSCLE.
assemblePairs:
- Added
join
subcommand to support library preps where paired-end reads do not overlap. - Speed improvements to p-value calculations.
buildConsensus:
--div
parameter converted to--maxdiv value
to allow filtering of read groups by diversity.- Bug fixes to nucleotide frequency consensus method.
-q
parameter renamed to--qual
.
collapseSequences:
- Added support for files where all sequences do not share the same annotation fields.
splitSeqFile:
samplepair
subcommand added to allow random sampling from paired-end file sets.- The behavior of the
-c
parameter of thesample
andsamplepair
subcommands changed to allow multiple samplings with the same command.
Version 0.3 (prerelease 4): May 18, 2013¶
Initial public prerelease
Contact¶
If you have questions you can email the Immcantation Group.
If you’ve discovered a bug or have a feature request, you can create an issue on Bitbucket using the Issue Tracker.
Citation¶
To cite pRESTO in publications please use:
Vander Heiden JA*, Yaari G*, Uduman M, Stern JNH, O’Connor KC, Hafler DA, Vigneault F, Kleinstein SH. pRESTO: a toolkit for processing high-throughput sequencing raw reads of lymphocyte receptor repertoires. Bioinformatics 2014; doi: 10.1093/bioinformatics/btu138
License¶
This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.