pRESTO - The REpertoire Sequencing TOolkit

pRESTO is a toolkit for processing raw reads from high-throughput sequencing of B cell and T cell repertoires.

Dramatic improvements in high-throughput sequencing technologies now enable large-scale characterization of lymphocyte repertoires, defined as the collection of trans-membrane antigen-receptor proteins located on the surface of B cells and T cells. The REpertoire Sequencing TOolkit (pRESTO) is composed of a suite of utilities to handle all stages of sequence processing prior to germline segment assignment. pRESTO is designed to handle either single reads or paired-end reads. It includes features for quality control, primer masking, annotation of reads with sequence embedded barcodes, generation of unique molecular identifier (UMI) consensus sequences, assembly of paired-end reads and identification of duplicate sequences. Numerous options for sequence sorting, sampling and conversion operations are also included.

Overview

Scope and Features

pRESTO performs all stages of raw sequence processing prior to alignment against reference germline sequences. The toolkit is intended to be easy to use, but some familiarity with commandline applications is expected. Rather than providing a fixed solution to a small number of common workflows, we have designed pRESTO to be as flexible as possible. This design philosophy makes pRESTO suitable for many existing protocols and adaptable to future technologies, but requires users to construct a sequence of commands and options specific to their experimental protocol.

pRESTO is composed of a set of standalone tools to perform specific tasks, often with a series of subcommands providing different behaviors. A brief description of each tool is shown in the table below.

Tool Subcommand Description
AlignSets   Multiple aligns sets of sequences sharing the same annotation
  muscle Uses the program MUSCLE to align reads
  offset Uses a table of primer alignments to align the 5’ region
  table Creates a table of primer alignments for the offset subcommand
AssemblePairs   Assembles paired-end reads into a complete sequence
  align Assembles paired-end reads by aligning the sequence ends
  join Concatenates pair-end reads with intervening gaps
  reference Assembles paired-end reads using V-segment references
  sequential Attempt alignment assembly followed by reference assembly
BuildConsensus   Constructs UMI consensus sequences
ClusterSets   Clusters read groups
  all Cluster all sequences regardless of annotation
  barcode Cluster reads by clustering barcode sequences
  set Cluster reads by sequence data within barcode groups
CollapseSeq   Removes duplicate sequences
ConvertHeaders   Converts sequence headers to the pRESTO format
  generic Converts sequence headers with an unknown annotation system
  454 Converts Roche 454 sequence headers
  genbank Converts NCBI GenBank and RefSeq sequence headers
  illumina Converts Illumina sequence headers
  imgt Converts sequence headers output by IMGT/GENE-DB
  migec Converts sequence headers output by MIGEC
  sra Converts NCBI SRA or EMBL-EBI ENA sequence headers
EstimateError   Estimates error rates for UMI data
FilterSeq   Removes or modifies low quality reads
  length Removes sequences under a defined length
  maskqual Masks low Phred quality score positions with Ns
  missing Removes sequences with a high number of Ns
  repeats Removes sequences with long repeats of a single nucleotide
  quality Removes sequences with low Phred quality scores
  trimqual Trims sequences to segments with high Phred quality scores
MaskPrimers   Identifies and removes primer regions, MIDs and UMI barcodes
  align Matches primers by local alignment and reorients sequences
  score Matches primers at a fixed user-defined start position
PairSeq   Sorts paired-end reads and copies annotations between them
ParseHeaders   Manipulates sequence annotations
  add Adds a field and value annotation pair to all reads
  collapse Compresses a set of annotation fields into a single field
  copy Copies values between annotations fields
  delete Deletes an annotation from all reads
  expand Expands an field with multiple values into separate annotations
  merge Merge multiple annotations fields into a single field
  rename Rename annotation fields
  table Outputs sequence annotations as a data table
ParseLog   Converts the log output of pRESTO scripts into data tables
SplitSeq   Performs conversion, sorting, and subsetting of sequence files
  count Splits files into smaller files
  group Splits files based on numerical or categorical annotation
  sample Randomly samples sequences from a file
  samplepair Randomly samples paired-end reads from two files
  select Filters sequences based on annotations
  sort Sorts sequences based on annotations
UnifyHeaders   Unifies annotation fields based on grouping scheme
  consensus Reassign fields to consensus values
  delete Delete sequences with differing field values.

Input and Output

All tools take as input standard FASTA or FASTQ formatted files and output files in the same formats. This allows pRESTO to work seamlessly with other sequence processing tools that use either of these data formats; any steps within a pRESTO workflow can be exchanged for an alternate tool, if desired.

Each tool appends a specific suffix to its output files describing the step and output. For example, MaskPrimers will append _primers-pass to the output file containing successfully aligned sequences and _primers-fail to the file containing unaligned sequences.

See also

Details regarding the suffixes used by pRESTO tools can be found in the Commandline Usage documentation for each tool.

Annotation Scheme

The majority of pRESTO tools manipulate and add sequences-specific annotations as part of their processing functions using the scheme shown below. Each annotation is delimited using a reserved character (| by default), with the annotation field name and values separated by a second reserved character (= by default), and each value within a field is separated by a third reserved character (, by default). These annotations follow the sequence identifier, which itself immediately follows the > (FASTA) or @ (FASTQ) symbol denoting the beginning of a new sequence entry. The sequence identifier is given the reserved field name ID. To mitigate potential analysis errors, each tool in pRESTO annotates sequences by appending values to existing annotation fields when they exist, and will not overwrite or delete annotations unless explicitly performed using the ParseHeaders tool. All reserved characters can be redefined using the command line options.

FASTA Annotation
>SEQUENCE_ID|PRIMER=IgHV-6,IgHC-M|BARCODE=DAY7|DUPCOUNT=8
NNNNCCACGATTGGTGAAGCCCTCGCAGACCCTCTCACTCACCTGTGCCATCTCCGGGGACAGTGTTTCTACCAAAA
FASTQ Annotation
@SEQUENCE_ID|PRIMER=IgHV-6,IgHC-M|BARCODE=DAY7|DUPCOUNT=8
NNNNCCACGATTGGTGAAGCCCTCGCAGACCCTCTCACTCACCTGTGCCATCTCCGGGGACAGTGTTTCTACCAAAA
+
!!!!nmoomllmlooj\Xlnngookkikloommononnoonnomnnlomononoojlmmkiklonooooooooomoo

See also

  • Details regarding the annotations added by pRESTO tools can be found in the Commandline Usage documentation for each tool.
  • The ParseHeaders tool provides a number of options for manipulating annotations in the pRESTO format.
  • The ConvertHeaders tool allows you convert several common annotation schemes into the pRESTO annotation format.

Download

The latest stable release of pRESTO may be downloaded from PyPI or Bitbucket.

Development versions and source code are available on Bitbucket.

Installation

The simplest way to install the latest stable release of pRESTO is via pip:

> pip3 install presto --user

The current development build can be installed using pip and mercurial in similar fashion:

> pip3 install hg+https://bitbucket.org/kleinstein/presto#default --user

If you currently have a development version installed, then you will likely need to add the arguments --upgrade --no-deps --force-reinstall to the pip3 command.

Requirements

Linux

  1. The simplest way to install all Python dependencies is to install the full SciPy stack using the instructions, then install Biopython according to its instructions.

  2. Download the pRESTO bundle and run:

    > pip3 install presto-x.y.z.tar.gz --user
    

Mac OS X

  1. Install Xcode. Available from the Apple store or developer downloads.

  2. Older versions Mac OS X will require you to install XQuartz 2.7.5. Available from the XQuartz project.

  3. Install Homebrew following the installation and post-installation instructions.

  4. Install Python 3.4.0+ and set the path to the python3 executable:

    > brew install python3
    > echo 'export PATH=/usr/local/bin:$PATH' >> ~/.profile
    
  5. Exit and reopen the terminal application so the PATH setting takes effect.

  6. You may, or may not, need to install gfortran (required for SciPy). Try without first, as this can take an hour to install and is not needed on newer releases. If you do need gfortran to install SciPy, you can install it using Homebrew:

    > brew install gfortran
    

    If the above fails run this instead:

    > brew install --env=std gfortran
    
  7. Install NumPy, SciPy, pandas and Biopyton using the Python package manager:

    > pip3 install numpy scipy pandas biopython
    
  8. Download pRESTO bundle, open a terminal window, change directories to download location, and run:

    > pip3 install presto-x.y.z.tar.gz
    

Windows

  1. Install Python 3.4.0+ from Python, selecting both the options ‘pip’ and ‘Add python.exe to Path’.

  2. Install NumPy, SciPy, pandas and Biopython using the packages available from the Unofficial Windows binary collection.

  3. Download pRESTO bundle, open a Command Prompt, change directories to the download folder, and run:

    > pip install presto-x.y.z.tar.gz
    
  4. For a default installation of Python 3.4, the pRESTO scripts will be installed into C:\Python34\Scripts and should be directly executable from the Command Prompt. If this is not the case, then follow step 5 below.

  5. Add both the C:\Python34 and C:\Python34\Scripts directories to your %Path%. On Windows 7 the %Path% setting is located under Control Panel -> System and Security -> System -> Advanced System Settings -> Environment variables -> System variables -> Path.

  6. If you have trouble with the .py file associations, try adding .PY to your PATHEXT environment variable. Also, opening a command prompt as Administrator and run:

    > assoc .py=Python.File
    > ftype Python.File="C:\Python34\python.exe" "%1" %*
    

Workflows

Roche 454 BCR mRNA with Multiplexed Samples

Overview of Experimental Data

This example uses publicly available data from:

Lineage structure of the human antibody repertoire in response to influenza vaccination.
Jiang N, He J, and Weinstein JA, et al.
Sci Transl Med. 2013. 5(171):171ra19. doi:10.1126/scitranslmed.3004794.

Which may be downloaded from the NCBI Sequence Read Archive under accession ID: SRX190717. For this example, we will use the first 50,000 sequences of sample 43 (accession: SRR765688), which may downloaded downloaded using fastq-dump from the SRA Toolkit:

fastq-dump -X 50000 SRR765688

Primer and sample barcode (referred to as MID in Jiang, He and Weinstein et al, 2013) sequences are available in the published manuscript. This example assumes that the sample barcodes, forward primers (V-region), and reverse primers (C-region) have been extracted from the manuscript and placed into three corresponding FASTA files.

Read Configuration
_images/Jiang2013_ReadConfiguration.svg

Schematic of the Roche 454 read configuration. The start of each sequences is labeled with the sample barcode. The following sequence may either be in the forward (top) orientation, proceeding 5’ to 3’ in the direction of the V(D)J reading frame, or the reverse complement orientation (bottom), proceeding in the opposite direction.

Example Data

We have hosted a small subset of the data (Accession: SRR765688) on the pRESTO website in FASTQ format, with accompanying primer and sample barcode (MID) files. The sample data set and workflow script may be downloaded from here:

Jiang, He and Weinstein et al, 2013 Example Files

Overview of the Workflow

The example that follows performs all processing steps to arrive at high-quality unique sequences using this example data set. The workflow is derived into three high level tasks:

A graphical representation of the workflow along with the corresponding sequence of pRESTO commands is shown below.

Flowchart
_images/Jiang2013_Flowchart.svg

Flowchart of processing steps. Each pRESTO tool is shown as a colored box. The workflow is divided into three primary tasks: (orange) quality control, (green) sample barcode and primer identification, (blue) deduplication and filtering of the repertoire. Grey boxes indicate the initial and final data files. The intermediate files output by each tool are not shown for the sake of brevity.

Commands
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
#!/usr/bin/env bash
FilterSeq.py length -s SRR765688.fastq -n 300 --outname S43 --log FSL.log
FilterSeq.py quality -s S43_length-pass.fastq -q 20 --outname S43 --log FSQ.log
MaskPrimers.py score -s S43_quality-pass.fastq -p SRR765688_MIDs.fasta \
    --start 0 --maxerror 0.1 --mode cut --pf MID \
    --outname S43-MID --log MPM.log
MaskPrimers.py align -s S43-MID_primers-pass.fastq -p SRX190717_VPrimers.fasta \
    --maxlen 50 --maxerror 0.3 --mode mask --pf VPRIMER \
    --outname S43-FWD --log MPV.log
MaskPrimers.py align -s S43-FWD_primers-pass.fastq -p SRX190717_CPrimers.fasta \
    --maxlen 50 --maxerror 0.3 --mode cut --pf CPRIMER --revpr --skiprc \
    --outname S43-REV --log MPC.log
CollapseSeq.py -s S43-REV_primers-pass.fastq -n 20 --inner --uf MID CPRIMER \
    --cf VPRIMER --act set --outname S43
SplitSeq.py group -s S43_collapse-unique.fastq -f DUPCOUNT --num 2 --outname S43
ParseHeaders.py table -s S43_atleast-2.fastq -f ID DUPCOUNT MID CPRIMER VPRIMER
ParseLog.py -l FSL.log -f ID LENGTH
ParseLog.py -l FSQ.log -f ID QUALITY
ParseLog.py -l MPM.log MPV.log MPC.log -f ID PRSTART PRIMER ERROR

Download Commands

Quality control

The initial stage of the workflow involves two executions of the FilterSeq tool. First, the length subcommand is used to filter reads which are too short to yield full V(D)J sequences using a liberal minimum length requirement of 300 bp (-n 300):

2
FilterSeq.py length -s SRR765688.fastq -n 300 --outname S43 --log FSL.log

Next, the quality subcommand removes sequences having a mean Phred quality score below 20 (-q 20).

3
FilterSeq.py quality -s S43_length-pass.fastq -q 20 --outname S43 --log FSQ.log

The ParseLog tool is then used to extract the results from the two FilterSeq log files:

17
ParseLog.py -l FSL.log -f ID LENGTH

To create two tab-delimited files containing the following results for each read:

Field Description
ID Sequence name
LENGTH Sequence length
QUALITY Quality score

Sample barcode and primer identification

Annotation of sample barcodes

Following the initial filtering steps, additional filtering is performed with three iterations of the MaskPrimers tool based upon the presence of recognized sample barcode (MID), forward primer, and reverse primer sequences. As the orientation and position of the sample barcode is known, the first pass through MaskPrimers uses the faster score subcommand which requires a fixed start position (--start 0) and a low allowable error rate (--maxerror 0.1). The name of the MID will be annotated in the sequence header as the MID field (--pf MID):

4
5
6
MaskPrimers.py score -s S43_quality-pass.fastq -p SRR765688_MIDs.fasta \
    --start 0 --maxerror 0.1 --mode cut --pf MID \
    --outname S43-MID --log MPM.log
Primer masking and annotation

The next MaskPrimers task uses the align subcommand to identify both the start position of the V-segment primer and correct the orientation of the sequence such that all reads are now oriented in the direction of the V(D)J reading frame), as determined by the orientation of the V-segment primer match. The name of the V-segment primer will be annotated in the sequence header as the VPRIMER field (--pf VPRIMER):

7
8
9
MaskPrimers.py align -s S43-MID_primers-pass.fastq -p SRX190717_VPrimers.fasta \
    --maxlen 50 --maxerror 0.3 --mode mask --pf VPRIMER \
    --outname S43-FWD --log MPV.log

The final MaskPrimers task locates the C-region primer, which is used for isotype assignment of each read. As all sequences are assumed to have been properly oriented by the second MaskPrimers task, the additional arguments MaskPrimers align --revpr and MaskPrimers align --skiprc are added to the third execution. The MaskPrimers align --revpr argument informs the tool that primers sequences should be reverse complemented prior to alignment, and that a match should be searched for (and cut from) the tail end of the sequence. The MaskPrimers align --skiprc argument tells the tool to align against only the forward sequence; meaning, it will not check primer matches against the reverse complement sequence and it will not reorient sequences. The name of the C-region primer will be annotated in the sequence header as the CPRIMER field (--pf CPRIMER):

10
11
12
MaskPrimers.py align -s S43-FWD_primers-pass.fastq -p SRX190717_CPrimers.fasta \
    --maxlen 50 --maxerror 0.3 --mode cut --pf CPRIMER --revpr --skiprc \
    --outname S43-REV --log MPC.log

At this stage, a table of primers and alignment error rates may be generated by executing ParseLog on the log file of each MaskPrimers tasks:

19
ParseLog.py -l MPM.log MPV.log MPC.log -f ID PRSTART PRIMER ERROR

Which will contain the following information for each log file:

Field Description
ID Sequence name
PRIMER Primer or sample barcode name
ERROR Primer match error rate

Deduplication and filtering

Removal of duplicate sequences

The final stage of the workflow involves two filtering steps to yield unique sequences for each sample barcode. First, the set of unique sequences is identified using the CollapseSeq tool, allowing for up to 20 interior N-valued positions (-n 20 and --inner), and requiring that all reads considered duplicated share the same isotype and sample barcode tag (--uf MID CPRIMER). Additionally, the V-region primer annotations of the set of duplicate reads are propagated into the annotation of each retained unique sequence (--cf VPRIMER and --act set):

13
14
CollapseSeq.py -s S43-REV_primers-pass.fastq -n 20 --inner --uf MID CPRIMER \
    --cf VPRIMER --act set --outname S43
Filtering to repeated sequences

CollapseSeq stores the count of duplicate reads for each sequence in the DUPCOUNT annotation. Following duplicate removal, the data is subset to only those unique sequence with at least two representative reads by using the group subcommand of SplitSeq on the count field (-f DUPCOUNT) and specifying a numeric threshold (--num 2):

15
SplitSeq.py group -s S43_collapse-unique.fastq -f DUPCOUNT --num 2 --outname S43
Creating an annotation table

Finally, the annotations, including the sample barcode (MID), duplicate read count (DUPCOUNT), isotype (CPRIMER) and V-region primer (VPRIMER), of the final repertoire are then extracted from the SplitSeq output into a tab-delimited file using the table subcommand of ParseHeaders:

16
ParseHeaders.py table -s S43_atleast-2.fastq -f ID DUPCOUNT MID CPRIMER VPRIMER

Note

Optionally, you may split each sample into separate files using the MID annotation and an alternate invocation of SplitSeq. The group subcommand may be used to split files on a categorical field, rather than a numerical field, by skipping the --num argument:

SplitSeq.py group -s M1_collapse-unique.fastq -f MID

Will split the unique sequence file into a set of separate files according the the valud in the MID field (-f MID), such that each file will contain sequences from only one sample.

Output files

The final set of sequences, which serve as input to a V(D)J reference aligner (Eg, IMGT/HighV-QUEST or IgBLAST), and tables that can be plotted for quality control are:

File Description
S43_collapse-unique.fastq Total unique sequences
S43_atleast-2.fastq Unique sequences represented by at least 2 reads
S43_atleast-2_headers.tab Annotation table of the atleast-2 file
FSL_table.tab Table of the FilterSeq-length log
FSQ_table.tab Table of the FilterSeq-quality log
MPM_table.tab Table of the MID MaskPrimers log
MPV_table.tab Table of the V-segment MaskPrimers log
MPC_table.tab Table of the C-region MaskPrimers log

A number of other intermediate and log files are generated during the workflow, which allows easy tracking/reversion of processing steps. These files are not listed in the table above.

Performance

Example performance statistics for a comparable, but larger, 454 workflow are presented below. Performance was measured on a 64-core system with 2.3GHz AMD Opteron(TM) 6276 processors and 512GB of RAM, with memory usage measured at peak utilization. The data set contained 1,346,039 raw reads, and required matching of 11 sample barcodes, 11 V-segment primers, and 5 C-region primers.

Line Tool Reads Cores MB Minutes
01 FilterSeq.py length 1,346,039 10 669 12.2
02 FilterSeq.py quality 1,184,905 10 619 11.5
03 MaskPrimers.py score 1,184,219 10 621 18.5
04 MaskPrimers.py align 1,177,245 10 618 263.1
05 MaskPrimers.py align 959,920 10 563 122.9
07 ParseHeaders.py expand 548,658 1 329 3.6
08 ParseHeaders.py rename 548,658 1 329 4.1
09 CollapseSeq.py 548,658 1 1,509 5.6

Illumina MiSeq 2x250 BCR mRNA

Overview of Experimental Data

This example uses publicly available data from:

Quantitative assessment of the robustness of next-generation sequencing of antibody variable gene repertoires from immunized mice.
Greiff, V. et al.
BMC Immunol. 2014. 15(1):40. doi:10.1186/s12865-014-0040-5.

Which may be downloaded from the EBI European Nucleotide Archive under accession ID: ERP003950. For this example, we will use the first 25,000 sequences of sample Replicate-1-M1 (accession: ERR346600), which may downloaded using fastq-dump from the SRA Toolkit:

fastq-dump --split-files -X 25000 ERR346600

Primers sequences are available in additional file 1 of the publication.

Read Configuration
_images/Greiff2014_ReadConfiguration.svg

Schematic of Illumina MiSeq 2x250 stranded paired-end reads without UMI barcodes. Each 250 base-pair read was sequenced from one end of the target cDNA, so that the two reads together cover the entire variable region of the Ig heavy chain. The V(D)J reading frame proceeds from the start of read 2 to the start of read 1. Read 1 is in the opposite orientation (reverse complement), and contains the C-region primer sequence. Both reads begin with a random sequences of 4 nucleotides.

Example Data

We have hosted a small subset of the data (Accession: ERR346600) on the pRESTO website in FASTQ format, with accompanying primer files and an example workflow script. The sample data set and workflow script may be downloaded from here:

Greiff et al, 2014 Example Files

Overview of the Workflow

In the following sections, we demonstrate each step of the workflow to move from raw sequence reads to a fully annotated repertoire of complete V(D)J sequences. The workflow is divided into four high-level tasks:

A graphical representation of the workflow along with the corresponding sequence of pRESTO commands is shown below.

Flowchart
_images/Greiff2014_Flowchart.svg

Flowchart of processing steps. Each pRESTO tool is shown as a colored box. The workflow is divided into three primary tasks: (orange) paired-end assembly, (green) quality control and primer annotation, and deduplication and filtering (blue). The intermediate files output by each tool are not shown for the sake of brevity.

Commands
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
#!/usr/bin/env bash
AssemblePairs.py align -1 ERR346600_2.fastq -2 ERR346600_1.fastq \
    --coord sra --rc tail --outname M1 --log AP.log
FilterSeq.py quality -s M1_assemble-pass.fastq -q 20 --outname M1 --log FS.log
MaskPrimers.py score -s M1_quality-pass.fastq -p Greiff2014_VPrimers.fasta \
    --start 4 --mode mask --pf VPRIMER --outname M1-FWD --log MPV.log
MaskPrimers.py score -s M1-FWD_primers-pass.fastq -p Greiff2014_CPrimers.fasta \
    --start 4 --mode cut --revpr --pf CPRIMER --outname M1-REV --log MPC.log
CollapseSeq.py -s M1-REV_primers-pass.fastq -n 20 --inner --uf CPRIMER \
    --cf VPRIMER --act set --outname M1
SplitSeq.py group -s M1_collapse-unique.fastq -f DUPCOUNT --num 2 --outname M1
ParseHeaders.py table -s M1_atleast-2.fastq -f ID DUPCOUNT CPRIMER VPRIMER
ParseLog.py -l AP.log -f ID LENGTH OVERLAP ERROR PVALUE
ParseLog.py -l FS.log -f ID QUALITY
ParseLog.py -l MPV.log MPC.log -f ID PRIMER ERROR

Download Commands

Paired-end assembly

Each set of paired-ends mate-pairs is first assembled into a full length Ig sequence using the align subcommand of the AssemblePairs tool:

2
3
AssemblePairs.py align -1 ERR346600_2.fastq -2 ERR346600_1.fastq \
    --coord sra --rc tail --outname M1 --log AP.log

During assembly we have defined read 2 (V-region) as the head of the sequence (-1) and read 1 as the tail of the sequence (-2). The --coord argument defines the format of the sequence header so that AssemblePairs can properly identify mate-pairs; in this case, we use --coord sra as our headers are in the SRA/ENA format.

Note

For both the AssemblePairs and PairSeq commands using the correct --coord argument is critical for matching mate-pairs. If this was raw data from Illumina, rather than data downloaded from SRA/ENA, then the appropriate argument would be --coord illumina.

The ParseLog tool is then used to build a tab-delimited file of results from the AssemblePairs log file:

13
ParseLog.py -l AP.log -f ID LENGTH OVERLAP ERROR PVALUE

Which will containing the following columns:

Field Description
ID Sequence name
LENGTH Length of the assembled sequence
OVERLAP Length of the overlap between mate-pairs
ERROR Mismatch rate of the overlapping region
PVALUE P-value for the assembly

See also

Depending on the amplicon length in your data, not all mate-pairs may overlap. For the sake of simplicity, we have excluded a demonstration of assembly in such cases. pRESTO provides a couple approaches to deal with such reads. The reference subcommand of AssemblePairs can use the ungapped V-segment reference sequences to properly space non-overlapping reads. Or, if all else fails, the join subcommand can be used to simply stick mate-pairs together end-to-end with some intervening gap.

Quality control and primer annotation

Removal of low quality reads

Quality control begins with the identification and removal of low-quality reads using the quality subcommand of the FilterSeq tool. In this example, reads with mean Phred quality scores less than 20 (-q 20) are removed:

4
FilterSeq.py quality -s M1_assemble-pass.fastq -q 20 --outname M1 --log FS.log

The ParseLog tool is then used to build tab-delimited file from the FilterSeq log:

14
ParseLog.py -l FS.log -f ID QUALITY

Capturing the following annotations:

Field Description
ID Sequence name
QUALITY Quality score
Read annotation and masking of primer regions

When dealing with Ig sequences, it is important to cut or mask the primers, as B cell receptors are subject to somatic hypermutation (the accumulation of point mutations in the DNA) and degenerate primer matches can look like mutations in downstream applications. The score subcommand of MaskPrimers is used to identify and remove the V-segment and C-region PCR primers for both reads:

5
6
7
8
MaskPrimers.py score -s M1_quality-pass.fastq -p Greiff2014_VPrimers.fasta \
    --start 4 --mode mask --pf VPRIMER --outname M1-FWD --log MPV.log
MaskPrimers.py score -s M1-FWD_primers-pass.fastq -p Greiff2014_CPrimers.fasta \
    --start 4 --mode cut --revpr --pf CPRIMER --outname M1-REV --log MPC.log

In this data set the authors have added a random sequence of 4 bp to the start of each read before the primer sequence to increase sequence diversity and the reliability of cluster calling on the Illumina platform. As such, both primers begin at position 4 (--start 4), but the C-region primer begins 4 bases from the end of the assembled read. The addition of the --revpr argument to the second MaskPrimers step instructs the tool to reverse complement the primer sequences and check the tail of the read. The two primer regions have also been treated differently. The V-segment primer has been masked (replaced by Ns) using the --mode mask argument to preserve the V(D)J length, while the C-region primer has been removed from the sequence using the --mode cut argument.

During each iteration of the MaskPrimers tool an annotation field was added with the name of the primer, with the name of the field specified by the --pf argument, such that after both iterations each sequence contains an annotation of the form:

VPRIMER=V-segment primer|CPRIMER=C-region primer

To summarize these steps, the ParseLog tool is used to build tab-delimited files from the two MaskPrimers logs:

15
ParseLog.py -l MPV.log MPC.log -f ID PRIMER ERROR

Capturing the following annotations:

Field Description
ID Sequence name
PRIMER Primer name
ERROR Primer match error rate

Note

This library was prepared in a stranded manner. Meaning, the read orientation is constant for all reads; read 1 is always the C-region end of the amplicon and read 2 is always the V-segment end. If your data is unstranded (50% of the reads are forward, 50% are reversed), then you must modify the first MaskPrimers step to account for this by using the align subcommand instead:

MaskPrimers.py align -s M1*quality-pass.fastq -p Greiff2014_VPrimers.fasta \
    --maxlen 30 --mode mask --pf VPRIMER --log MP1.log

This will perform a slower process of locally aligning the primers, checking the reverse compliment of each read for matches, and correcting the the output sequences to the forward orientation (V to J).

Deduplication and filtering

Removal of duplicate sequences

The last stage of the workflow involves two filtering steps to yield the final repertoire. First, the set of unique sequences is identified using the CollapseSeq tool, allowing for up to 20 interior N-valued positions (-n 20 and --inner), and requiring that all reads considered duplicates share the same C-region primer annotation (--uf CPRIMER). Additionally, the V-segment primer annotations of the set of duplicate reads are propagated into the annotation of each retained unique sequence (--cf VPRIMER and --act set):

 9
10
CollapseSeq.py -s M1-REV_primers-pass.fastq -n 20 --inner --uf CPRIMER \
    --cf VPRIMER --act set --outname M1
Filtering to repeated sequences

CollapseSeq stores the count of duplicate reads for each sequence in the DUPCOUNT annotation. Following duplicate removal, the data is subset to only those unique sequence with at least two representative reads by using the group subcommand of SplitSeq on the count field (-f DUPCOUNT) and specifying a numeric threshold (--num 2):

11
SplitSeq.py group -s M1_collapse-unique.fastq -f DUPCOUNT --num 2 --outname M1
Creating an annotation table

Finally, the annotations, including duplicate read count (DUPCOUNT), isotype (CPRIMER) and V-segment primer (VPRIMER), of the final repertoire are then extracted from the SplitSeq output into a tab-delimited file using the table subcommand of ParseHeaders:

12
ParseHeaders.py table -s M1_atleast-2.fastq -f ID DUPCOUNT CPRIMER VPRIMER

Output files

The final set of sequences, which serve as input to a V(D)J reference aligner (Eg, IMGT/HighV-QUEST or IgBLAST), and tables that can be plotted for quality control are:

File Description
M1_collapse-unique.fastq Total unique sequences
M1_atleast-2.fastq Unique sequences represented by at least 2 reads
M1_atleast-2_headers.tab Annotation table of the atleast-2 file
AP_table.tab Table of the AssemblePairs log
FS_table.tab Table of the FilterSeq log
MPV_table.tab Table of the V-segment MaskPrimers log
MPC_table.tab Table of the C-region MaskPrimers log

A number of other intermediate and log files are generated during the workflow, which allows easy tracking/reversion of processing steps. These files are not listed in the table above.

UMI Barcoded Illumina MiSeq 2x250 BCR mRNA

Overview of Experimental Data

This example uses publicly available data from:

B cells populating the multiple sclerosis brain mature in the draining cervical lymph nodes.
Stern JNH, Yaari G, and Vander Heiden JA, et al.
Sci Transl Med. 2014. 6(248):248ra107. doi:10.1126/scitranslmed.3008879.

Which may be downloaded from the NCBI Sequence Read Archive under BioProject accession ID: PRJNA248475. For this example, we will use the first 25,000 sequences of sample M12 (accession: SRR1383456), which may downloaded downloaded using fastq-dump from the SRA Toolkit:

fastq-dump --split-files -X 25000 SRR1383456

Primers sequences are available online at the supplemental website for the publication.

Read Configuration
_images/Stern2014_ReadConfiguration.svg

Schematic of the Illumina MiSeq 2x250 paired-end reads with UMI barcodes. Each 250 base-pair read was sequenced from one end of the target cDNA, so that the two reads together cover the entire variable region of the Ig heavy chain. The V(D)J reading frame proceeds from the start of read 2 to the start of read 1. Read 1 is in the opposite orientation (reverse complement), and contains a 15 nucleotide UMI barcode preceding the C-region primer sequence.

Example Data

We have hosted a small subset of the data (Accession: SRR1383456) on the pRESTO website in FASTQ format with accompanying primer files. The sample data set and workflow script may be downloaded from here:

Stern, Yaari and Vander Heiden et al, 2014 Example Files

Overview of the Workflow

In the following sections, we demonstrate each step of the workflow to move from raw sequence reads to a fully annotated repertoire of complete V(D)J sequences. The workflow is divided into four high-level tasks:

A graphical representation of the workflow along with the corresponding sequence of pRESTO commands is shown below.

Flowchart
_images/Stern2014_Flowchart.svg

Flowchart of processing steps. Each pRESTO tool is shown as a colored box. The workflow is divided into four primary tasks: (red) quality control, UMI annotation and primer masking; (orange) generation of UMI consensus sequences; (green) paired-end assembly of UMI consensus sequences; and (blue) deduplication and filtering to obtain the high-fidelity repertoire. Grey boxes indicate the initial and final data files. The intermediate files output by each tool are not shown for the sake of brevity.

Commands
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
#!/usr/bin/env bash
FilterSeq.py quality -s SRR1383456_1.fastq -q 20 --outname MS12_R1 --log FS1.log
FilterSeq.py quality -s SRR1383456_2.fastq -q 20 --outname MS12_R2 --log FS2.log
MaskPrimers.py score -s MS12_R1_quality-pass.fastq -p Stern2014_CPrimers.fasta \
    --start 15 --mode cut --barcode --outname MS12_R1 --log MP1.log
MaskPrimers.py score -s MS12_R2_quality-pass.fastq -p Stern2014_VPrimers.fasta \
    --start 0 --mode mask --outname MS12_R2 --log MP2.log
PairSeq.py -1 MS12_R1_primers-pass.fastq -2 MS12_R2_primers-pass.fastq \
    --1f BARCODE --coord sra
BuildConsensus.py -s MS12_R1_primers-pass_pair-pass.fastq --bf BARCODE --pf PRIMER \
    --prcons 0.6 --maxerror 0.1 --maxgap 0.5 --outname MS12_R1 --log BC1.log
BuildConsensus.py -s MS12_R2_primers-pass_pair-pass.fastq --bf BARCODE --pf PRIMER \
    --maxerror 0.1 --maxgap 0.5 --outname MS12_R2 --log BC2.log
PairSeq.py -1 MS12_R1_consensus-pass.fastq -2 MS12_R2_consensus-pass.fastq \
    --coord presto
AssemblePairs.py align -1 MS12_R2_consensus-pass_pair-pass.fastq \
    -2 MS12_R1_consensus-pass_pair-pass.fastq --coord presto --rc tail \
    --1f CONSCOUNT --2f CONSCOUNT PRCONS --outname MS12 --log AP.log
ParseHeaders.py collapse -s MS12_assemble-pass.fastq -f CONSCOUNT --act min
CollapseSeq.py -s MS12*reheader.fastq -n 20 --inner --uf PRCONS \
    --cf CONSCOUNT --act sum --outname MS12
SplitSeq.py group -s MS12_collapse-unique.fastq -f CONSCOUNT --num 2 --outname MS12
ParseHeaders.py table -s MS12_atleast-2.fastq -f ID PRCONS CONSCOUNT DUPCOUNT
ParseLog.py -l FS1.log FS2.log -f ID QUALITY
ParseLog.py -l MP1.log MP2.log -f ID PRIMER BARCODE ERROR
ParseLog.py -l BC1.log BC2.log -f BARCODE SEQCOUNT CONSCOUNT PRIMER PRCONS PRCOUNT \
    PRFREQ ERROR
ParseLog.py -l AP.log -f ID LENGTH OVERLAP ERROR PVALUE FIELDS1 FIELDS2

Download Commands

Quality control, UMI annotation and primer masking

Removal of low quality reads

Quality control begins with the identification and removal of low-quality reads using the quality subcommand of the FilterSeq tool. In this example, reads with mean Phred quality scores less than 20 (-q 20) are removed:

2
3
FilterSeq.py quality -s SRR1383456_1.fastq -q 20 --outname MS12_R1 --log FS1.log
FilterSeq.py quality -s SRR1383456_2.fastq -q 20 --outname MS12_R2 --log FS2.log

The ParseLog tool is then used to extract results from the FilterSeq logs into tab-delimited files:

24
ParseLog.py -l FS1.log FS2.log -f ID QUALITY

Extracting the following information from the log:

Field Description
ID Sequence name
QUALITY Quality score
UMI annotation and masking of primer regions

Next, the score subcommand of MaskPrimers is used to identify and remove the PCR primers for both reads. When dealing with Ig sequences, it is important to cut or mask the primers, as B cell receptors are subject to somatic hypermutation (the accumulation of point mutations in the DNA) and degenerate primer matches can look like mutations in downstream applications. The MaskPrimers tool is also used to annotate each read 1 sequence with the 15 nucleotide UMI that precedes the C-region primer (MaskPrimers score --barcode):

4
5
6
7
MaskPrimers.py score -s MS12_R1_quality-pass.fastq -p Stern2014_CPrimers.fasta \
    --start 15 --mode cut --barcode --outname MS12_R1 --log MP1.log
MaskPrimers.py score -s MS12_R2_quality-pass.fastq -p Stern2014_VPrimers.fasta \
    --start 0 --mode mask --outname MS12_R2 --log MP2.log

To summarize these steps, the ParseLog tool is used to build a tab-delimited file from the MaskPrimers log:

25
ParseLog.py -l MP1.log MP2.log -f ID PRIMER BARCODE ERROR

Containing the following information:

Field Description
ID Sequence name
PRIMER Primer name
BARCODE UMI sequence
ERROR Primer match error rate

Note

For this data set the UMI is immediately upstream of the C-region primer. Another common approach for UMI barcoding involves placing the UMI immediately upstream of a 5’RACE template switch site. Modifying the workflow is simple for this case. You just need to replace the V-segment primers with a fasta file containing the TS sequences and move the --barcode argument to the appropriate read:

MaskPrimers.py score -s R1_quality-pass.fastq -p CPrimers.fasta \
    --start 0 --mode cut --outname R1 --log MP1.log
MaskPrimers.py score -s R2_quality-pass.fastq -p TSSites.fasta \
    --start 17 --barcode --mode cut --maxerror 0.5 \
    --outname R2 --log MP2.log

In the above we have moved the UMI annotation to read 2, increased the allowable error rate for matching the TS site (--maxerror 0.5), cut the TS site (--mode cut), and increased the size of the UMI from 15 to 17 nucleotides (--start 17).

Generation of UMI consensus sequences

Copying the UMI annotation across paired-end files

In this task, a single consensus sequence is constructed for each set of reads annotated with the same UMI barcode. As the UMI barcode is part of read 1, the BARCODE annotation identified by MaskPrimers must first be copied to the read 2 mate-pair of each read 1 sequence. Propogation of annotations between mate pairs is performed using PairSeq which also removes unpaired reads and ensures that paired reads are sorted in the same order across files:

8
9
PairSeq.py -1 MS12_R1_primers-pass.fastq -2 MS12_R2_primers-pass.fastq \
    --1f BARCODE --coord sra

Note

For both the PairSeq and AssemblePairs commands using the correct --coord argument is critical for matching mate-pairs. If this was raw data from Illumina, rather than data downloaded from SRA/ENA, then the appropriate argument would be --coord illumina.

Note

If you have followed the 5’RACE modification above, then you must also modify the first PairSeq step to copy the UMI from read 2 to read 1, instead of vice versa (--2f BARCODE):

PairSeq.py -1 R1_primers-pass.fastq -2 R2_primers-pass.fastq \
    --2f BARCODE --coord sra
Multiple alignment of UMI read groups

Before generating a consensus for a set of reads sharing a UMI barcode, the sequences must be properly aligned. Sequences may not be aligned if more than one PCR primer is identified in a UMI read group - leading to variations in the the start positions of the reads. Ideally, each set of reads originating from a single mRNA molecule should be amplified with the same primer. However, different primers in the multiplex pool may be incorporated into the same UMI read group during amplification if the primers are sufficiently similar. This type of primer misalignment can be corrected using the AlignSets tool. In the example data used here, this step was not necessary due to the aligned primer design for the 45 V-segment primers, though this does require that the V-segment primers be masked, rather than cut, during the MaskPrimers step (--mode mask).

See also

If your data requires alignment, then you can create multiple aligned UMI read groups as follows:

AlignSets.py muscle -s R1_primers-pass_pair-pass.fastq --bf BARCODE \
    --exec ~/bin/muscle --outname R1 --log AS1.log
AlignSets.py muscle -s R2_primers-pass_pair-pass.fastq --bf BARCODE \
    --exec ~/bin/muscle --outname R2 --log AS2.log

Where the --bf BARCODE defines the field containing the UMI and --exec ~/bin/muscle is the location of the MUSCLE executable.

For additional details see the section on fixing UMI alignments.

Generating UMI consensus reads

After alignment, a single consensus sequence is generated for each UMI barcode using BuildConsensus:

10
11
12
13
BuildConsensus.py -s MS12_R1_primers-pass_pair-pass.fastq --bf BARCODE --pf PRIMER \
    --prcons 0.6 --maxerror 0.1 --maxgap 0.5 --outname MS12_R1 --log BC1.log
BuildConsensus.py -s MS12_R2_primers-pass_pair-pass.fastq --bf BARCODE --pf PRIMER \
    --maxerror 0.1 --maxgap 0.5 --outname MS12_R2 --log BC2.log

To correct for UMI chemistry and sequencing errors, UMI read groups having high error statistics (mismatch rate from consensus) are removed by specifiying the --maxerror 0.1 threshold. As the accuracy of the primer assignment in read 1 is critical for correct isotype identification, additional filtering of read 1 is carried out during this step. Specifying the --prcons 0.6 threshold: (a) removes individual sequences that do not share a common primer annotation with the majority of the set, (b) removes entire read groups which have ambiguous primer assignments, and (c) constructs a consensus primer assignment for each UMI.

Note

The --maxgap 0.5 argument tells BuildConsensus to use a majority rule to delete any gap positions which occur in more than 50% of the reads. The --maxgap argument is not really necessary for this example data set as we did not perform a multiple alignment of the UMI read groups. However, if you have performed an alignment, then use of --maxgap during consensus generation is highly recommended.

The ParseLog tool is then used to build a tab-delimited file contain the consensus results:

26
ParseLog.py -l BC1.log BC2.log -f BARCODE SEQCOUNT CONSCOUNT PRIMER PRCONS PRCOUNT \

With the following annotations:

Field Description
BARCODE UMI sequence
SEQCOUNT Number of total reads in the UMI group
CONSCOUNT Number of reads used for the UMI consensus
PRIMER Set of primer names in the UMI group
PRCONS Consensus primer name
PRCOUNT Count of primers in the UMI group
PRFREQ Frequency of primers in the UMI group
ERROR Average mismatch rate from consensus

Paired-end assembly of UMI consensus sequences

Syncronizing paired-end files

Following UMI consensus generation, the read 1 and read 2 files may again be out of sync due to differences in UMI read group filtering by BuildConsensus. To synchronize the reads another instance of PairSeq must be run, but without any annotation manipulation:

14
15
PairSeq.py -1 MS12_R1_consensus-pass.fastq -2 MS12_R2_consensus-pass.fastq \
    --coord presto
Assembling UMI consensus mate-pairs

Once the files have been synchronized, each paired-end UMI consensus sequence is assembled into a full length Ig sequence using the align subcommand of AssemblePairs:

16
17
18
AssemblePairs.py align -1 MS12_R2_consensus-pass_pair-pass.fastq \
    -2 MS12_R1_consensus-pass_pair-pass.fastq --coord presto --rc tail \
    --1f CONSCOUNT --2f CONSCOUNT PRCONS --outname MS12 --log AP.log

During assembly, the consensus isotype annotation (PRCONS) from read 1 and the number of reads used to define the consensus sequence (CONSCOUNT) for both reads are propagated into the annotations of the full length Ig sequence (--1f CONSCOUNT --2f CONSCOUNT PRCONS.

ParseLog is then uses to extract the results from the AssemblePairs log into a tab-delimited file:

27
    PRFREQ ERROR

Containing the following information:

Field Description
ID Sequence name (UMI)
LENGTH Length of the assembled sequence
OVERLAP Length of the overlap between mate-pairs
ERROR Mismatch rate of the overlapping region
PVALUE P-value for the assembly
FIELDS1 Annotations copied from read 2 into the assembled sequence
FIELDS2 Annotations copied from read 1 into the assembled sequence

See also

Depending on the amplicon length in your data, not all mate-pairs may overlap. For the sake of simplicity, we have excluded a demonstration of assembly in such cases. pRESTO provides a couple approaches to deal with such reads. The reference subcommand of AssemblePairs can use the ungapped V-segment reference sequences to properly space non-overlapping reads. Or, if all else fails, the join subcommand can be used to simply stick mate-pairs together end-to-end with some intervening gap.

Deduplication and filtering

Combining UMI read group size annotations

In the final stage of the workflow, the high-fidelity Ig repertoire is obtained by a series of filtering steps. First, the annotation specifying the number of raw reads used to build each sequence (-f CONSCOUNT) is updated to be the minimum (--act min) of the forward and reverse reads using the collapse subcommand of ParseHeaders:

19
ParseHeaders.py collapse -s MS12_assemble-pass.fastq -f CONSCOUNT --act min
Removal of duplicate sequences

Second, duplicate nucleotide sequences are removed using the CollapseSeq tool with the requirement that duplicate sequences share the same isotype primer (--uf PRCONS). The duplicate removal step also removes sequences with a high number of interior N-valued nucleotides (-n 20 and --inner) and combines the read counts for each UMI read group (--cf CONSCOUNT and --act sum).

20
21
CollapseSeq.py -s MS12*reheader.fastq -n 20 --inner --uf PRCONS \
    --cf CONSCOUNT --act sum --outname MS12
Filtering to sequences with at least two representative reads

Finally, unique sequences are filtered to those with at least 2 contributing sequences using the group subcommand of SplitSeq, by splitting the file on the CONSCOUNT annotation with a numeric threshold (-f CONSCOUNT and --num 2):

22
SplitSeq.py group -s MS12_collapse-unique.fastq -f CONSCOUNT --num 2 --outname MS12
Creating an annotation table

For further analysis, the annotations of the final repertoire are then converted to into a table using the table subcommand of ParseHeaders:

23
ParseHeaders.py table -s MS12_atleast-2.fastq -f ID PRCONS CONSCOUNT DUPCOUNT

Output files

The final set of sequences, which serve as input to a V(D)J reference aligner (Eg, IMGT/HighV-QUEST or IgBLAST), and tables that can be plotted for quality control are:

File Description
M12_collapse-unique.fastq Total unique sequences
M12_atleast-2.fastq Unique sequences represented by at least 2 reads
M12_atleast-2_headers.tab Annotation table of the atleast-2 file
FS1_table.tab Table of the read 1 FilterSeq log
FS2_table.tab Table of the read 2 FilterSeq log
MP1_table.tab Table of the C-region MaskPrimers log
MP2_table.tab Table of the V-segment MaskPrimers log
BC1_table.tab Table of the read 1 BuildConsensus log
BC2_table.tab Table of the read 2 BuildConsensus log
AP_table.tab Table of the AssemblePairs log

A number of other intermediate and log files are generated during the workflow, which allows easy tracking/reversion of processing steps. These files are not listed in the table above.

Performance

Example performance statistics for a comparable, but larger, MiSeq workflow are presented below. Performance was measured on a 64-core system with 2.3GHz AMD Opteron(TM) 6276 processors and 512GB of RAM, with memory usage measured at peak utilization. The data set contained 1,723,558 x 2 raw reads, and required matching of 1 C-region primer, 45 V-segment primers, and averaged 24.3 reads per UMI.

Line Tool Reads Cores MB Minutes
01 R1: FilterSeq.py quality 1,723,558 10 1,219 13.0
02 R2: FilterSeq.py quality 1,723,558 10 1,219 12.8
03 R1: MaskPrimers.py score 1,722,116 10 1,221 18.9
04 R2: MaskPrimers.py score 1,684,050 10 1,221 46.7
06 PairSeq.py 1,665,584 1 4,734 25.5
07 R1: BuildConsensus.py 1,565,017 10 1,228 50.1
08 R2: BuildConsensus.py 1,565,017 10 1,229 58.6
11 AssemblePairs.py align 66,285 10 358 3.0
13 ParseHeaders.py collapse 56,104 1 88 0.4
14 CollapseSeq.py 55,480 1 822 0.7
15 SplitSeq.py group 51,047 1 88 0.2

UMI Barcoded Illumina MiSeq 325+275 paired-end 5’RACE BCR mRNA

Overview of Experimental Data

This example uses publicly available data from:

Dysregulation of B Cell Repertoire Formation in Myasthenia Gravis Patients Revealed through Deep Sequencing.
Vander Heiden JA, et al.
J Immunol. 2017 198(4):1460-1473. doi:10.4049/jimmunol.1601415.

Which may be downloaded from the NCBI Sequence Read Archive under BioProject accession ID: PRJNA248475. For this example, we will use the first 25,000 sequences of sample HD09_N_AB8KB (accession: SRR4026043), which may downloaded downloaded using fastq-dump from the SRA Toolkit:

fastq-dump --split-files -X 25000 SRR4026043

Primers sequences are available online from the protocols/AbSeq directory in the Immcantation repository.

Read Configuration
_images/VanderHeiden2017_ReadConfiguration.svg

Schematic of the Illumina MiSeq 325+275 paired-end reads with UMI barcodes. Each read was sequenced from one end of the target cDNA so that the two reads together cover the entire variable region of the Ig heavy chain. Sequencing was performed using an uneven number of cycles for read 1 and read 2 using a 2x300 kit. The V(D)J reading frame proceeds from the start of read 2 to the start of read 1. Read 1 is in the opposite orientation (reverse complement), contains a partial C-region, and is 325 nucletoides in length. Read 2 contains the 5’RACE template switch site with a 17 nucleotide UMI barcode preceding it, and is 275 nucleotides in length.

Example Data

We have hosted a small subset of the data (Accession: SRR1383456) on the pRESTO website in FASTQ format with accompanying primer files. The sample data set and workflow script may be downloaded from here:

Vander Heiden et al, 2017 Example Files

Overview of the Workflow

In the following sections, we demonstrate each step of the workflow to move from raw sequence reads to a fully annotated repertoire of complete V(D)J sequences. The workflow is divided into four high-level tasks:

A graphical representation of the workflow along with the corresponding sequence of pRESTO commands is shown below.

Flowchart
_images/VanderHeiden2017_Flowchart.svg

Flowchart of processing steps. Each pRESTO tool is shown as a colored box. The workflow is divided into four primary tasks: (red) quality control, UMI annotation and primer masking; (orange) generation of UMI consensus sequences; (green) paired-end assembly of UMI consensus sequences; and (blue) deduplication and filtering to obtain the high-fidelity repertoire. Grey boxes indicate the initial and final data files. The intermediate files output by each tool are not shown for the sake of brevity.

Commands
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
#!/usr/bin/env bash
FilterSeq.py quality -s SRR4026043_1.fastq -q 20 --outname HD09N-R1 --log FS1.log
FilterSeq.py quality -s SRR4026043_2.fastq -q 20 --outname HD09N-R2 --log FS2.log
MaskPrimers.py score -s HD09N-R1_quality-pass.fastq -p AbSeq_R1_Human_IG_Primers.fasta \
    --start 0 --mode cut --outname HD09N-R1 --log MP1.log
MaskPrimers.py score -s HD09N-R2_quality-pass.fastq -p AbSeq_R2_TS.fasta \
    --start 17 --barcode --mode cut --maxerror 0.5 --outname HD09N-R2 --log MP2.log
PairSeq.py -1 HD09N-R1_primers-pass.fastq -2 HD09N-R2_primers-pass.fastq \
    --2f BARCODE --coord sra
BuildConsensus.py -s HD09N-R1_primers-pass_pair-pass.fastq --bf BARCODE --pf PRIMER \
    --prcons 0.6 --maxerror 0.1 --maxgap 0.5 --outname HD09N-R1 --log BC1.log
BuildConsensus.py -s HD09N-R2_primers-pass_pair-pass.fastq --bf BARCODE \
    --maxerror 0.1 --maxgap 0.5 --outname HD09N-R2 --log BC2.log
PairSeq.py -1 HD09N-R1_consensus-pass.fastq -2 HD09N-R2_consensus-pass.fastq \
    --coord presto
AssemblePairs.py sequential -1 HD09N-R2_consensus-pass_pair-pass.fastq \
    -2 HD09N-R1_consensus-pass_pair-pass.fastq -r IMGT_Human_IG_V.fasta \
    --coord presto --rc tail --scanrev --1f CONSCOUNT --2f CONSCOUNT PRCONS \
    --aligner blastn --outname HD09N-C --log AP.log
MaskPrimers.py align -s HD09N-C_assemble-pass.fastq \
    -p AbSeq_Human_IG_InternalCRegion.fasta --maxlen 100 --maxerror 0.3 \
    --mode tag --revpr --skiprc --pf CREGION --outname HD09N-C --log MP3.log 
ParseHeaders.py collapse -s HD09N-C_primers-pass.fastq -f CONSCOUNT --act min
CollapseSeq.py -s HD09N-C_primers-pass_reheader.fastq -n 20 --inner \
    --uf CREGION --cf CONSCOUNT --act sum --outname HD09N-C
SplitSeq.py group -s HD09N-C_collapse-unique.fastq \
    -f CONSCOUNT --num 2 --outname HD09N-C
ParseHeaders.py table -s HD09N-C_atleast-2.fastq -f ID CREGION CONSCOUNT DUPCOUNT
ParseLog.py -l FS1.log FS2.log -f ID QUALITY
ParseLog.py -l MP1.log MP2.log MP3.log -f ID PRIMER BARCODE ERROR
ParseLog.py -l BC1.log BC2.log -f BARCODE SEQCOUNT CONSCOUNT PRCONS PRFREQ ERROR
ParseLog.py -l AP.log -f ID REFID LENGTH OVERLAP GAP ERROR IDENTITY

Download Commands

Quality control, UMI annotation and primer masking

Removal of low quality reads

Quality control begins with the identification and removal of low-quality reads using the quality subcommand of the FilterSeq tool. In this example, reads with mean Phred quality scores less than 20 (-q 20) are removed:

2
3
FilterSeq.py quality -s SRR4026043_1.fastq -q 20 --outname HD09N-R1 --log FS1.log
FilterSeq.py quality -s SRR4026043_2.fastq -q 20 --outname HD09N-R2 --log FS2.log

The ParseLog tool is then used to extract results from the FilterSeq logs into tab-delimited files:

29
ParseLog.py -l FS1.log FS2.log -f ID QUALITY

Extracting the following information from the log:

Field Description
ID Sequence name
QUALITY Quality score
UMI annotation and removal of primer regions

Next, the score subcommand of MaskPrimers is used to identify and remove the PCR primers for both reads. When dealing with Ig sequences, it is important to cut or mask the primers, as B cell receptors are subject to somatic hypermutation (the accumulation of point mutations in the DNA) and degenerate primer matches can look like mutations in downstream applications. The MaskPrimers tool is also used to annotate each read 2 sequence with the 17 nucleotide UMI that precedes the template switch site. (MaskPrimers score --barcode):

4
5
6
7
MaskPrimers.py score -s HD09N-R1_quality-pass.fastq -p AbSeq_R1_Human_IG_Primers.fasta \
    --start 0 --mode cut --outname HD09N-R1 --log MP1.log
MaskPrimers.py score -s HD09N-R2_quality-pass.fastq -p AbSeq_R2_TS.fasta \
    --start 17 --barcode --mode cut --maxerror 0.5 --outname HD09N-R2 --log MP2.log

To summarize these steps, the ParseLog tool is used to build a tab-delimited file from the MaskPrimers log:

30
ParseLog.py -l MP1.log MP2.log MP3.log -f ID PRIMER BARCODE ERROR

Containing the following information:

Field Description
ID Sequence name
PRIMER Primer name
BARCODE UMI sequence
ERROR Primer match error rate

Note

For this data, we are using the 5’RACE template switch sequences as proxy primers. We’ve set the match error rate extremely high (--maxerror 0.5) because an accurate match isn’t important. Mostly, we are just concerned with extracting the UMI barcode that precedes the template switch site.

Generation of UMI consensus sequences

Copying the UMI annotation across paired-end files

In this task, a single consensus sequence is constructed for each set of reads annotated with the same UMI barcode. As the UMI barcode is part of read 2, the BARCODE annotation identified by MaskPrimers must first be copied to the read 1 mate-pair of each read 2 sequence. Propogation of annotations between mate pairs is performed using PairSeq which also removes unpaired reads and ensures that paired reads are sorted in the same order across files:

8
9
PairSeq.py -1 HD09N-R1_primers-pass.fastq -2 HD09N-R2_primers-pass.fastq \
    --2f BARCODE --coord sra

Note

For both the PairSeq and AssemblePairs commands using the correct --coord argument is critical for matching mate-pairs. If this was raw data from Illumina, rather than data downloaded from SRA, then the appropriate argument would be --coord illumina.

Multiple alignment of UMI read groups

Before generating a consensus for a set of reads sharing a UMI barcode, the sequences must be properly aligned. Sequences may not be aligned if more than one PCR primer is identified in a UMI read group - leading to variations in the the start positions of the reads. Ideally, each set of reads originating from a single mRNA molecule should be amplified with the same primer. However, different primers in the multiplex pool may be incorporated into the same UMI read group during amplification if the primers are sufficiently similar. This type of primer misalignment can be corrected using the AlignSets tool.

This step is not required for this data, but if necessary for your data see the section on fixing UMI problems.

Generating UMI consensus reads

After alignment, a single consensus sequence is generated for each UMI barcode using BuildConsensus:

10
11
12
13
BuildConsensus.py -s HD09N-R1_primers-pass_pair-pass.fastq --bf BARCODE --pf PRIMER \
    --prcons 0.6 --maxerror 0.1 --maxgap 0.5 --outname HD09N-R1 --log BC1.log
BuildConsensus.py -s HD09N-R2_primers-pass_pair-pass.fastq --bf BARCODE \
    --maxerror 0.1 --maxgap 0.5 --outname HD09N-R2 --log BC2.log

To correct for UMI chemistry and sequencing errors, UMI read groups having high error statistics (mismatch rate from consensus) are removed by specifiying the --maxerror 0.1 threshold. Additional filtering of read 1 is carried out during this step by specifying the --prcons 0.6 threshold which: (a) removes individual sequences that do not share a common primer annotation with the majority of the set, (b) removes entire read groups which have ambiguous primer assignments, and (c) constructs a consensus primer assignment for each UMI.

Note

The --maxgap 0.5 argument tells BuildConsensus to use a majority rule to delete any gap positions which occur in more than 50% of the reads. The --maxgap argument is not really necessary for this example data set as we did not perform a multiple alignment of the UMI read groups. However, if you have performed an alignment, then use of --maxgap during consensus generation is highly recommended.

The ParseLog tool is then used to build a tab-delimited file contain the consensus results:

31
ParseLog.py -l BC1.log BC2.log -f BARCODE SEQCOUNT CONSCOUNT PRCONS PRFREQ ERROR

With the following annotations:

Field Description
BARCODE UMI sequence
SEQCOUNT Number of total reads in the UMI group
CONSCOUNT Number of reads used for the UMI consensus
PRCONS Consensus primer name
PRFREQ Frequency of primers in the UMI group
ERROR Average mismatch rate from consensus

Paired-end assembly of UMI consensus sequences

Syncronizing paired-end files

Following UMI consensus generation, the read 1 and read 2 files may again be out of sync due to differences in UMI read group filtering by BuildConsensus. To synchronize the reads another instance of PairSeq must be run, but without any annotation manipulation:

14
15
PairSeq.py -1 HD09N-R1_consensus-pass.fastq -2 HD09N-R2_consensus-pass.fastq \
    --coord presto
Assembling UMI consensus mate-pairs

Once the files have been synchronized, each paired-end UMI consensus sequence is assembled into a full length Ig sequence using the sequential subcommand of AssemblePairs. 5’RACE creates long amplicons which are not guaranteed to overlap when sequenced using a 2x300 kit. The sequential subcommand will first attempt de novo paired-end assembly. If that approach fails, it will attempt assembly guided by the V segment reference sequences specified by the -r argument. Mate-pairs that fail to overlap can thus be assembled, with the full length sequence containing an appropriate number of intervening between the two reads.

16
17
18
19
AssemblePairs.py sequential -1 HD09N-R2_consensus-pass_pair-pass.fastq \
    -2 HD09N-R1_consensus-pass_pair-pass.fastq -r IMGT_Human_IG_V.fasta \
    --coord presto --rc tail --scanrev --1f CONSCOUNT --2f CONSCOUNT PRCONS \
    --aligner blastn --outname HD09N-C --log AP.log

During assembly, the consensus primer annotation (PRCONS) from read 1 and the number of reads used to define the consensus sequence (CONSCOUNT) for both reads are propagated into the annotations of the full length Ig sequence (--1f CONSCOUNT --2f CONSCOUNT PRCONS.

ParseLog is then uses to extract the results from the AssemblePairs log into a tab-delimited file:

32
ParseLog.py -l AP.log -f ID REFID LENGTH OVERLAP GAP ERROR IDENTITY

Containing the following information:

ID REFID LENGTH OVERLAP GAP ERROR IDENTITY

Field Description
ID Sequence name (UMI)
REFID The reference sequence name for reference guided assembly
LENGTH Length of the assembled sequence
OVERLAP Length of the overlap between mate-pairs
GAP Length of the gap between mate-pairs
ERROR Mismatch rate of the overlapping region from de novo assembly
IDENTITY Identity score for the aligned region in reference guided assembly

Deduplication and filtering

Annotating sequences with the constant region

In the final stage of the workflow, the high-fidelity Ig repertoire is obtained by a series of filtering steps. First, an additional annotation defining the constant region is added to each sequence by aligning against the constant region sequence internal to the primer. This step is not strictly necessary, as the constant region primer is usually sufficient for isotype calls. However, primer calls may be inaccurate with some primer sets, so this additional confirmatory step is beneficial.

Internal constant regions are assigned using the align subcommand of MaskPrimers with a set of manually constructed sequences as the “primers” (-p AbSeq_Human_IG_InternalCRegion.fasta):

20
21
22
MaskPrimers.py align -s HD09N-C_assemble-pass.fastq \
    -p AbSeq_Human_IG_InternalCRegion.fasta --maxlen 100 --maxerror 0.3 \
    --mode tag --revpr --skiprc --pf CREGION --outname HD09N-C --log MP3.log 

Here we use a high error rate for the match (--maxerror 0.3), restrict the match to 100 nucleotides at the end of the sequence (--maxlen 100 --revpr --skiprc), and use mode the argument --mode tag to specify that the input sequence should not be modified. The aligned C-region name will be added to each sequence headers in the CREGION field (--pf CREGION):

Combining UMI read group size annotations

The annotation specifying the number of raw reads used to build each sequence (-f CONSCOUNT) is updated to be the minimum (--act min) of the forward and reverse reads using the collapse subcommand of ParseHeaders:

23
ParseHeaders.py collapse -s HD09N-C_primers-pass.fastq -f CONSCOUNT --act min
Removal of duplicate sequences

Second, duplicate nucleotide sequences are removed using the CollapseSeq tool with the requirement that duplicate sequences share the same constant region annotation (--uf CREGION). The duplicate removal step also removes sequences with a high number of interior N-valued nucleotides (-n 20 and --inner) and combines the read counts for each UMI read group (--cf CONSCOUNT and --act sum).

24
25
CollapseSeq.py -s HD09N-C_primers-pass_reheader.fastq -n 20 --inner \
    --uf CREGION --cf CONSCOUNT --act sum --outname HD09N-C
Filtering to sequences with at least two representative reads

Finally, unique sequences are filtered to those with at least 2 contributing sequences using the group subcommand of SplitSeq, by splitting the file on the CONSCOUNT annotation with a numeric threshold (-f CONSCOUNT and --num 2):

26
27
SplitSeq.py group -s HD09N-C_collapse-unique.fastq \
    -f CONSCOUNT --num 2 --outname HD09N-C
Creating an annotation table

For further analysis, the annotations of the final repertoire are then converted to into a table using the table subcommand of ParseHeaders:

28
ParseHeaders.py table -s HD09N-C_atleast-2.fastq -f ID CREGION CONSCOUNT DUPCOUNT

Output files

The final set of sequences, which serve as input to a V(D)J reference aligner (Eg, IMGT/HighV-QUEST or IgBLAST), and tables that can be plotted for quality control are:

File Description
HD09N-C_collapse-unique.fastq Total unique sequences
HD09N-C_atleast-2.fastq Unique sequences represented by at least 2 reads
HD09N-C_atleast-2_headers.tab Annotation table of the atleast-2 file
FS1_table.tab Table of the read 1 FilterSeq log
FS2_table.tab Table of the read 2 FilterSeq log
MP1_table.tab Table of the C-region primer MaskPrimers log
MP2_table.tab Table of the TS site MaskPrimers log
MP3_table.tab Table of the internal C-region MaskPrimers log
BC1_table.tab Table of the read 1 BuildConsensus log
BC2_table.tab Table of the read 2 BuildConsensus log
AP_table.tab Table of the AssemblePairs log

A number of other intermediate and log files are generated during the workflow, which allows easy tracking/reversion of processing steps. These files are not listed in the table above.

Fixing UMI Problems

Correcting misaligned V-segment primers and indels in UMI groups

Before generating a consensus for a set of reads sharing a UMI barcode, the sequences must be properly aligned. Sequences may not be aligned if more than one PCR primer is identified in a UMI read group - leading to variations in the the start positions of the reads. Ideally, each set of reads originating from a single mRNA molecule should be amplified with the same primer. However, different primers in the multiplex pool may be incorporated into the same UMI read group during amplification if the primers are sufficiently similar.

_images/Primer_Alignment.svg

Correction of misaligned sequences. (A) Discrepancies in the location of primer binding (colored bases, with primer name indicated to the left) may cause misalignment of sequences sharing a UMI. (B) Following multiple alignment of the reads the non-primer regions are correctly aligned and suitable for UMI consensus generation.

This type of primer misalignment can be corrected using one of two approaches using the AlignSets tool.

Correcting via multiple alignment

The first approach, which is conceptually simpler but computationally more expensive, is to perform a full multiple alignment of reach UMI read group using the muscle subcommand of AlignSets. The --bf BARCODE argument tells AlignSets to multiple align reads sharing the same BARCODE annotation. The --exec ~/bin/muscle is a pointer to where the MUSCLE executable is located:

AlignSets.py muscle -s reads.fastq --bf BARCODE --exec ~/bin/muscle

The above approach will also insert gaps into the sequences where an insertion/deletion has occured in the reads. As such, you will need to provide as reasonable gap character threshold to BuildConsensus, such as --maxgap 0.5, defining how you want to handle positions with gap characters when generating a UMI consensus sequence.

Note

Using the muscle subcommand, along with the --maxgap argument to BuildConsensus will also address issue with insertions/deletions in UMI read groups. Though, in UMI read groups with a sufficient number of reads consensus generation will resolve insertions/deletions without the need for multiple alignment, as any misaligned reads will simply be washed out by the majority. Whether to perform a multiple alignment prior to consensus generation is a matter of taste. A multiple alignment may improve consensus quality in small UMI read groups (eg, less than 4 sequences), but the extent to which small UMI read groups should be trusted is debatable.

Correcting via an offset table

The second approach will correct only the primer regions and will not address insertions/deletions within the sequence, but is much quicker to perform. The first step involves creation of a primer offset table using the table subcommand of AlignSets:

AlignSets.py table -p primers.fasta --exec ~/bin/muscle

Which performs a multiple alignment on sequences in primers.fasta (sequences shown in the primer alignment figure above) to generate a file containing a primer offset table:

primers_offsets-forward.tab
VP1    2
VP2    0
VP3    1

Then the offset table can be input into the offset subcommand of AlignSets to align the reads:

AlignSets.py offset -s reads.fastq -d primers_offsets-forward.tab \
    --bf BARCODE --pr VPRIMER --mode pad

In the above command we have specified the field containing the primer annotation using --pf VPRIMER and set the behavior of the tool to add gap characters to align the reads with the --mode pad argument. These options will generate the correction shown in (B) of the primer alignment figure above. Alternatively, we could have deleted unalign positions using the argument --mode cut.

Note

You may need to alter how the offset table is generated if you have used the --mode cut argument to MaskPrimers rather than --mode mask, as this will cause the ends of the primer regions, rather than the front, to be the cause of the ragged edges within the UMI read groups. For primers that have been cut you would add the --reverse argument to the table operation of AlignSets, which will create an offset table that is based on the tail end of the primers.

Dealing with insufficient UMI diversity

Due to errors in the UMI region and/or insufficient UMI length, UMI read groups are not always homogeneous with respect to the mRNA of origin. This can cause difficulties in generating a valid UMI consensus sequence. In most cases, the --prcons and --maxerror (or --maxdiv) arguments to BuildConsensus are sufficient to filter out invalid reads and/or entire invalid UMI groups. However, if there is significant nucleotide diversity within UMI groups due to insufficient UMI length or low UMI diversity, the set command of the ClusterSets tool can help correct for this. ClusterSets will cluster sequence by similarity and add an additional annotation dividing sequences within a UMI read group into sub-clusters:

ClusterSets.py set -s reads.fastq -f BARCODE -k CLUSTER --exec ~/bin/usearch

The above command will add an annotation to each sequence named CLUSTER (-k CLUSTER) containing a cluster identifier for each sequence within the UMI barcode group. The -f BARCODE argument specifies the UMI annotation and --exec ~/bin/usearch is a pointer to where the USEARCH executable is located. After assigning cluster annotations via ClusterSets, the BARCODE and CLUSTER fields can be merged using the copy operation of ParseHeaders:

ParseHeaders.py copy -s reads_cluster-pass.fastq -f BARCODE -k CLUSTER --act cat

Which will copy the UMI annotation (-f BARCODE) into the cluster annotation (-k CLUSTER) and concatenate them together (--act cat). Thus converting the annotations from:

>SEQ1|BARCODE=ATGTCG|CLUSTER=1
>SEQ2|BARCODE=ATGTCG|CLUSTER=2

To:

>SEQ1|BARCODE=ATGTCG|CLUSTER=1ATGTCG
>SEQ2|BARCODE=ATGTCG|CLUSTER=2ATGTCG

You may then specify --bf CLUSTER to BuildConsensus to tell it to generate UMI consensus sequences by UMI sub-cluster, rather than by UMI barcode annotation.

Combining split UMIs

Typically, a UMI barcode is attached to only one end of a paired-end mate-pair and can be copied to other read by a simple invocation of PairSeq. But in some cases, the UMI may be split such that there are two UMIs, each located on a different mate-pair. To deal with these sorts of UMIs, you would first employ PairSeq similarly to how you would in the single UMI case:

PairSeq.py -1 reads-1.fastq -2 reads-2.fastq --1f BARCODE --2f BARCODE \
    --coord illumina

The main difference from the single UMI case is that the BARCODE annotation is being simultaneously copied from read 1 to read 2 (--1f BARCODE) andf rom read 2 to read 1 (--2f BARCODE). This creates a set of annotations that look like:

>READ1|BARCODE=ATGTCGTT,GGCTAGTC
>READ2|BARCODE=ATGTCGTT,GGCTAGTC

Alternatively, these annotations can be combined upon copy using the --act cat argument:

PairSeq.py -1 reads-1.fastq -2 reads-2.fastq --1f BARCODE --2f BARCODE \
    --coord illumina --act cat

Which concatenates the two values in the BARCODE field, yielding UMI annotations suitable for input to BuildConsensus:

>READ1|BARCODE=ATGTCGTTGGCTAGTC
>READ2|BARCODE=ATGTCGTTGGCTAGTC

Compensating for errors in the UMI region

Depending on the protocol used for library preparation, PCR error and sequencing error can significantly affect UMI and sequence assignments. To account for this error, the following approach can be used.

Clustering UMI sequences

First, errors in the UMI region can be accounted for by reassigning UMI groups to new clusters of UMI sequence that may differ by one or more nucleotides. To identify the ideal threshold at which to cluster similar UMI sequences, EstimateError can be run on the UMI field (BARCODE):

EstimateError.py barcode -s reads.fastq -f BARCODE

The -f BARCODE defines the header annotation containing the UMI sequence. This outputs the following tables:

File Error profile
reads_distance-barcode.tab Distribution of pairwise hamming distances
reads_threshold-barcode.tab Recommended threshold

The value in the THRESHOLD column associated with the ALL row in reads_threshold-barcode.tab specifies a recommended threshold for clustering the UMI sequences.

Note

Subsampling at a depth to approximately 5,000 sequences is recommended to expedite this calculation. See the random task for an example of how to use SplitSeq to subsample sequence files.

The table specifies a threshold of 0.9 which will be used to cluster the UMI sequences via ClusterSets. The identity threshold is set via the argument --ident 0.9. Clustering will be performed on the sequences in the UMI annotation field (-f BARCODE) and UMI clusters will assigned to the annotation field INDEX_UMI via the argument -k INDEX_UMI:

ClusterSets.py barcode -s reads.fastq -f BARCODE -k INDEX_UMI --ident 0.9

Clustering V(D)J sequences

Next, sequences within these larger UMI clusters are clustered to avoid sequence collisions. Again, EstimateError is used to infer a clustering threshold, but instead of clustering UMI sequences the set subcommand is used to cluster the reads (V(D)J sequences) within the newly assigned UMI clusters (-f INDEX_UMI):

EstimateError.py set -s reads_cluster-pass.fastq -f INDEX_UMI

This outputs the following tables:

File Error profile
reads_cluster-pass_distance-set.tab Distribution of pairwise hamming distances
reads_cluster-pass_threshold-set.tab Recommended threshold

The value in the THRESHOLD column associated with the ALL row in reads_cluster-pass_threshold-set.tab specifies a recommended threshold for resolving collisions.

Note

Subsampling at a depth to approximately 5,000 sequences is recommended to expedite this calculation. See the random task for an example of how to use SplitSeq to subsample sequence files.

Using a recommended threshold of 0.8, V(D)J sequences are clustering in a similar way using the set subcommand of ClusterSets:

ClusterSets.py set -s reads_cluster-pass.fastq -f INDEX_UMI -k INDEX_SEQ --ident 0.8

Where the argument --ident 0.8 specifies the clustering threshold, -f INDEX_UMI defines the UMI cluster group to cluster within, and -k INDEX_SEQ defines the V(D)J sequence cluster annotation to add to the output headers.

Combining the UMI and V(D)J cluster annotations

Finally, new UMI groups can be generated by combining the two annotation fields generated during the clustering steps with the merge subcommand of ParseHeaders. The -f INDEX_UMI INDEX_SEQ argument defines the fields to combine and the -k INDEX_MERGE argument defines the new field that will contain the corrected UMI clusters used for consensus generation:

ParseHeaders.py merge -s reads_cluster-pass_cluster-pass.fastq -f INDEX_UMI INDEX_SEQ \
    -k INDEX_MERGE

This combined UMI-V(D)J sequence cluster annotation can then be specified as the barcode field to BuildConsensus using the --bf INDEX_MERGE argument.

Manipulating Annotations

The ParseHeaders tool provides a collection of methods for performing simple manipulations of sequence headers that are format in the pRESTO annotation scheme.

For converting sequence headers into the pRESTO format see the Import Task documentation.

Adding a sample annotation

Addition of annotation values is accomplished using the add subcommand of ParseHeaders:

ParseHeaders.py add -s reads.fastq -f SAMPLE -u A1

Which will add the annotation SAMPLE=A1 to each sequence of the input file.

Expanding and renaming annotations

By default, pRESTO will not delete annotations. If a sequence header already contains an annotation that a tool is trying to add it will not overwrite that annotation. Instead, it will append the annotation value to the values already present in a comma delimited form. For example, after two interations of MaskPrimers with the default primer field name (PRIMER), you will have an annotation in the following form, reflecting a match against primer VH3 in the first iteration and primer IGHM in the second:

PRIMER=VH3,IGHM

Separating these annotations into two annotations is accomplished via the expand subcommand of ParseHeaders:

ParseHeaders.py expand -s reads.fastq -f PRIMER

Resulting in the annotations:

PRIMER1=VH3|PRIMER2=IGHM

Which may then be renamed via the rename subcommand: expand subcommand of ParseHeaders:

ParseHeaders.py rename -s reads_reheader.fastq -f PRIMER1 PRIMER2 \
    -k VPRIMER CPRIMER

Copying, merging and collapsing annotations

Nested annotation can be generated using the copy or merge subcommands of ParseHeaders. The examples that follow will use the starting annotation:

UMI=ATGC|CELL=GGCC|COUNT=10,2

The UMI and CELL annotations can be combined into a single INDEX annotation using the following command:

ParseHeaders.py merge -s reads.fasta -f UMI CELL -k INDEX --delete
# result> COUNT=10,2|INDEX=ATGC,GGCC

Without the --delete argument the original UMI and CELL annotations would be kept in the header.

The nested annotation values can then be combined using the collapse subcommand to create various effects:

ParseHeaders.py collapse -s reads_reheader.fasta -f INDEX --act cat
# result> INDEX=ATGCGGCC

ParseHeaders.py collapse -s reads_reheader.fasta -f INDEX --act first
# result> INDEX=ATGC

ParseHeaders.py collapse -s reads_reheader.fasta -f COUNT --act sum
# result> COUNT=12

ParseHeaders.py collapse -s reads_reheader.fasta -f COUNT --act min
# result> COUNT=2

Where the --act argument specifies the type of collapse action to perform.

The copy subcommand is normally used to create duplicate annotations with different names, but will have a similar effect to the merge subcommand when the target is an existing field:

ParseHeaders.py copy -s reads.fasta -f UMI -k CELL
# result> UMI=ATGC|CELL=GGCC,ATGC|COUNT=10,2

Both the copy and merge subcommands have an --act argument which allows you to perform an action from the collapse subcommand in the same step as the copy or merge:

ParseHeaders.py merge -s reads.fasta -f UMI CELL -k INDEX --delete --act cat
# result> COUNT=10,2|INDEX=ATGCGGCC

ParseHeaders.py copy -s reads.fasta -f UMI -k CELL --act cat
# result> UMI=ATGC|CELL=GGCCATGC|COUNT=10,2

Deleting annotations

Unwanted annotations can be deleted using the delete subcommand of ParseHeaders:

ParseHeaders.py delete -s reads.fastq -f PRIMER

Which will remove the PRIMER field from each sequence header.

Miscellaneous Tasks

Importing data from SRA, ENA or GenBank into pRESTO

If you have download a data set from GenBank, SRA or ENA the format of the sequences headers are different from the raw Roche 454 and Illumina header format. As such, they may or may not be compatible with pRESTO, depending on how the headers have been modified by the sequence archive. The ConvertHeaders allow you to change incompatible header formats into the pRESTO format. For example, to convert from SRA or ENA headers the sra subcommand would be used:

ConvertHeaders.py sra -s reads.fastq

ConvertHeaders provides the following conversion subcommands:

Subcommand Formats Converted
generic Headers with an unknown annotation system
454 Roche 454
genbank NCBI GenBank and RefSeq
illumina Illumina HiSeq or MiSeq
imgt IMGT/GENE-DB
sra NCBI SRA or EMBL-EBI ENA

Reducing file size for submission to IMGT/HighV-QUEST

IMGT/HighV-QUEST currently limits the size of uploaded files to 500,000 sequences. To accomodate this limit, you can use the count subcommand of SplitSeq to divide your files into small pieces.

SplitSeq.py count -s reads.fastq -n 500000 --fasta

The -n 500000 argument sets the maximum number of sequences in each file and the --fasta tells the tool to output a FASTA, rather than FASTQ, formatted file.

Note

You can usually avoid the necessity of reducing file sizes by removing duplicate sequences first using the CollapseSeq tool.

Subsetting sequence files by annotation

The group subcommand of SplitSeq allows you to split one file into multiple files based on the values in a sequence annotation. For example, splitting one file with multiple SAMPLE annotations into separate files (one for each sample) would be accomplished by:

SplitSeq.py group -s reads.fastq -f SAMPLE

Which will create a set of files labelled SAMPLE-M1 and SAMPLE-M2, if samples are named M1 and M2.

If you wanted to split based on a numeric value, rather than a set of categorical values, then you would add the --num argument. SplitSeq would then create two files: one containing sequences with values less than the threshold specified by the --num argument and one file containing sequences with values greater than or equal to the threshold:

SplitSeq.py group -s reads.fastq -f DUPCOUNT --num 2

Which will create two files with the labels atleast-2 and under-2.

Random sampling from sequence files

The sample subcommand of SplitSeq may be used to generate a random sample from a sequence file or set of pair-end files. The example below will select a random sample of 1,000 sequences (-n 1000) which all contain the annotation SAMPLE=M1 (-f SAMPLE and -u M1):

SplitSeq.py sample -s reads.fastq -f SAMPLE -u M1 -n 1000

Performing an analogous sampling of Illumina paired-end reads would be accomplished using the samplepair subcommand:

SplitSeq.py samplepair -s reads.fastq -f SAMPLE -u M1 -n 1000 --coord illumina

Note

Both the -f and -n arguments will accept a list of values (eg, -n 1000 100 10), allowing you to sample multiple times from multiple files in one command.

Cleaning or removing poor quality sequences

Data sets can be cleaned using one or more invocations of FilterSeq, which provides multiple sequence quality control operations. Four subcommands remove sequences from the data that fail to meet some threshold: including length, (length), number of N or gap characters (missing), homopolymeric tract length (repeats), or mean Phred quality score (quality). Two subcommands modify sequences without removing them: trimqual truncates the sequences when the mean Phred quality scores decays under a threshold, and maskqual replaces positions with low Phred quality scores with N characters.

FilterSeq provides the following quality control subcommands:

Subcommand Operation
length Removes short sequences
missing Removes sequences with too many Ns or gaps
repeats Removes sequences with long homopolymeric tracts
quality Removes sequences with low mean quality scores
trimqual Truncates sequences where quality scores decay
maskqual Masks low quality positions

Assembling paired-end reads that do not overlap

The typical way to assemble paired-end reads is via de novo assembly using the align subcommand of AssemblePairs. However, some sequences with long CDR3 regions may fail to assemble due to insufficient, or completely absent, overlap between the mate-pairs. The reference or sequential subcommands can be used to assemble mate-pairs that do not overlap using the ungapped V-segment references sequences as a guide.

To handle such sequence in two separate steps, a normal align command would be performed first. The --failed argument is added so that the reads failing de novo alignment are output to separate files:

AssemblePairs.py align -1 reads-1.fastq -2 reads-2.fastq --rc tail \
    --coord illumina --failed -outname align

Then, the files labeled assemble-fail, along with the ungapped V-segment reference sequences (-r vref.fasta), would be input into the reference subcommand of AssemblePairs:

AssemblePairs.py reference -1 align-1_assemble-fail.fastq -2 align-2_assemble-fail.fastq \
     --rc tail -r vref.fasta --coord illumina --outname ref

This will result in two separate assemble-pass files - one from each step. You may process them separately or concatenate them together into a single file:

cat align_assemble-pass.fastq ref_assemble-pass.fastq > merged_assemble-pass.fastq

However, if you intend to processes them together, you may simplify this by perform both steps using the sequential subcommand, which will attempt de novo assembly followed by reference guided assembly if de novo assembly fails:

AssemblePairs.py sequential -1 reads-1.fastq -2 reads-2.fastq --rc tail \
    --coord illumina -r vref.fasta

Note

The sequences output by the reference or sequential subcommands may contain an appropriate length spacer of Ns between any mate-pairs that do not overlap. The :option:–fill <AssemblePairs reference –fill>`` argument may be specified to force AssemblePairs to insert the germline sequence into the missing positions, but this should be used with caution as the inserted sequence may not be biologically correct.

Assigning isotype annotations from the constant region sequence

MaskPrimers is usually used to remove primer regions and annotate sequences with primer identifiers. However, it can be used for any other case where you need to align a set of short sequences against the reads. One example alternate use is where you either do not know the C-region primer sequences or do not trust the primer region to provide an accurate isotype assignment.

If you build a FASTA file containing the reverse-complement of short sequences from the front of CH-1, then you can annotate the reads with these sequence in the same way you would C-region specific primers:

MaskPrimers.py align -s reads.fastq -p IGHC.fasta --maxlen 100 --maxerror 0.3 \
    --mode cut --revpr --pf C_CALL

Where --revpr tells MaskPrimers to reverse-complement the “primer” sequences and look for them at the end of the reads, --maxlen 100 restricts the search to the last 100 bp, --maxerror 0.3 allows for up to 30% mismatches, and -p IGHC.fasta specifies the file containing the CH-1 sequences. The name of the C-region will be added to the sequence headers as the C_CALL annotation, where the field name is specified by the --pf argument. An example CH-1 sequence file would look like:

>IGHD
CTGATATGATGGGGAACACATCCGGAGCCTTGGTGGGTGC
>IGHM
AGGAGACGAGGGGGAAAAGGGTTGGGGCGGATGCACTCCC
>IGHG
AGGGYGCCAGGGGGAAGACSGATGGGCCCTTGGTGGAAGC
>IGHA
MGAGGCTCAGCGGGAAGACCTTGGGGCTGGTCGGGGATGC
>IGHE
AGCGGGTCAAGGGGAAGACGGATGGGCTCTGTGTGGAGGC

Download IGHC.fasta

See also

Constant region reference sequences may be downloaded from IMGT and the sequence headers can be reformated using the imgt subcommand of ConvertHeaders. Note, you may need to clean-up the reference sequences a bit before running ConvertHeaders if you receive an error about duplicate sequence names (eg, remove duplicate allele with different artificial splicing). To cut and reverse-complement the constant region sequences use something like seqmagick.

Estimating sequencing and PCR error rates with UMI data

The EstimateError tool provides methods for estimating the combined PCR and sequencing error rates from large UMI read groups. The assumptions being, that consensus sequences generated from sufficiently large UMI read groups should be accurate representations of the true sequences, and that the rate of mismatches from consensus should therefore be an accurate estimate of the error rate in the data. However, this is not guaranteed to be true, hence this approach can only be considered an estimate of a data set’s error profile. The following command generates an error profile from UMI read groups with 50 or more sequences (-n 50), using a majority rule consensus sequence (--mode freq), and excluding UMI read groups with high nucleotide diversity (--maxdiv 0.1):

EstimateError.py -s reads.fastq -n 50 --mode freq --maxdiv 0.1

This generates the following tab-delimited files containing error rates broken down by various criteria:

File Error profile
reads_error-position.tab Error rates by read position
reads_error-quality.tab Error rates by quality score
reads_error-nucleotide.tab Error rates by nucleotide identity
reads_error-set.tab Error rates by UMI read group size

Commandline Usage

AlignSets

Multiple aligns input sequences by group

usage: AlignSets [--version] [-h]  ...
--version

show program’s version number and exit

-h, --help

show this help message and exit

output files:
align-pass
multiple aligned reads.
align-fail
raw reads failing multiple alignment.
offsets-forward
5’ offset table for input into offset subcommand.
offsets-reverse
3’ offset table for input into offset subcommand.
output annotation fields:
None

AlignSets muscle

Align sequence sets using muscle.

usage: AlignSets muscle [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                        [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                        [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                        [--fasta] [--delim DELIMITER DELIMITER DELIMITER]
                        [--nproc NPROC] [--bf BARCODE_FIELD] [--div]
                        [--exec ALIGNER_EXEC]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

--bf <barcode_field>

The annotation field containing barcode labels for sequence grouping.

--div

Specify to calculate nucleotide diversity of each set (average pairwise error rate).

--exec <aligner_exec>

The name or location of the muscle executable.

AlignSets offset

Align sequence sets using predefined 5’ offset.

usage: AlignSets offset [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                        [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                        [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                        [--fasta] [--delim DELIMITER DELIMITER DELIMITER]
                        [--nproc NPROC] [-d OFFSET_TABLE] [--bf BARCODE_FIELD]
                        [--pf PRIMER_FIELD] [--mode {pad,cut}] [--div]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

-d <offset_table>

The tab delimited file of offset tags and values.

--bf <barcode_field>

The annotation field containing barcode labels for sequence grouping.

--pf <primer_field>

The primer field to use for offset assignment.

--mode {pad,cut}

Specifies whether or align sequence by padding with gaps or by cutting the 5’ sequence to a common start position.

--div

Specify to calculate nucleotide diversity of each set (average pairwise error rate).

AlignSets table

Create a 5’ offset table by primer multiple alignment.

usage: AlignSets table [--version] [-h] [--outdir OUT_DIR]
                       [--outname OUT_NAME] [--failed]
                       [--delim DELIMITER DELIMITER DELIMITER] -p PRIMER_FILE
                       [-o OUT_FILE] [--reverse] [--exec ALIGNER_EXEC]
--version

show program’s version number and exit

-h, --help

show this help message and exit

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-p <primer_file>

A FASTA file containing primer sequences.

-o <out_file>

Explicit output file name. Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--reverse

If specified create a 3’ offset table instead

--exec <aligner_exec>

The name or location of the muscle executable.

AssemblePairs

Assembles paired-end reads into a single sequence

usage: AssemblePairs [--version] [-h]  ...
--version

show program’s version number and exit

-h, --help

show this help message and exit

output files:
assemble-pass
successfully assembled reads.
assemble-fail
raw reads failing paired-end assembly.
output annotation fields:
<user defined>
annotation fields specified by the –1f or –2f arguments.

AssemblePairs align

Assemble pairs by aligning ends.

usage: AssemblePairs align [--version] [-h] -1 SEQ_FILES_1 [SEQ_FILES_1 ...]
                           -2 SEQ_FILES_2 [SEQ_FILES_2 ...]
                           [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                           [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                           [--fasta] [--delim DELIMITER DELIMITER DELIMITER]
                           [--nproc NPROC]
                           [--coord {illumina,solexa,sra,454,presto}]
                           [--rc {tail,head,both,none}]
                           [--1f HEAD_FIELDS [HEAD_FIELDS ...]]
                           [--2f TAIL_FIELDS [TAIL_FIELDS ...]]
                           [--alpha ALPHA] [--maxerror MAX_ERROR]
                           [--minlen MIN_LEN] [--maxlen MAX_LEN] [--scanrev]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-1 <seq_files_1>

An ordered list of FASTA/FASTQ files containing head/primary sequences.

-2 <seq_files_2>

An ordered list of FASTA/FASTQ files containing tail/secondary sequences.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

--coord {illumina,solexa,sra,454,presto}

The format of the sequence identifier which defines shared coordinate information across paired ends.

--rc {tail,head,both,none}

Specify which read to reverse complement before stitching.

--1f <head_fields>

Specify annotation fields to copy from head records into assembled record.

--2f <tail_fields>

Specify annotation fields to copy from tail records into assembled record.

--alpha <alpha>

Significance threshold for de novo paired-end assembly.

--maxerror <max_error>

Maximum allowable error rate for de novo assembly.

--minlen <min_len>

Minimum sequence length to scan for overlap in de novo assembly.

--maxlen <max_len>

Maximum sequence length to scan for overlap in de novo assembly.

--scanrev

If specified, scan past the end of the tail sequence in de novo assembly to allow the head sequence to overhang the end of the tail sequence.

AssemblePairs join

Assemble pairs by concatenating ends.

usage: AssemblePairs join [--version] [-h] -1 SEQ_FILES_1 [SEQ_FILES_1 ...] -2
                          SEQ_FILES_2 [SEQ_FILES_2 ...]
                          [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                          [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                          [--fasta] [--delim DELIMITER DELIMITER DELIMITER]
                          [--nproc NPROC]
                          [--coord {illumina,solexa,sra,454,presto}]
                          [--rc {tail,head,both,none}]
                          [--1f HEAD_FIELDS [HEAD_FIELDS ...]]
                          [--2f TAIL_FIELDS [TAIL_FIELDS ...]] [--gap GAP]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-1 <seq_files_1>

An ordered list of FASTA/FASTQ files containing head/primary sequences.

-2 <seq_files_2>

An ordered list of FASTA/FASTQ files containing tail/secondary sequences.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

--coord {illumina,solexa,sra,454,presto}

The format of the sequence identifier which defines shared coordinate information across paired ends.

--rc {tail,head,both,none}

Specify which read to reverse complement before stitching.

--1f <head_fields>

Specify annotation fields to copy from head records into assembled record.

--2f <tail_fields>

Specify annotation fields to copy from tail records into assembled record.

--gap <gap>

Number of N characters to place between ends.

AssemblePairs reference

Assemble pairs by aligning reads against a reference database.

usage: AssemblePairs reference [--version] [-h] -1 SEQ_FILES_1
                               [SEQ_FILES_1 ...] -2 SEQ_FILES_2
                               [SEQ_FILES_2 ...]
                               [-o OUT_FILES [OUT_FILES ...]]
                               [--outdir OUT_DIR] [--outname OUT_NAME]
                               [--log LOG_FILE] [--failed] [--fasta]
                               [--delim DELIMITER DELIMITER DELIMITER]
                               [--nproc NPROC]
                               [--coord {illumina,solexa,sra,454,presto}]
                               [--rc {tail,head,both,none}]
                               [--1f HEAD_FIELDS [HEAD_FIELDS ...]]
                               [--2f TAIL_FIELDS [TAIL_FIELDS ...]] -r
                               REF_FILE [--minident MIN_IDENT]
                               [--evalue EVALUE] [--maxhits MAX_HITS] [--fill]
                               [--aligner {blastn,usearch}]
                               [--exec ALIGNER_EXEC] [--dbexec DB_EXEC]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-1 <seq_files_1>

An ordered list of FASTA/FASTQ files containing head/primary sequences.

-2 <seq_files_2>

An ordered list of FASTA/FASTQ files containing tail/secondary sequences.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

--coord {illumina,solexa,sra,454,presto}

The format of the sequence identifier which defines shared coordinate information across paired ends.

--rc {tail,head,both,none}

Specify which read to reverse complement before stitching.

--1f <head_fields>

Specify annotation fields to copy from head records into assembled record.

--2f <tail_fields>

Specify annotation fields to copy from tail records into assembled record.

-r <ref_file>

A FASTA file containing the reference sequence database.

--minident <min_ident>

Minimum identity of the assembled sequence required to call a valid reference guided assembly (between 0 and 1).

--evalue <evalue>

Minimum E-value for reference alignment for both the head and tail sequence.

--maxhits <max_hits>

Maximum number of hits from the reference alignment to check for matching head and tail sequence assignments.

--fill

Specify to change the behavior of inserted characters when the head and tail sequences do not overlap during reference guided assembly. If specified, this will result in inserted of the V region reference sequence instead of a sequence of Ns in the non-overlapping region. Warning: you could end up making chimeric sequences by using this option.

--aligner {blastn,usearch}

The local alignment tool to use. Must be one blastn (blast+ nucleotide) or usearch (ublast algorithm).

--exec <aligner_exec>

The name or location of the aligner executable file (blastn or usearch). Defaults to the name specified by the –aligner argument.

--dbexec <db_exec>

The name or location of the executable file that builds the reference database. This defaults to makeblastdb when blastn is specified to the –aligner argument, and usearch when usearch is specified.

AssemblePairs sequential

Assemble pairs by first attempting de novo assembly, then reference guided assembly.

usage: AssemblePairs sequential [--version] [-h] -1 SEQ_FILES_1
                                [SEQ_FILES_1 ...] -2 SEQ_FILES_2
                                [SEQ_FILES_2 ...]
                                [-o OUT_FILES [OUT_FILES ...]]
                                [--outdir OUT_DIR] [--outname OUT_NAME]
                                [--log LOG_FILE] [--failed] [--fasta]
                                [--delim DELIMITER DELIMITER DELIMITER]
                                [--nproc NPROC]
                                [--coord {illumina,solexa,sra,454,presto}]
                                [--rc {tail,head,both,none}]
                                [--1f HEAD_FIELDS [HEAD_FIELDS ...]]
                                [--2f TAIL_FIELDS [TAIL_FIELDS ...]]
                                [--alpha ALPHA] [--maxerror MAX_ERROR]
                                [--minlen MIN_LEN] [--maxlen MAX_LEN]
                                [--scanrev] -r REF_FILE [--minident MIN_IDENT]
                                [--evalue EVALUE] [--maxhits MAX_HITS]
                                [--fill] [--aligner {blastn,usearch}]
                                [--exec ALIGNER_EXEC] [--dbexec DB_EXEC]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-1 <seq_files_1>

An ordered list of FASTA/FASTQ files containing head/primary sequences.

-2 <seq_files_2>

An ordered list of FASTA/FASTQ files containing tail/secondary sequences.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

--coord {illumina,solexa,sra,454,presto}

The format of the sequence identifier which defines shared coordinate information across paired ends.

--rc {tail,head,both,none}

Specify which read to reverse complement before stitching.

--1f <head_fields>

Specify annotation fields to copy from head records into assembled record.

--2f <tail_fields>

Specify annotation fields to copy from tail records into assembled record.

--alpha <alpha>

Significance threshold for de novo paired-end assembly.

--maxerror <max_error>

Maximum allowable error rate for de novo assembly.

--minlen <min_len>

Minimum sequence length to scan for overlap in de novo assembly.

--maxlen <max_len>

Maximum sequence length to scan for overlap in de novo assembly.

--scanrev

If specified, scan past the end of the tail sequence in de novo assembly to allow the head sequence to overhang the end of the tail sequence.

-r <ref_file>

A FASTA file containing the reference sequence database.

--minident <min_ident>

Minimum identity of the assembled sequence required to call a valid reference guided assembly (between 0 and 1).

--evalue <evalue>

Minimum E-value for reference alignment for both the head and tail sequence.

--maxhits <max_hits>

Maximum number of hits from the reference alignment to check for matching head and tail sequence assignments.

--fill

Specify to change the behavior of inserted characters when the head and tail sequences do not overlap during reference guided assembly. If specified, this will result in inserted of the V region reference sequence instead of a sequence of Ns in the non-overlapping region. Warning: you could end up making chimeric sequences by using this option.

--aligner {blastn,usearch}

The local alignment tool to use. Must be one blastn (blast+ nucleotide) or usearch (ublast algorithm).

--exec <aligner_exec>

The name or location of the aligner executable file (blastn or usearch). Defaults to the name specified by the –aligner argument.

--dbexec <db_exec>

The name or location of the executable file that builds the reference database. This defaults to makeblastdb when blastn is specified to the –aligner argument, and usearch when usearch is specified.

BuildConsensus

Builds a consensus sequence for each set of input sequences

usage: BuildConsensus [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                      [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                      [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                      [--fasta] [--delim DELIMITER DELIMITER DELIMITER]
                      [--nproc NPROC] [-n MIN_COUNT] [--bf BARCODE_FIELD]
                      [-q MIN_QUAL] [--freq MIN_FREQ] [--maxgap MAX_GAP]
                      [--pf PRIMER_FIELD] [--prcons PRIMER_FREQ]
                      [--cf COPY_FIELDS [COPY_FIELDS ...]]
                      [--act {min,max,sum,set,majority} [{min,max,sum,set,majority} ...]]
                      [--dep] [--maxdiv MAX_DIVERSITY | --maxerror MAX_ERROR]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

-n <min_count>

The minimum number of sequences needed to define a valid consensus.

--bf <barcode_field>

Position of description barcode field to group sequences by.

-q <min_qual>

Consensus quality score cut-off under which an ambiguous character is assigned; does not apply when quality scores are unavailable.

--freq <min_freq>

Fraction of character occurrences under which an ambiguous character is assigned.

--maxgap <max_gap>

If specified, this defines a cut-off for the frequency of allowed gap values for each position. Positions exceeding the threshold are deleted from the consensus. If not defined, positions are always retained.

--pf <primer_field>

Specifies the field name of the primer annotations

--prcons <primer_freq>

Specify to define a minimum primer frequency required to assign a consensus primer, and filter out sequences with minority primers from the consensus building step.

--cf <copy_fields>

Specifies a set of additional annotation fields to copy into the consensus sequence annotations.

--act {min,max,sum,set,majority}

List of actions to take for each copy field which defines how each annotation will be combined into a single value. The actions “min”, “max”, “sum” perform the corresponding mathematical operation on numeric annotations. The action “set” combines annotations into a comma delimited list of unique values and adds an annotation named <FIELD>_COUNT specifying the count of each item in the set. The action “majority” assigns the most frequent annotation to the consensus annotation and adds an annotation named <FIELD>_FREQ specifying the frequency of the majority value.

--dep

Specify to calculate consensus quality with a non-independence assumption

--maxdiv <max_diversity>

Specify to calculate the nucleotide diversity of each read group (average pairwise error rate) and remove groups exceeding the given diversity threshold. Diversity is calculate for all positions within the read group, ignoring any character filtering imposed by the -q, –freq and –maxgap arguments. Mutually exclusive with –maxerror.

--maxerror <max_error>

Specify to calculate the error rate of each read group (rate of mismatches from consensus) and remove groups exceeding the given error threshold. The error rate is calculated against the final consensus sequence, which may include masked positions due to the -q and –freq arguments and may have deleted positions due to the –maxgap argument. Mutually exclusive with –maxdiv.

output files:
consensus-pass
consensus reads.
consensus-fail
raw reads failing consensus filtering criteria.
output annotation fields:
PRIMER
a comma delimited list of unique primer annotations found within the barcode read group.
PRCOUNT
a comma delimited list of the corresponding counts of unique primer annotations.
PRCONS
the majority primer within the barcode read group.
PRFREQ
the frequency of the majority primer.
CONSCOUNT
the count of reads within the barcode read group which contributed to the consensus sequence. This is the total size of the read group, minus sequence excluded due to user defined filtering criteria.

ClusterSets

Cluster sequences by group

usage: ClusterSets [--version] [-h]  ...
--version

show program’s version number and exit

-h, --help

show this help message and exit

output files:
cluster-pass
clustered reads.
cluster-fail
raw reads failing clustering.
output annotation fields:
CLUSTER
a numeric cluster identifier defining the within-group cluster.

ClusterSets all

Cluster all sequences regardless of annotation.

usage: ClusterSets all [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                       [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                       [--outname OUT_NAME] [--fasta]
                       [--delim DELIMITER DELIMITER DELIMITER] [--nproc NPROC]
                       [-k CLUSTER_FIELD] [--ident IDENT]
                       [--length LENGTH_RATIO] [--prefix CLUSTER_PREFIX]
                       [--cluster {usearch,vsearch,cd-hit-est}]
                       [--exec CLUSTER_EXEC] [--start SEQ_START]
                       [--end SEQ_END]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

-k <cluster_field>

The name of the output annotation field to add with the cluster information for each sequence.

--ident <ident>

The sequence identity threshold to use for clustering. Note, how identity is calculated is specific to the clustering application used.

--length <length_ratio>

The minimum allowed shorter/longer sequence length ratio allowed within a cluster. Setting this value to 1.0 will require identical length matches within clusters. A value of 0.0 will allow clusters containing any length of substring.

--prefix <cluster_prefix>

A string to use as the prefix for each cluster identifier. By default, cluster identifiers will be numeric values only.

--cluster {usearch,vsearch,cd-hit-est}

The clustering tool to use for assigning clusters. Must be one of usearch, vsearch or cd-hit-est. Note, for cd-hit-est the maximum memory limit is set to 3GB.

--exec <cluster_exec>

The name or path of the usearch, vsearch or cd-hit-est executable.

--start <seq_start>

The start of the region to be used for clustering. Together with –end, this parameter can be used to specify a subsequence of each read to use in the clustering algorithm.

--end <seq_end>

The end of the region to be used for clustering.

ClusterSets barcode

Cluster reads by clustering barcode sequences.

usage: ClusterSets barcode [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                           [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                           [--outname OUT_NAME] [--fasta]
                           [--delim DELIMITER DELIMITER DELIMITER]
                           [--nproc NPROC] [-k CLUSTER_FIELD] [--ident IDENT]
                           [--length LENGTH_RATIO] [--prefix CLUSTER_PREFIX]
                           [--cluster {usearch,vsearch,cd-hit-est}]
                           [--exec CLUSTER_EXEC] [-f BARCODE_FIELD]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

-k <cluster_field>

The name of the output annotation field to add with the cluster information for each sequence.

--ident <ident>

The sequence identity threshold to use for clustering. Note, how identity is calculated is specific to the clustering application used.

--length <length_ratio>

The minimum allowed shorter/longer sequence length ratio allowed within a cluster. Setting this value to 1.0 will require identical length matches within clusters. A value of 0.0 will allow clusters containing any length of substring.

--prefix <cluster_prefix>

A string to use as the prefix for each cluster identifier. By default, cluster identifiers will be numeric values only.

--cluster {usearch,vsearch,cd-hit-est}

The clustering tool to use for assigning clusters. Must be one of usearch, vsearch or cd-hit-est. Note, for cd-hit-est the maximum memory limit is set to 3GB.

--exec <cluster_exec>

The name or path of the usearch, vsearch or cd-hit-est executable.

-f <barcode_field>

The annotation field containing barcode sequences.

ClusterSets set

Cluster sequences within annotation sets.

usage: ClusterSets set [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                       [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                       [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                       [--fasta] [--delim DELIMITER DELIMITER DELIMITER]
                       [--nproc NPROC] [-k CLUSTER_FIELD] [--ident IDENT]
                       [--length LENGTH_RATIO] [--prefix CLUSTER_PREFIX]
                       [--cluster {usearch,vsearch,cd-hit-est}]
                       [--exec CLUSTER_EXEC] [-f SET_FIELD]
                       [--start SEQ_START] [--end SEQ_END]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

-k <cluster_field>

The name of the output annotation field to add with the cluster information for each sequence.

--ident <ident>

The sequence identity threshold to use for clustering. Note, how identity is calculated is specific to the clustering application used.

--length <length_ratio>

The minimum allowed shorter/longer sequence length ratio allowed within a cluster. Setting this value to 1.0 will require identical length matches within clusters. A value of 0.0 will allow clusters containing any length of substring.

--prefix <cluster_prefix>

A string to use as the prefix for each cluster identifier. By default, cluster identifiers will be numeric values only.

--cluster {usearch,vsearch,cd-hit-est}

The clustering tool to use for assigning clusters. Must be one of usearch, vsearch or cd-hit-est. Note, for cd-hit-est the maximum memory limit is set to 3GB.

--exec <cluster_exec>

The name or path of the usearch, vsearch or cd-hit-est executable.

-f <set_field>

The annotation field containing annotations, such as UMI barcode, for sequence grouping.

--start <seq_start>

The start of the region to be used for clustering. Together with –end, this parameter can be used to specify a subsequence of each read to use in the clustering algorithm.

--end <seq_end>

The end of the region to be used for clustering.

CollapseSeq

Removes duplicate sequences from FASTA/FASTQ files

usage: CollapseSeq [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                   [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                   [--outname OUT_NAME] [--log LOG_FILE] [--failed] [--fasta]
                   [--delim DELIMITER DELIMITER DELIMITER] [-n MAX_MISSING]
                   [--uf UNIQ_FIELDS [UNIQ_FIELDS ...]]
                   [--cf COPY_FIELDS [COPY_FIELDS ...]]
                   [--act {min,max,sum,set} [{min,max,sum,set} ...]] [--inner]
                   [--keepmiss] [--maxf MAX_FIELD | --minf MIN_FIELD]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-n <max_missing>

Maximum number of missing nucleotides to consider for collapsing sequences. A sequence will be considered undetermined if it contains too many missing nucleotides.

--uf <uniq_fields>

Specifies a set of annotation fields that must match for sequences to be considered duplicates.

--cf <copy_fields>

Specifies a set of annotation fields to copy into the unique sequence output.

--act {min,max,sum,set}

List of actions to take for each copy field which defines how each annotation will be combined into a single value. The actions “min”, “max”, “sum” perform the corresponding mathematical operation on numeric annotations. The action “set” collapses annotations into a comma delimited list of unique values.

--inner

If specified, exclude consecutive missing characters at either end of the sequence.

--keepmiss

If specified, sequences with more missing characters than the threshold set by the -n parameter will be written to the unique sequence output file with a DUPCOUNT=1 annotation. If not specified, such sequences will be written to a separate file.

--maxf <max_field>

Specify the field whose maximum value determines the retained sequence; mutually exclusive with –minf.

--minf <min_field>

Specify the field whose minimum value determines the retained sequence; mutually exclusive with –minf.

output files:
collapse-unique
unique sequences. Contains one representative from each set of duplicate sequences. The retained representative is determined by user defined criteria.
collapse-duplicate
raw reads which are duplicates of the sequences retained in the collapse-unique file.
collapse-undetermined
raw reads which were excluded from consideration due to having too many N characters in the sequence.
output annotation fields:
DUPCOUNT
total number of sequences within the set of duplicates for each retained unique sequence. Meaning, the copy number of each unique sequence within the data file.
<user defined>
annotation fields specified by the –cf parameter.

ConvertHeaders

Converts sequence headers to the pRESTO format

usage: ConvertHeaders [--version] [-h]  ...
--version

show program’s version number and exit

-h, --help

show this help message and exit

output files:
convert-pass
reads passing header conversion.
convert-fail
raw reads failing header conversion.
output annotation fields:
<format defined>
the annotation fields added are specific to the header format of the input file.

ConvertHeaders 454

Converts Roche 454 sequence headers.

usage: ConvertHeaders 454 [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                          [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                          [--outname OUT_NAME] [--failed] [--fasta]
                          [--delim DELIMITER DELIMITER DELIMITER]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

ConvertHeaders genbank

Converts NCBI GenBank and RefSeq
sequence headers.
usage: ConvertHeaders genbank [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                              [-o OUT_FILES [OUT_FILES ...]]
                              [--outdir OUT_DIR] [--outname OUT_NAME]
                              [--failed] [--fasta]
                              [--delim DELIMITER DELIMITER DELIMITER]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

ConvertHeaders generic

Converts sequence headers without a known
annotation system.
usage: ConvertHeaders generic [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                              [-o OUT_FILES [OUT_FILES ...]]
                              [--outdir OUT_DIR] [--outname OUT_NAME]
                              [--failed] [--fasta]
                              [--delim DELIMITER DELIMITER DELIMITER]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

ConvertHeaders illumina

Converts Illumina sequence headers.

usage: ConvertHeaders illumina [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                               [-o OUT_FILES [OUT_FILES ...]]
                               [--outdir OUT_DIR] [--outname OUT_NAME]
                               [--failed] [--fasta]
                               [--delim DELIMITER DELIMITER DELIMITER]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

ConvertHeaders imgt

Converts sequence headers output by
IMGT/GENE-DB.
usage: ConvertHeaders imgt [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                           [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                           [--outname OUT_NAME] [--failed] [--fasta]
                           [--delim DELIMITER DELIMITER DELIMITER] [--simple]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--simple

If specified, only the allele name, and no other annotations, will appear in the converted sequence header.

ConvertHeaders migec

Converts headers for consensus sequence generated
by the MIGEC tool.
usage: ConvertHeaders migec [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                            [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                            [--outname OUT_NAME] [--failed] [--fasta]
                            [--delim DELIMITER DELIMITER DELIMITER]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

ConvertHeaders sra

Converts NCBI SRA or EMBL-EBI ENA sequence headers.

usage: ConvertHeaders sra [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                          [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                          [--outname OUT_NAME] [--failed] [--fasta]
                          [--delim DELIMITER DELIMITER DELIMITER]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

EstimateError

Calculates annotation set error rates

usage: EstimateError [--version] [-h]  ...
--version

show program’s version number and exit

-h, --help

show this help message and exit

output files:
error-position
estimated error by read position.
error-quality
estimated error by the quality score assigned within the input file.
error-nucleotide
estimated error by nucleotide.
error-set
estimated error by annotation set size.
distance-set
pairwise hamming distances by annotation set.
threshold-set
thresholds from pairwise hamming distances for annotation sets.
distance-barcode
estimated error by pairwise hamming distances
threshold-barcode
thresholds from pairwise hamming distances for clustering barcodes
output fields:
POSITION
read position with base zero indexing.
Q
Phred quality score.
OBSERVED
observed nucleotide value.
REFERENCE
consensus nucleotide for the barcode read group.
SET_COUNT
barcode read group size.
REPORTED_Q
mean Phred quality score reported within the input file for the given position, quality score, nucleotide or read group.
MISMATCHES
count of observed mismatches from consensus for the given position, quality score, nucleotide or read group.
OBSERVATIONS
total count of observed values for each position, quality score, nucleotide or read group size.
ERROR
estimated error rate.
EMPIRICAL_Q
estimated error rate converted to a Phred quality score.
ALL
histogram (count) of all pairwise distance distribution.
DTN
histogram (count) of distance to nearest distribution.
DISTANCE
length normalized hamming distance.

EstimateError barcode

Calculates pairwise distance metrics of barcode sequences.

usage: EstimateError barcode [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                             [--outdir OUT_DIR] [--outname OUT_NAME]
                             [--delim DELIMITER DELIMITER DELIMITER]
                             [-f BARCODE_FIELD]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-f <barcode_field>

The name of the barcode field.

EstimateError set

Estimates error statistics within annotation sets.

usage: EstimateError set [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                         [--outdir OUT_DIR] [--outname OUT_NAME]
                         [--log LOG_FILE]
                         [--delim DELIMITER DELIMITER DELIMITER]
                         [--nproc NPROC] [-f SET_FIELD] [-n MIN_COUNT]
                         [--mode {freq,qual}] [-q MIN_QUAL] [--freq MIN_FREQ]
                         [--maxdiv MAX_DIVERSITY]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

-f <set_field>

The name of the annotation field to group sequences by

-n <min_count>

The minimum number of sequences needed to consider a set

--mode {freq,qual}

Specifies which method to use to determine the consensus sequence. The “freq” method will determine the consensus by nucleotide frequency at each position and assign the most common value. The “qual” method will weight values by their quality scores to determine the consensus nucleotide at each position.

-q <min_qual>

Consensus quality score cut-off under which an ambiguous character is assigned.

--freq <min_freq>

Fraction of character occurrences under which an ambiguous character is assigned.

--maxdiv <max_diversity>

Specify to calculate the nucleotide diversity of each read group (average pairwise error rate) and exclude groups which exceed the given diversity threshold.

FilterSeq

Filters sequences in FASTA/FASTQ files

usage: FilterSeq [--version] [-h]  ...
--version

show program’s version number and exit

-h, --help

show this help message and exit

output files:
<command>-pass
reads passing filtering operation and modified accordingly, where <command> is the name of the filtering operation that was run.
<command>-fail
raw reads failing filtering criteria, where <command> is the name of the filtering operation.
output annotation fields:
None

FilterSeq length

Filters reads by length.

usage: FilterSeq length [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                        [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                        [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                        [--fasta] [--nproc NPROC] [-n MIN_LENGTH] [--inner]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

-n <min_length>

Minimum sequence length to retain.

--inner

If specified exclude consecutive missing characters at either end of the sequence.

FilterSeq maskqual

Masks low quality positions.

usage: FilterSeq maskqual [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                          [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                          [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                          [--fasta] [--nproc NPROC] [-q MIN_QUAL]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

-q <min_qual>

Quality score threshold.

FilterSeq missing

Filters reads by N or gap character count.

usage: FilterSeq missing [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                         [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                         [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                         [--fasta] [--nproc NPROC] [-n MAX_MISSING] [--inner]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

-n <max_missing>

Threshold for fraction of gap or N nucleotides.

--inner

If specified exclude consecutive missing characters at either end of the sequence.

FilterSeq quality

Filters reads by quality score.

usage: FilterSeq quality [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                         [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                         [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                         [--fasta] [--nproc NPROC] [-q MIN_QUAL] [--inner]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

-q <min_qual>

Quality score threshold.

--inner

If specified exclude consecutive missing characters at either end of the sequence.

FilterSeq repeats

Filters reads by consecutive nucleotide repeats.

usage: FilterSeq repeats [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                         [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                         [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                         [--fasta] [--nproc NPROC] [-n MAX_REPEAT] [--missing]
                         [--inner]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

-n <max_repeat>

Threshold for fraction of repeating nucleotides.

--missing

If specified count consecutive gap and N characters ‘ in addition to {A,C,G,T}.

--inner

If specified exclude consecutive missing characters at either end of the sequence.

FilterSeq trimqual

Trims sequences by quality score decay.

usage: FilterSeq trimqual [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                          [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                          [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                          [--fasta] [--nproc NPROC] [-q MIN_QUAL]
                          [--win WINDOW] [--reverse]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

-q <min_qual>

Quality score threshold.

--win <window>

Nucleotide window size for moving average calculation.

--reverse

Specify to trim the head of the sequence rather than the tail.

MaskPrimers

Removes primers and annotates sequences with primer and barcode identifiers

usage: MaskPrimers [--version] [-h]  ...
--version

show program’s version number and exit

-h, --help

show this help message and exit

output files:
mask-pass
processed reads with successful primer matches.
mask-fail
raw reads failing primer identification.
output annotation fields:
SEQORIENT
the orientation of the output sequence. Either F (input) or RC (reverse complement of input).
PRIMER
name of the best primer match.
BARCODE
the sequence preceding the primer match. Only output when the –barcode flag is specified.

MaskPrimers align

Find primer matches using pairwise local alignment.

usage: MaskPrimers align [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                         [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                         [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                         [--fasta] [--delim DELIMITER DELIMITER DELIMITER]
                         [--nproc NPROC] -p PRIMER_FILE [--maxerror MAX_ERROR]
                         [--maxlen MAX_LEN] [--gap GAP_PENALTY GAP_PENALTY]
                         [--revpr] [--skiprc] [--mode {cut,mask,trim,tag}]
                         [--barcode] [--bf BARCODE_FIELD] [--pf PRIMER_FIELD]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

-p <primer_file>

A FASTA file containing primer sequences.

--maxerror <max_error>

Maximum allowable error rate.

--maxlen <max_len>

Length of the sequence window to scan for primers.

--gap <gap_penalty>

A list of two positive values defining the gap open and gap extension penalties for aligning the primers. Note: the error rate is calculated as the percentage of mismatches from the primer sequence with gap penalties reducing the match count accordingly; this may lead to error rates that differ from strict mismatch percentage when gaps are present in the alignment.

--revpr

Specify to match the tail-end of the sequence against the reverse complement of the primers. This also reverses the behavior of the –maxlen argument, such that the search window begins at the tail-end of the sequence.

--skiprc

Specify to prevent checking of sample reverse complement sequences.

--mode {cut,mask,trim,tag}

Specifies the action to take with the primer sequence. The “cut” mode will remove both the primer region and the preceding sequence. The “mask” mode will replace the primer region with Ns and remove the preceding sequence. The “trim” mode will remove the region preceding the primer, but leave the primer region intact. The “tag” mode will leave the input sequence unmodified.

--barcode

Specify to annotate reads sequences with barcode sequences (unique molecular identifiers) found preceding the primer.

--bf <barcode_field>

Name of the barcode annotation field.

--pf <primer_field>

Name of the annotation field containing the primer name.

MaskPrimers extract

Remove and annotate a fixed sequence region.

usage: MaskPrimers extract [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                           [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                           [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                           [--fasta] [--delim DELIMITER DELIMITER DELIMITER]
                           [--nproc NPROC] [--start START] --len LENGTH
                           [--revpr] [--mode {cut,mask,trim,tag}] [--barcode]
                           [--bf BARCODE_FIELD] [--pf PRIMER_FIELD]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

--start <start>

The starting position of the sequence region to extract.

--len <length>

The length of the sequence to extract.

--revpr

Specify to extract from the tail end of the sequence with the start position relative to the end of the sequence.

--mode {cut,mask,trim,tag}

Specifies the action to take with the sequence region. The “cut” mode will remove the region. The “mask” mode will replace the specified region with Ns. The “trim” mode will remove the sequence preceding the specified region, but leave the region intact. The “tag” mode will leave the input sequence unmodified.

--barcode

Specify to remove the sequence preceding the extracted region and annotate the read with that sequence.

--bf <barcode_field>

Name of the barcode annotation field.

--pf <primer_field>

Name of the annotation field containing the extracted sequence region.

MaskPrimers score

Find primer matches by scoring primers at a fixed position.

usage: MaskPrimers score [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                         [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                         [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                         [--fasta] [--delim DELIMITER DELIMITER DELIMITER]
                         [--nproc NPROC] -p PRIMER_FILE [--start START]
                         [--maxerror MAX_ERROR] [--revpr]
                         [--mode {cut,mask,trim,tag}] [--barcode]
                         [--bf BARCODE_FIELD] [--pf PRIMER_FIELD]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

-p <primer_file>

A FASTA file containing primer sequences.

--start <start>

The starting position of the primer.

--maxerror <max_error>

Maximum allowable error rate.

--revpr

Specify to match the tail-end of the sequence against the reverse complement of the primers. This also reverses the behavior of the –start argument, such that start position is relative to the tail-end of the sequence.

--mode {cut,mask,trim,tag}

Specifies the action to take with the primer sequence. The “cut” mode will remove both the primer region and the preceding sequence. The “mask” mode will replace the primer region with Ns and remove the preceding sequence. The “trim” mode will remove the region preceding the primer, but leave the primer region intact. The “tag” mode will leave the input sequence unmodified.

--barcode

Specify to annotate reads sequences with barcode sequences (unique molecular identifiers) found preceding the primer.

--bf <barcode_field>

Name of the barcode annotation field.

--pf <primer_field>

Name of the annotation field containing the primer name.

PairSeq

Sorts and matches sequence records with matching coordinates across files

usage: PairSeq [--version] [-h] -1 SEQ_FILES_1 [SEQ_FILES_1 ...] -2
               SEQ_FILES_2 [SEQ_FILES_2 ...] [--outdir OUT_DIR]
               [--outname OUT_NAME] [--failed] [--fasta]
               [--delim DELIMITER DELIMITER DELIMITER]
               [--1f FIELDS_1 [FIELDS_1 ...]] [--2f FIELDS_2 [FIELDS_2 ...]]
               [--act {min,max,sum,set,cat}]
               [--coord {illumina,solexa,sra,454,presto}]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-1 <seq_files_1>

An ordered list of FASTA/FASTQ files containing head/primary sequences.

-2 <seq_files_2>

An ordered list of FASTA/FASTQ files containing tail/secondary sequences.

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--1f <fields_1>

The annotation fields to copy from file 1 records into file 2 records. If a copied annotation already exists in a file 2 record, then the annotations copied from file 1 will be added to the front of the existing annotation.

--2f <fields_2>

The annotation fields to copy from file 2 records into file 1 records. If a copied annotation already exists in a file 1 record, then the annotations copied from file 2 will be added to the end of the existing annotation.

--act {min,max,sum,set,cat}

The collapse actions to take on all fields copied between files to combine duplicate fields into a single value. The actions “min”, “max”, “sum” perform the corresponding mathematical operation on numeric annotations. The action “set” collapses annotations into a comma delimited list of unique values. The action “cat” concatenates the values together into a single string. Only applies if the field already exists in the header before being copying from the other file.

--coord {illumina,solexa,sra,454,presto}

The format of the sequence identifier which defines shared coordinate information across mate pairs.

output files:
pair-pass
successfully paired reads with modified annotations.
pair-fail
raw reads that could not be assigned to a mate-pair.
output annotation fields:
<user defined>
annotation fields specified by the –1f or –2f arguments.

ParseHeaders

Parses pRESTO annotations in FASTA/FASTQ sequence headers

usage: ParseHeaders [--version] [-h]  ...
--version

show program’s version number and exit

-h, --help

show this help message and exit

output files:
reheader-pass
reads passing annotation operation and modified accordingly.
reheader-fail
raw reads failing annotation operation.
headers
tab delimited table of the selected annotations.
output annotation fields:
<user defined>
annotation fields specified by the -f argument.

ParseHeaders add

Adds field/value pairs to header annotations

usage: ParseHeaders add [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                        [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                        [--outname OUT_NAME] [--failed] [--fasta]
                        [--delim DELIMITER DELIMITER DELIMITER] -f FIELDS
                        [FIELDS ...] -u VALUES [VALUES ...]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-f <fields>

List of fields to add.

-u <values>

List of values to add for each field.

ParseHeaders collapse

Collapses header annotations with multiple entries

usage: ParseHeaders collapse [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                             [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                             [--outname OUT_NAME] [--failed] [--fasta]
                             [--delim DELIMITER DELIMITER DELIMITER] -f FIELDS
                             [FIELDS ...] --act
                             {min,max,sum,first,last,set,cat}
                             [{min,max,sum,first,last,set,cat} ...]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-f <fields>

List of fields to collapse.

--act {min,max,sum,first,last,set,cat}

List of actions to take for each field defining how each annotation will be combined into a single value. The actions “min”, “max”, “sum” perform the corresponding mathematical operation on numeric annotations. The actions “first” and “last” choose the value from the corresponding position in the annotation. The action “set” collapses annotations into a comma delimited list of unique values. The action “cat” concatenates the values together into a single string.

ParseHeaders copy

Copies header annotation fields

usage: ParseHeaders copy [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                         [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                         [--outname OUT_NAME] [--failed] [--fasta]
                         [--delim DELIMITER DELIMITER DELIMITER] -f FIELDS
                         [FIELDS ...] -k NAMES [NAMES ...]
                         [--act {min,max,sum,first,last,set,cat} [{min,max,sum,first,last,set,cat} ...]]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-f <fields>

List of fields to copy.

-k <names>

List of names for each copied field. If the new field is already present, the copied field will be merged into the existing field.

--act {min,max,sum,first,last,set,cat}

List of collapse actions to take on each new field following the copy operation defining how each annotation will be combined into a single value. The actions “min”, “max”, “sum” perform the corresponding mathematical operation on numeric annotations. The actions “first” and “last” choose the value from the corresponding position in the annotation. The action “set” collapses annotations into a comma delimited list of unique values. The action “cat” concatenates the values together into a single string.

ParseHeaders delete

Deletes fields from header annotations

usage: ParseHeaders delete [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                           [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                           [--outname OUT_NAME] [--failed] [--fasta]
                           [--delim DELIMITER DELIMITER DELIMITER] -f FIELDS
                           [FIELDS ...]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-f <fields>

List of fields to delete.

ParseHeaders expand

Expands annotation fields with multiple values

usage: ParseHeaders expand [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                           [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                           [--outname OUT_NAME] [--failed] [--fasta]
                           [--delim DELIMITER DELIMITER DELIMITER] -f FIELDS
                           [FIELDS ...] [--sep SEPARATOR]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-f <fields>

List of fields to expand.

--sep <separator>

The character separating each value in the fields.

ParseHeaders merge

Merge multiple annotations fields into a single field

usage: ParseHeaders merge [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                          [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                          [--outname OUT_NAME] [--failed] [--fasta]
                          [--delim DELIMITER DELIMITER DELIMITER] -f FIELDS
                          [FIELDS ...] -k NAME [--act {min,max,sum,set,cat}]
                          [--delete]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-f <fields>

List of fields to merge.

-k <name>

Name for the merged field. If the new field is already present, the merged fields will be merged into the existing field.

--act {min,max,sum,set,cat}

List of collapse actions to take on the new field following the merge defining how to combine the annotations into a single value. The actions “min”, “max”, “sum” perform the corresponding mathematical operation on numeric annotations. The action “set” collapses annotations into a comma delimited list of unique values. The action “cat” concatenates the values together into a single string.

--delete

If specified, delete the field that were merged from the output header.

ParseHeaders rename

Renames header annotation fields

usage: ParseHeaders rename [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                           [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                           [--outname OUT_NAME] [--failed] [--fasta]
                           [--delim DELIMITER DELIMITER DELIMITER] -f FIELDS
                           [FIELDS ...] -k NAMES [NAMES ...]
                           [--act {min,max,sum,first,last,set,cat} [{min,max,sum,first,last,set,cat} ...]]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-f <fields>

List of fields to rename.

-k <names>

List of new names for each field. If the new field is already present, the renamed field will be merged into the existing field and the old field will be deleted.

--act {min,max,sum,first,last,set,cat}

List of collapse actions to take on each new field following the rename operation defining how each annotation will be combined into a single value. The actions “min”, “max”, “sum” perform the corresponding mathematical operation on numeric annotations. The actions “first” and “last” choose the value from the corresponding position in the annotation. The action “set” collapses annotations into a comma delimited list of unique values. The action “cat” concatenates the values together into a single string.

ParseHeaders table

Writes sequence headers to a table

usage: ParseHeaders table [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                          [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                          [--outname OUT_NAME] [--failed]
                          [--delim DELIMITER DELIMITER DELIMITER] -f FIELDS
                          [FIELDS ...]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-f <fields>

List of fields to collect. The sequence identifier may be specified using the hidden field name “ID”.

ParseLog

Parses records in the console log of pRESTO modules

usage: ParseLog [--version] [-h] [-o OUT_FILES [OUT_FILES ...]]
                [--outdir OUT_DIR] [--outname OUT_NAME]
                [--delim DELIMITER DELIMITER DELIMITER] -l RECORD_FILES
                [RECORD_FILES ...] -f FIELDS [FIELDS ...]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-l <record_files>

List of log files to parse.

-f <fields>

List of fields to collect. The sequence identifier may be specified using the hidden field name “ID”.

output files:
table
tab delimited table of the selected annotations.
output annotation fields:
<user defined>
annotation fields specified by the -f argument.

SplitSeq

Sorts, samples and splits FASTA/FASTQ sequence files

usage: SplitSeq [--version] [-h]  ...
--version

show program’s version number and exit

-h, --help

show this help message and exit

output files:
part<part>
reads partitioned by count, where <part> is the partition number.
<field>-<value>
reads partitioned by annotation <field> and <value>.
under-<number>
reads partitioned by numeric threshold where the annotation value is strictly less than the threshold <number>.
atleast-<number>
reads partitioned by numeric threshold where the annotation value is greater than or equal to the threshold <number>.
sorted
reads sorted by annotation value.
sorted-part<part>
reads sorted by annotation value and partitioned by count, where <part> is the partition number.
sample<i>-n<count>
randomly sampled reads where <i> is a number specifying the sampling instance and <count> is the number of sampled reads.
selected
reads passing selection criteria.
output annotation fields:
None

SplitSeq count

Splits sequences files by number of records.

usage: SplitSeq count [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                      [--outdir OUT_DIR] [--outname OUT_NAME] [--fasta] -n
                      MAX_COUNT
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--fasta

Specify to force output as FASTA rather than FASTQ.

-n <max_count>

Maximum number of sequences in each new file

SplitSeq group

Splits sequences files by annotation.

usage: SplitSeq group [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                      [--outdir OUT_DIR] [--outname OUT_NAME] [--fasta]
                      [--delim DELIMITER DELIMITER DELIMITER] -f FIELD
                      [--num THRESHOLD]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-f <field>

Annotation field to split sequence files by

--num <threshold>

Specify to define the split field as numeric and group sequences by value.

SplitSeq sample

Randomly samples from unpaired sequences files.

usage: SplitSeq sample [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                       [--outdir OUT_DIR] [--outname OUT_NAME] [--fasta]
                       [--delim DELIMITER DELIMITER DELIMITER] -n MAX_COUNT
                       [MAX_COUNT ...] [-f FIELD] [-u VALUES [VALUES ...]]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-n <max_count>

Maximum number of sequences to sample from each file, field or annotation set. The default behavior, without the -f argument, is to sample from the complete set of sequences in the input file.

-f <field>

The annotation field for sampling criteria. If the -u argument is not also specified, then sampling will be performed for each unique annotation value in the declared field separately.

-u <values>

If specified, sampling will be restricted to sequences that contain one of the declared annotation values in the specified field. Requires the -f argument.

SplitSeq samplepair

Randomly samples from paired-end sequences files.

usage: SplitSeq samplepair [--version] [-h] -1 SEQ_FILES_1 [SEQ_FILES_1 ...]
                           -2 SEQ_FILES_2 [SEQ_FILES_2 ...] [--outdir OUT_DIR]
                           [--outname OUT_NAME] [--fasta]
                           [--delim DELIMITER DELIMITER DELIMITER] -n
                           MAX_COUNT [MAX_COUNT ...] [-f FIELD]
                           [-u VALUES [VALUES ...]]
                           [--coord {illumina,solexa,sra,454,presto}]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-1 <seq_files_1>

An ordered list of FASTA/FASTQ files containing head/primary sequences.

-2 <seq_files_2>

An ordered list of FASTA/FASTQ files containing tail/secondary sequences.

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-n <max_count>

Maximum number of paired sequences to sample from each set of files, fields or annotations. The default behavior, without the -f argument, is to sample from the complete set of paired sequences in the input files.

-f <field>

The annotation field for sampling criteria. If the -u argument is not also specified, then sampling will be performed for each unique annotation value in the declared field separately.

-u <values>

If specified, sampling will be restricted to sequences that contain one of the declared annotation values in the specified field. Requires the -f argument.

--coord {illumina,solexa,sra,454,presto}

The format of the sequence identifier which defines shared coordinate information across paired read files.

SplitSeq select

Selects sequences from sequence files by annotation.

usage: SplitSeq select [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                       [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                       [--outname OUT_NAME] [--fasta]
                       [--delim DELIMITER DELIMITER DELIMITER] -f FIELD
                       [-u VALUE_LIST [VALUE_LIST ...] | -t VALUE_FILE]
                       [--not]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-f <field>

The annotation field for selection criteria.

-u <value_list>

A list of values to select for in the specified field. Mutually exclusive with -t.

-t <value_file>

A tab delimited file specifying values to select for in the specified field. The file must be formatted with the given field name in the header row. Values will be taken from that column. Mutually exclusive with -u.

--not

If specified, will perform negative matching. Meaning, sequences will be selected if they fail to match for all specified values.

SplitSeq sort

Sorts sequences files by annotation.

usage: SplitSeq sort [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                     [--outdir OUT_DIR] [--outname OUT_NAME] [--fasta]
                     [--delim DELIMITER DELIMITER DELIMITER] -f FIELD
                     [-n MAX_COUNT] [--num]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-f <field>

The annotation field to sort sequences by.

-n <max_count>

Maximum number of sequences in each new file.

--num

Specify to define the sort field as numeric rather than textual.

UnifyHeaders

Unifies annotation fields based on grouping scheme

usage: UnifyHeaders [--version] [-h]  ...
--version

show program’s version number and exit

-h, --help

show this help message and exit

output files:
unify-pass
Reads passing annotation filtering or consensus.
unify-fail
Reading failing filtering.
output annotation fields:
<user defined>
annotation fields specified by the -f and -k arguments.

UnifyHeaders consensus

Reassign fields to consensus values.

usage: UnifyHeaders consensus [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                              [-o OUT_FILES [OUT_FILES ...]]
                              [--outdir OUT_DIR] [--outname OUT_NAME]
                              [--log LOG_FILE] [--failed] [--fasta]
                              [--delim DELIMITER DELIMITER DELIMITER]
                              [--nproc NPROC] [-f SET_FIELD] [-k UNIFY_FIELD]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

-f <set_field>

The annotation field containing annotations, such as the UMI barcode, for sequence grouping.

-k <unify_field>

The name of the annotation field to find a consensus for per each sequence group.

UnifyHeaders delete

Delete sequences with differing field values.

usage: UnifyHeaders delete [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                           [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                           [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                           [--fasta] [--delim DELIMITER DELIMITER DELIMITER]
                           [--nproc NPROC] [-f SET_FIELD] [-k UNIFY_FIELD]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

-f <set_field>

The annotation field containing annotations, such as the UMI barcode, for sequence grouping.

-k <unify_field>

The name of the annotation field to find a consensus for per each sequence group.

API

presto.Annotation

Annotation functions

presto.Annotation.addHeader(header, fields, values, delimiter=('|', '=', ', '))

Adds fields and values to a sequence header

Parameters:
  • header – an annotation dictionary returned by parseAnnotation.
  • fields – the list of fields to add or append to.
  • values – the list of annotation values to add for each field.
  • delimiter – a tuple of delimiters for (fields, values, value lists).
Returns:

modified header dictionary.

Return type:

dict

presto.Annotation.annotationConsensus(seq_iter, field, delimiter=('|', '=', ', '))

Calculate a consensus annotation for a set of sequences

Parameters:
  • seq_iter – an iterator or list of SeqRecord objects
  • field – the annotation field to take a consensus of
  • delimiter – a tuple of delimiters for (annotations, field/values, value lists)
Returns:

Dictionary with keys

set containing a list of unique annotation values, count containing annotation counts, cons containing the consensus annotation, freq containing the majority annotation frequency

Return type:

dict

presto.Annotation.collapseAnnotation(ann_dict, action, fields=None, delimiter=('|', '=', ', '))

Collapses multiple annotations into new single annotations for each field

Parameters:
  • ann_dict – dictionary of field/value pairs
  • action – collapse action to take; one of {min, max, sum, first, last, set, cat}
  • fields – subset of ann_dict to _collapse; if None, collapse all but the ID field
  • delimiter – Tuple of delimiters for (fields, values, value lists)
Returns:

Modified field dictionary

Return type:

OrderedDict

presto.Annotation.collapseHeader(header, fields, actions, delimiter=('|', '=', ', '))

Collapses a sequence header

Parameters:
  • header – an annotation dictionary returned by parseAnnotation.
  • fields – the list of fields to collapse.
  • actions – the list of collapse action take; one of (max, min, sum, first, last, set, cat) for each field.
  • delimiter – a tuple of delimiters for (fields, values, value lists).
Returns:

modified header dictionary.

Return type:

dict

presto.Annotation.convert454Header(desc)

Parses 454 headers into the pRESTO format

Parameters:desc (str) – a sequence description string.
Returns:a dictionary of header field and value pairs.
Return type:dict

Examples

New style 454 header:

@<accession> <length=##>
@GXGJ56Z01AE06X length=222

Old style 454 header:

@<rank_x_y> <length=##> <uaccno=accession>
@000034_0199_0169 length=437 uaccno=GNDG01201ARRCR
presto.Annotation.convertGenbankHeader(desc, delimiter=('|', '=', ', '))

Converts GenBank and RefSeq headers into the pRESTO format

Parameters:
  • desc (str) – a sequence description string.
  • delimiter (tuple) – a tuple of delimiters for (fields, values, value lists).
Returns:

a dictionary of header field and value pairs.

Return type:

dict

Examples

New style GenBank header:

<accession>.<version> <description>
>CM000663.2 Homo sapiens chromosome 1, GRCh38 reference primary assembly

Old style GenBank header:

gi|<GI record number>|<dbsrc>|<accession>.<version>|<description>
>gi|568336023|gb|CM000663.2| Homo sapiens chromosome 1, GRCh38 reference primary assembly
presto.Annotation.convertGenericHeader(desc, delimiter=('|', '=', ', '))

Converts any header to the pRESTO format

Parameters:
  • desc (str) – a sequence description string.
  • delimiter (tuple) – a tuple of delimiters for (fields, values, value lists).
Returns:

a dictionary of header field and value pairs.

Return type:

dict

presto.Annotation.convertIMGTHeader(desc, simple=False)

Converts germline headers from IMGT/GENE-DB into the pRESTO format

Parameters:
  • desc (str) – a sequence description string.
  • simple (bool) – if True then the header will be converted to only the allele name.
Returns:

a dictionary of header field and value pairs.

Return type:

dict

Examples

IMGT header:

>X60503|IGHV1-18*02|Homo sapiens|F|V-REGION|142..417|276 nt|1| | | | |276+24=300|partial in 3'| |

Header contains 15 fields separated by | (http://imgt.org/genedb):

  1. IMGT/LIGM-DB accession number(s).
  2. Gene and allele name.
  3. Species.
  4. Functionality.
  5. Exon(s), region name(s), or extracted label(s).
  6. Start and end positions in the IMGT/LIGM-DB accession number(s).
  7. Number of nucleotides in the IMGT/LIGM-DB accession number(s).
  8. Codon start, or ‘NR’ (not relevant) for non coding labels and out-of-frame pseudogenes.
  9. Number of nucleotides added in 5' compared to the corresponding label extracted from IMGT/LIGM-DB.
  10. Number of nucleotides added or removed in 3' compared to the corresponding label extracted from IMGT/LIGM-DB.
  11. Number of added, deleted, and/or substituted nucleotides to correct sequencing errors, or ‘not corrected’ if non corrected sequencing errors.
  12. Number of amino acids (AA). This field indicates that the sequence is in amino acids.
  13. Number of characters in the sequence. Nucleotides (or AA) plus IMGT gaps.
  14. Partial (if it is).
  15. Reverse complementary (if it is).
presto.Annotation.convertIlluminaHeader(desc)

Converts Illumina headers into the pRESTO format

Parameters:desc (str) – a sequence description string.
Returns:a dictionary of header field and value pairs.
Return type:dict

Examples

New style Illumina header:

@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read number>:<is filtered>:<control number>:<index sequence>
@MISEQ:132:000000000-A2F3U:1:1101:14340:1555 2:N:0:ATCACG

Old style Illumina header:

@<instrument>:<flowcell lane>:<tile>:<x-pos>:<y-pos>#<index sequence>/<read number>
@HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
presto.Annotation.convertMIGECHeader(desc)

Parses headers from the MIGEC tool into the pRESTO format

Parameters:desc (str) – a sequence description string.
Returns:a dictionary of header field and value pairs.
Return type:dict

Examples

MIGEC header:

@MIG UMI:<UMI sequence>:<consensus read count>
@MIG UMI:TCGGCCAACAAA:8
presto.Annotation.convertSRAHeader(desc)

Parses NCBI SRA or EMBL-EBI ENA headers into the pRESTO format

Parameters:desc (str) – a sequence description string.
Returns:a dictionary of header field and value pairs.
Return type:dict

Examples

Header from fastq-dump --split-files:

@<accession>.<spot> <original sequence description> <length=#>
@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
@SRR1383326.1 1 length=250

Header from fastq-dump --split-files -I:

@<accession>.<spot>.<read number> <original sequence description> <length=#>
@SRR1383326.1.1 1 length=250

Header from ENA:

@<accession>.<spot> <original sequence description>
@ERR220397.1 HKSQ1MM01DXT2W/3
@ERR346596.1 BS-DSFCONTROL04:4:000000000-A3F0Y:1:1101:12758:1640/1
@ERR346596.1 BS-DSFCONTROL04:4:000000000-A3F0Y:1:1101:12758:1640/2
presto.Annotation.copyHeader(header, fields, names, actions=None, delimiter=('|', '=', ', '))

Copies fields in a sequence header

Parameters:
  • header – an annotation dictionary returned by parseAnnotation.
  • fields – a list of the field names to copy.
  • names – a list of the new field names.
  • actions – the list of collapse action take after the copy; one of (max, min, sum, first, last, set, cat) for each field.
  • delimiter – a tuple of delimiters for (fields, values, value lists).
Returns:

modified header dictionary.

Return type:

dict

presto.Annotation.deleteHeader(header, fields, delimiter=('|', '=', ', '))

Deletes fields from a sequence header

Parameters:
  • header – an annotation dictionary returned by parseAnnotation.
  • fields – the list of fields to delete.
  • delimiter – a tuple of delimiters for (fields, values, value lists).
Returns:

modified header dictionary

Return type:

dict

presto.Annotation.expandHeader(header, fields, separator=', ', delimiter=('|', '=', ', '))

Splits and annotation value into separate fields in a sequence header

Parameters:
  • header – an annotation dictionary returned by parseAnnotation.
  • fields – the field to split.
  • separator – the delimiter to split the values by.
  • delimiter – a tuple of delimiters for (fields, values, value lists).
Returns:

modified header dictionary.

Return type:

dict

presto.Annotation.flattenAnnotation(ann_dict, delimiter=('|', '=', ', '))

Converts annotations from a dictionary to a FASTA/FASTQ sequence description

Parameters:
  • ann_dict – Dictionary of field/value pairs
  • delimiter – Tuple of delimiters for (fields, values, value lists)
Returns:

Formatted sequence description string

Return type:

str

presto.Annotation.getAnnotationValues(seq_iter, field, unique=False, delimiter=('|', '=', ', '))

Gets the set of unique annotation values in a sequence set

Parameters:
  • seq_iter – Iterator or list of SeqRecord objects
  • field – Annotation field to retrieve values for
  • unique – If True return a list of only the unique values; if False return a list of all values
  • delimiter – Tuple of delimiters for (fields, values, value lists)
Returns:

List of values for the field

Return type:

list

presto.Annotation.getCoordKey(header, coord_type='presto', delimiter=('|', '=', ', '))

Return the coordinate identifier for a sequence description

Parameters:
  • header – Sequence header string
  • coord_type – Sequence header format; one of [‘illumina’, ‘solexa’, ‘sra’, ‘454’, ‘presto’]; if unrecognized type or None return sequence ID.
  • delimiter – Tuple of delimiters for (fields, values, value lists)
Returns:

Coordinate identifier as a string

Return type:

str

presto.Annotation.mergeAnnotation(ann_dict_1, ann_dict_2, prepend=False, delimiter=('|', '=', ', '))

Merges non-ID field annotations from one field dictionary into another

Parameters:
  • ann_dict_1 – Dictionary of field/value pairs to append to
  • ann_dict_2 – Dictionary of field/value pairs to merge with ann_dict_2
  • prepend – If True then add ann_dict_2 values to the front of any ann_dict_1 values that are already present, rather than the default behavior of appending ann_dict_2 values.
  • delimiter – Tuple of delimiters for (fields, values, value lists)
Returns:

Modified ann_dict_1 dictonary of field/value pairs

Return type:

OrderedDict

presto.Annotation.mergeHeader(header, fields, name, action=None, delete=False, delimiter=('|', '=', ', '))

Merges fields in a sequence header

Parameters:
  • header – an annotation dictionary returned by parseAnnotation.
  • fields – a list of the field names to merge.
  • name – the name of the new field.
  • delete – if True delete the merged fields.
  • actions – the list of collapse action take after the merge one of (max, min, sum, first, last, set, cat).
  • delimiter – a tuple of delimiters for (fields, values, value lists)
Returns:

modified header dictionary.

Return type:

dict

presto.Annotation.parseAnnotation(record, fields=None, delimiter=('|', '=', ', '))

Extracts annotations from a FASTA/FASTQ sequence description

Parameters:
  • record – Description string to extract annotations from
  • fields – List of fields to subset the return dictionary to; if None return all fields
  • delimiter – a tuple of delimiters for (fields, values, value lists)
Returns:

An OrderedDict of field/value pairs

Return type:

OrderedDict

presto.Annotation.parseLog(record)

Parses an pRESTO log record

Parameters:record (str) – a string of lines representing a log record including newline characters.
Returns:parsed log contain field and values pairs as a dictionary.
Return type:collections.OrderedDict
presto.Annotation.renameAnnotation(ann_dict, old_field, new_field, delimiter=('|', '=', ', '))

Renames an annotation and merges annotations if the new name already exists

Parameters:
  • ann_dict – Dictionary of field/value pairs
  • old_field – Old field name
  • new_field – New field name
  • delimiter – Tuple of delimiters for (fields, values, value lists)
Returns:

Modified fields dictonary

Return type:

OrderedDict

presto.Annotation.renameHeader(header, fields, names, actions=None, delimiter=('|', '=', ', '))

Renames fields in a sequence header

Parameters:
  • header – an annotation dictionary returned by parseAnnotation.
  • fields – a list of the current field names.
  • names – a list of the new field names.
  • actions – the list of collapse action take after the rename; one of (max, min, sum, first, last, set, cat) for each field.
  • delimiter – a tuple of delimiters for (fields, values, value lists).
Returns:

modified header dictionary.

Return type:

dict

presto.Applications

External application wrappers

presto.Applications.makeBlastnDb(ref_file, db_exec='makeblastdb')

Makes a ublast database file

Parameters:
  • ref_file – the path to the reference database file
  • db_exec – the path to the makeblastdb executable
Returns:

(name and location of the database, handle of the tempfile.TemporaryDirectory)

Return type:

tuple

presto.Applications.makeUBlastDb(ref_file, db_exec='usearch')

Makes a ublast database file

Parameters:
  • ref_file – path to the reference database file.
  • db_exec – path to the usearch executable.
Returns:

(location of the database, handle of the tempfile.NamedTemporaryFile)

Return type:

tuple

presto.Applications.runBlastn(seq, database, evalue=1e-05, max_hits=100, aligner_exec='blastn')

Aligns a sequence against a reference database using BLASTN

Parameters:
  • seq – a list of SeqRecord objects to align.
  • database – the path and name of the blastn database.
  • evalue – the E-value cut-off.
  • maxhits – the maximum number of hits returned.
  • aligner_exec – the path to the blastn executable.
Returns:

Alignment results.

Return type:

pandas.DataFrame

presto.Applications.runCDHit(seq_list, ident=0.9, length_ratio=0.0, seq_start=0, seq_end=None, max_memory=3000, threads=1, cluster_exec='cd-hit-est')

Cluster a set of sequences using CD-HIT

Parameters:
  • seq_list (list) – a list of SeqRecord objects to align.
  • ident (float) – the sequence identity cutoff to be passed to cd-hit-est.
  • length_ratio (float) – cd-hit-est parameter defining the minimum short/long length ratio allowed within a cluster.
  • seq_start (int) – the start position to trim sequences at before clustering.
  • seq_end (int) – the end position to trim sequences at before clustering.
  • max_memory (int) – cd-hit-est max memory limit (Mb)
  • threads (int) – number of threads for cd-hit-est.
  • cluster_exec (str) – the path to the cd-hit-est executable.
Returns:

{cluster id: list of sequence ids}.

Return type:

dict

presto.Applications.runMuscle(seq_list, aligner_exec='muscle')

Multiple aligns a set of sequences using MUSCLE

Parameters:
  • seq_list – a list of SeqRecord objects to align
  • aligner_exec – the MUSCLE executable
Returns:

Multiple alignment results.

Return type:

Bio.Align.MultipleSeqAlignment

presto.Applications.runUBlast(seq, database, evalue=1e-05, max_hits=100, aligner_exec='usearch')

Aligns a sequence against a reference database using the usearch_local algorithm of USEARCH

Parameters:
  • seq – a list of SeqRecord objects to align.
  • database – the path to the ublast database or a fasta file.
  • evalue – the E-value cut-off.
  • maxhits – the maximum number of hits returned.
  • aligner_exec – the path to the usearch executable.
Returns:

Alignment results.

Return type:

pandas.DataFrame

presto.Applications.runUClust(seq_list, ident=0.9, length_ratio=0.0, seq_start=0, seq_end=None, threads=1, cluster_exec='usearch')

Cluster a set of sequences using the UCLUST algorithm from USEARCH

Parameters:
  • seq_list (list) – a list of SeqRecord objects to align.
  • ident (float) – the sequence identity cutoff to be passed to usearch.
  • length_ratio (float) – usearch parameter defining the minimum short/long length ratio allowed within a cluster.
  • seq_start (int) – the start position to trim sequences at before clustering.
  • seq_end (int) – the end position to trim sequences at before clustering.
  • threads (int) – number of threads for usearch.
  • cluster_exec (str) – the path to the usearch executable.
Returns:

{cluster id: list of sequence ids}.

Return type:

dict

presto.Commandline

Commandline interface functions

class presto.Commandline.CommonHelpFormatter(prog, indent_increment=2, max_help_position=24, width=None)

Bases: argparse.RawDescriptionHelpFormatter, argparse.ArgumentDefaultsHelpFormatter

Custom argparse.HelpFormatter

presto.Commandline.checkArgs(parser)

Checks that arguments have been provided and prints help if they have not.

Parameters:parser – An argparse.ArgumentParser defining the commandline arguments.
Returns:True if arguments are present. Prints help and exits if not.
Return type:boolean
presto.Commandline.getCommonArgParser(seq_in=True, seq_out=True, seq_paired=False, db_in=False, db_out=False, out_file=True, failed=True, log=True, annotation=True, multiproc=False, add_help=True)

Defines an ArgumentParser object with common pRESTO arguments

Parameters:
  • seq_in (bool) – if True include sequence input arguments.
  • seq_out (bool) – if True include sequence output arguments.
  • seq_paired (bool) – if True defined paired-end sequence input and output arguments.
  • db_in (bool) – if True include tab-delimited database input arguments.
  • db_out (bool) – if True include tab-delimited database output arguments.
  • out_file (bool) – if True add explicit output file name arguments.
  • failed (bool) – If True include arguments for output of failed results.
  • log (bool) – If True include log arguments.
  • annotation (bool) – If True include annotation arguments.
  • multiproc (bool) – If True include multiprocessing arguments.
  • add_help (bool) – If True add help and version arguments.
Returns:

ArgumentParser object.

Return type:

argparse.ArgumentParser

presto.Commandline.parseCommonArgs(args, in_arg=None, in_types=None)

Checks common arguments from getCommonArgParser and transforms output options to a dictionary

Parameters:
  • args – Argument Namespace defined by ArgumentParser.parse_args
  • in_arg – String defining a non-standard input file argument to verify; by default [‘db_files’, ‘seq_files’, ‘seq_files_1’, ‘seq_files_2’, ‘primer_file’] are supported in that order
  • in_types – List of types (file extensions as strings) to allow for files in file_arg if None do not check type
Returns:

Dictionary copy of args with output arguments embedded in the dictionary out_args

Return type:

dict

presto.IO

File I/O and logging functions

presto.IO.countSeqFile(seq_file)

Counts the records in FASTA/FASTQ files

Parameters:seq_file – FASTA or FASTQ file containing sample sequences
Returns:Count of records in the sequence file
Return type:int
presto.IO.countSeqSets(seq_file, field='BARCODE', delimiter=('|', '=', ', '))

Identifies sets of sequences with the same ID field

Parameters:
  • seq_file – FASTA or FASTQ file containing sample sequences
  • field – Annotation field containing set IDs
  • delimiter – Tuple of delimiters for (fields, values, value lists)
Returns:

Count of unit set IDs in the sequence file

Return type:

int

presto.IO.getFileType(filename)

Determines the type of a file by file extension

Parameters:filename – Filename
Returns:String defining the sequence type for SeqIO operations
Return type:str
presto.IO.getOutputHandle(in_file, out_label=None, out_dir=None, out_name=None, out_type=None)

Opens an output file handle

Parameters:
  • in_file – Input filename
  • out_label – Text to be inserted before the file extension; if None do not add a label
  • out_type – the file extension of the output file; if None use input file extension
  • out_dir – the output directory; if None use directory of input file
  • out_name – the short filename to use for the output file; if None use input file short name
Returns:

File handle

Return type:

file

presto.IO.printCount(current, step, start_time=None, task=None, end=False)

Prints a progress bar to standard out

Parameters:
  • current (int) – count of completed tasks.
  • step (int) – an int defining the progress increment to print at.
  • start_time (time.time) – task start time returned by time.time(); if None do not add run time to progress
  • task (str) – name of task to display.
  • end (bool) – if True print final log (add newline).
presto.IO.printDebug(message, debug=True)

Prints a debug message to standard error

Parameters:
  • message (str) – message.
  • debug (bool) – if True print the message.
presto.IO.printError(message, exit=True)

Prints an error to standard error and exits

Parameters:
  • message (str) – error message.
  • exit (bool) – if True exit after the message.
presto.IO.printLog(record, handle=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>, inset=None)

Formats a dictionary into a log string

Parameters:
  • record – a dict or OrderedDict of field names mapping to values.
  • handle – the file handle to write the log to; if None do not write to file.
  • inset – minimum field name inset; if None automatically space field names.
Returns:

Formatted multi-line string in the log format.

Return type:

str

presto.IO.printMessage(message, start_time=None, end=False, width=25)

Prints a progress message to standard out

Parameters:
  • message – Current task message
  • start_time – task start time returned by time.time(); if None do not add run time to progress
  • end – If True print final message (add newline)
  • width – Maximum number of characters for messages
presto.IO.printProgress(current, total, step, start_time=None, task=None, end=False)

Prints a progress bar to standard out

Parameters:
  • current (int) – count of completed tasks.
  • total (int) – total task count.
  • step (float) – float defining the fractional progress increment to print at.
  • start_time (float) – task start time returned by time.time(); if None do not add run time to progress
  • task (str) – name of task to display.
  • end (bool) – if True print final log (add newline).
presto.IO.printWarning(message)

Prints a warning to standard error

Parameters:message (str) – warning message.
presto.IO.readPrimerFile(primer_file)

Processes primer sequences from file

Parameters:primer_file (str) – name of the FASTA file containing primer sequences.
Returns:Dictionary mapping primer ID to sequence.
Return type:dict
presto.IO.readReferenceFile(ref_file)

Create a dictionary of cleaned and ungapped reference sequences.

Parameters:ref_file – reference sequences in fasta format.
Returns:
cleaned and ungapped reference sequences;
with the key as the sequence ID and value as a Bio.SeqRecord for each reference sequence.
Return type:dict
presto.IO.readSeqFile(seq_file, index=False, key_func=None)

Reads FASTA/FASTQ files

Parameters:
  • seq_file – FASTA or FASTQ file containing sample sequences
  • index – If True return a dictionary from SeqIO.index(); if False return an iterator from SeqIO.parse()
  • key_func – the key_function argument to pass to SeqIO.index if index=True
Returns:

an interator of SeqRecords if index=False. A dict if True.

Return type:

iter

presto.Multiprocessing

Multiprocessing functions

class presto.Multiprocessing.SeqData(id, data)

Bases: object

Class defining sequence data objects for worker processes

id

unique identifier

data

single object or a list of data objects.

valid

if True data is suitable for processing.

__bool__()

Boolean evaluation

Returns:True if the valid attribute is True
Return type:bool
__len__()

Length evaluation

Returns:number of objects in the data attribute.
Return type:int
class presto.Multiprocessing.SeqResult(id, data)

Bases: object

Class defining sequence result objects for collector processes

id

unique identifier

data

single unprocessed object or a list of unprocessed data objects.

results

single processed object or a list of processed data objects.

valid

if True processing was successful.

log

dictionary containing the processing log.

__bool__()

Boolean evaluation

Returns:True if the valid attribute is True.
Return type:bool
__len__()

Length evaluation

Returns:number of objects in the results attribute.
Return type:int
data_count

Data length

Returns:number of objects in the data attribute.
Return type:int
presto.Multiprocessing.collectPairQueue(alive, result_queue, collect_queue, seq_file_1, seq_file_2, label, out_file=None, out_args={'delimiter': ('|', '=', ', '), 'failed': True, 'log_file': None, 'out_dir': None, 'out_name': None, 'out_type': None, 'separator': ', '})

Pulls from results queue, assembles results and manages log and file IO

Parameters:
  • alive – a multiprocessing.Value boolean controlling whether processing continues; when False function returns.
  • result_queue – a multiprocessing.Queue holding worker results.
  • collect_queue – a multiprocessing.Queue holding collector return values.
  • seq_file_1 – the first sequence file name.
  • seq_file_2 – the second sequence file name.
  • label – task label used to tag the output files.
  • out_file – output file name. Automatically generated from the input file if None.
  • out_args – common output argument dictionary from parseCommonArgs.
Returns:

adds a dictionary of {log: log object, out_files: output file names} to collect_queue.

Return type:

None

presto.Multiprocessing.collectSeqQueue(alive, result_queue, collect_queue, seq_file, label, index_field=None, out_file=None, out_args={'delimiter': ('|', '=', ', '), 'failed': True, 'log_file': None, 'out_dir': None, 'out_name': None, 'out_type': None, 'separator': ', '})

Pulls from results queue, assembles results and manages log and file IO

Parameters:
  • alive – a multiprocessing.Value boolean controlling whether processing continues; when False function returns.
  • result_queue – Multiprocessing.Queue holding worker results.
  • collect_queue – Multiprocessing.Queue to store collector return values.
  • seq_file – sample sequence file name.
  • label – task label used to tag the output files.
  • out_file – output file name. Automatically generated from the input file if None.
  • out_args – Common output argument dictionary from parseCommonArgs.
  • index_field – Field defining set membership for sequence sets if None data queue contained individual records.
Returns:

Adds a dictionary with key value pairs to collect_queue containing

’log’ defining a log object, ‘out_files’ defining the output file names

Return type:

None

presto.Multiprocessing.feedPairQueue(alive, data_queue, seq_file_1, seq_file_2, coord_type='presto', delimiter=('|', '=', ', '))

Feeds the data queue with sequence pairs for processQueue processes

Parameters:
  • alive – a multiprocessing.Value boolean controlling whether processing continues; when False function returns
  • data_queue – an multiprocessing.Queue to hold data for processing
  • seq_file_1 – the name of sequence file 1
  • seq_file_2 – the name of sequence file 2
  • coord_type – the sequence header format
  • delimiter – a tuple of delimiters for (fields, values, value lists)
Returns:

None

presto.Multiprocessing.feedSeqQueue(alive, data_queue, seq_file, index_func=None, index_args={})

Feeds the data queue with SeqRecord objects

Parameters:
  • alive – multiprocessing.Value boolean controlling whether processing continues; when False function returns
  • data_queue – multiprocessing.Queue to hold data for processing
  • seq_file – Sequence file to read input from
  • index_func – Function to use to define sequence sets if None do not index sets and feed individual records
  • index_args – Dictionary of arguments to pass to index_func
Returns:

None

presto.Multiprocessing.manageProcesses(feed_func, work_func, collect_func, feed_args={}, work_args={}, collect_args={}, nproc=None, queue_size=None)

Manages feeder, worker and collector processes

Parameters:
  • feed_func (function) – Data Queue feeder function.
  • work_func (function) – Worker function.
  • collect_func (function) – Result Queue collector function.
  • feed_args (dict) – Dictionary of arguments to pass to feed_func.
  • work_args (dict) – Dictionary of arguments to pass to work_func.
  • collect_args (dict) – Dictionary of arguments to pass to collect_func.
  • nproc (int) – Number of processQueue processes; if None defaults to the number of CPUs
  • queue_size (int) – Maximum size of the argument queue; if None defaults to 2*nproc
Returns:

Dictionary of collector results

Return type:

dict

presto.Multiprocessing.processSeqQueue(alive, data_queue, result_queue, process_func, process_args={})

Pulls from data queue, performs calculations, and feeds results queue

Parameters:
  • alive – multiprocessing.Value boolean controlling whether processing continues; when False function returns
  • data_queue – multiprocessing.Queue holding data to process
  • result_queue – multiprocessing.Queue to hold processed results
  • process_func – function to use for processing sequences
  • process_args – Dictionary of arguments to pass to process_func
Returns:

None

presto.Sequence

Sequence processing functions

class presto.Sequence.AssemblyRecord(seq=None)

Bases: object

A class defining a paired-end assembly result

overlap
class presto.Sequence.AssemblyStats(n)

Bases: object

Class containing p-value and z-score matrices for scoring assemblies

class presto.Sequence.PrimerAlignment(seq=None)

Bases: object

A class defining a primer alignment result

__bool__()

Boolean evaluation of the alignment

Returns:evaluates to the value of the valid attribute
Return type:int
__len__()

Length of alignment

Returns:length of align_seq attribute
Return type:int
presto.Sequence.alignAssembly(head_seq, tail_seq, alpha=1e-05, max_error=0.3, min_len=8, max_len=1000, scan_reverse=False, assembly_stats=None, score_dict={('-', '-'): 0, ('-', '.'): 0, ('-', 'A'): 0, ('-', 'B'): 0, ('-', 'C'): 0, ('-', 'D'): 0, ('-', 'G'): 0, ('-', 'H'): 0, ('-', 'K'): 0, ('-', 'M'): 0, ('-', 'N'): 0, ('-', 'R'): 0, ('-', 'S'): 0, ('-', 'T'): 0, ('-', 'V'): 0, ('-', 'W'): 0, ('-', 'Y'): 0, ('.', '-'): 0, ('.', '.'): 0, ('.', 'A'): 0, ('.', 'B'): 0, ('.', 'C'): 0, ('.', 'D'): 0, ('.', 'G'): 0, ('.', 'H'): 0, ('.', 'K'): 0, ('.', 'M'): 0, ('.', 'N'): 0, ('.', 'R'): 0, ('.', 'S'): 0, ('.', 'T'): 0, ('.', 'V'): 0, ('.', 'W'): 0, ('.', 'Y'): 0, ('A', '-'): 0, ('A', '.'): 0, ('A', 'A'): 1, ('A', 'B'): 0, ('A', 'C'): 0, ('A', 'D'): 1, ('A', 'G'): 0, ('A', 'H'): 1, ('A', 'K'): 0, ('A', 'M'): 1, ('A', 'N'): 1, ('A', 'R'): 1, ('A', 'S'): 0, ('A', 'T'): 0, ('A', 'V'): 1, ('A', 'W'): 1, ('A', 'Y'): 0, ('B', '-'): 0, ('B', '.'): 0, ('B', 'A'): 0, ('B', 'B'): 1, ('B', 'C'): 1, ('B', 'D'): 1, ('B', 'G'): 1, ('B', 'H'): 1, ('B', 'K'): 1, ('B', 'M'): 1, ('B', 'N'): 1, ('B', 'R'): 1, ('B', 'S'): 1, ('B', 'T'): 1, ('B', 'V'): 1, ('B', 'W'): 1, ('B', 'Y'): 1, ('C', '-'): 0, ('C', '.'): 0, ('C', 'A'): 0, ('C', 'B'): 1, ('C', 'C'): 1, ('C', 'D'): 0, ('C', 'G'): 0, ('C', 'H'): 1, ('C', 'K'): 0, ('C', 'M'): 1, ('C', 'N'): 1, ('C', 'R'): 0, ('C', 'S'): 1, ('C', 'T'): 0, ('C', 'V'): 1, ('C', 'W'): 0, ('C', 'Y'): 1, ('D', '-'): 0, ('D', '.'): 0, ('D', 'A'): 1, ('D', 'B'): 1, ('D', 'C'): 0, ('D', 'D'): 1, ('D', 'G'): 1, ('D', 'H'): 1, ('D', 'K'): 1, ('D', 'M'): 1, ('D', 'N'): 1, ('D', 'R'): 1, ('D', 'S'): 1, ('D', 'T'): 1, ('D', 'V'): 1, ('D', 'W'): 1, ('D', 'Y'): 1, ('G', '-'): 0, ('G', '.'): 0, ('G', 'A'): 0, ('G', 'B'): 1, ('G', 'C'): 0, ('G', 'D'): 1, ('G', 'G'): 1, ('G', 'H'): 0, ('G', 'K'): 1, ('G', 'M'): 0, ('G', 'N'): 1, ('G', 'R'): 1, ('G', 'S'): 1, ('G', 'T'): 0, ('G', 'V'): 1, ('G', 'W'): 0, ('G', 'Y'): 0, ('H', '-'): 0, ('H', '.'): 0, ('H', 'A'): 1, ('H', 'B'): 1, ('H', 'C'): 1, ('H', 'D'): 1, ('H', 'G'): 0, ('H', 'H'): 1, ('H', 'K'): 1, ('H', 'M'): 1, ('H', 'N'): 1, ('H', 'R'): 1, ('H', 'S'): 1, ('H', 'T'): 1, ('H', 'V'): 1, ('H', 'W'): 1, ('H', 'Y'): 1, ('K', '-'): 0, ('K', '.'): 0, ('K', 'A'): 0, ('K', 'B'): 1, ('K', 'C'): 0, ('K', 'D'): 1, ('K', 'G'): 1, ('K', 'H'): 1, ('K', 'K'): 1, ('K', 'M'): 0, ('K', 'N'): 1, ('K', 'R'): 1, ('K', 'S'): 1, ('K', 'T'): 1, ('K', 'V'): 1, ('K', 'W'): 1, ('K', 'Y'): 1, ('M', '-'): 0, ('M', '.'): 0, ('M', 'A'): 1, ('M', 'B'): 1, ('M', 'C'): 1, ('M', 'D'): 1, ('M', 'G'): 0, ('M', 'H'): 1, ('M', 'K'): 0, ('M', 'M'): 1, ('M', 'N'): 1, ('M', 'R'): 1, ('M', 'S'): 1, ('M', 'T'): 0, ('M', 'V'): 1, ('M', 'W'): 1, ('M', 'Y'): 1, ('N', '-'): 1, ('N', '.'): 1, ('N', 'A'): 1, ('N', 'B'): 1, ('N', 'C'): 1, ('N', 'D'): 1, ('N', 'G'): 1, ('N', 'H'): 1, ('N', 'K'): 1, ('N', 'M'): 1, ('N', 'N'): 1, ('N', 'R'): 1, ('N', 'S'): 1, ('N', 'T'): 1, ('N', 'V'): 1, ('N', 'W'): 1, ('N', 'Y'): 1, ('R', '-'): 0, ('R', '.'): 0, ('R', 'A'): 1, ('R', 'B'): 1, ('R', 'C'): 0, ('R', 'D'): 1, ('R', 'G'): 1, ('R', 'H'): 1, ('R', 'K'): 1, ('R', 'M'): 1, ('R', 'N'): 1, ('R', 'R'): 1, ('R', 'S'): 1, ('R', 'T'): 0, ('R', 'V'): 1, ('R', 'W'): 1, ('R', 'Y'): 0, ('S', '-'): 0, ('S', '.'): 0, ('S', 'A'): 0, ('S', 'B'): 1, ('S', 'C'): 1, ('S', 'D'): 1, ('S', 'G'): 1, ('S', 'H'): 1, ('S', 'K'): 1, ('S', 'M'): 1, ('S', 'N'): 1, ('S', 'R'): 1, ('S', 'S'): 1, ('S', 'T'): 0, ('S', 'V'): 1, ('S', 'W'): 0, ('S', 'Y'): 1, ('T', '-'): 0, ('T', '.'): 0, ('T', 'A'): 0, ('T', 'B'): 1, ('T', 'C'): 0, ('T', 'D'): 1, ('T', 'G'): 0, ('T', 'H'): 1, ('T', 'K'): 1, ('T', 'M'): 0, ('T', 'N'): 1, ('T', 'R'): 0, ('T', 'S'): 0, ('T', 'T'): 1, ('T', 'V'): 0, ('T', 'W'): 1, ('T', 'Y'): 1, ('V', '-'): 0, ('V', '.'): 0, ('V', 'A'): 1, ('V', 'B'): 1, ('V', 'C'): 1, ('V', 'D'): 1, ('V', 'G'): 1, ('V', 'H'): 1, ('V', 'K'): 1, ('V', 'M'): 1, ('V', 'N'): 1, ('V', 'R'): 1, ('V', 'S'): 1, ('V', 'T'): 0, ('V', 'V'): 1, ('V', 'W'): 1, ('V', 'Y'): 1, ('W', '-'): 0, ('W', '.'): 0, ('W', 'A'): 1, ('W', 'B'): 1, ('W', 'C'): 0, ('W', 'D'): 1, ('W', 'G'): 0, ('W', 'H'): 1, ('W', 'K'): 1, ('W', 'M'): 1, ('W', 'N'): 1, ('W', 'R'): 1, ('W', 'S'): 0, ('W', 'T'): 1, ('W', 'V'): 1, ('W', 'W'): 1, ('W', 'Y'): 1, ('Y', '-'): 0, ('Y', '.'): 0, ('Y', 'A'): 0, ('Y', 'B'): 1, ('Y', 'C'): 1, ('Y', 'D'): 1, ('Y', 'G'): 0, ('Y', 'H'): 1, ('Y', 'K'): 1, ('Y', 'M'): 1, ('Y', 'N'): 1, ('Y', 'R'): 0, ('Y', 'S'): 1, ('Y', 'T'): 1, ('Y', 'V'): 1, ('Y', 'W'): 1, ('Y', 'Y'): 1})

Stitches two sequences together by aligning the ends

Parameters:
  • head_seq – the head SeqRecord.
  • head_seq – the tail SeqRecord.
  • alpha – the minimum p-value for a valid assembly.
  • max_error – the maximum error rate for a valid assembly.
  • min_len – minimum length of overlap to test.
  • max_len – maximum length of overlap to test.
  • scan_reverse – if True allow the head sequence to overhang the end of the tail sequence if False end alignment scan at end of tail sequence or start of head sequence.
  • assembly_stats – optional successes by trials numpy.array of p-values.
  • score_dict – optional dictionary of character scores in the . form {(char1, char2): score}.
Returns:

assembled sequence object.

Return type:

AssemblyRecord

presto.Sequence.calculateDiversity(seq_list, score_dict=getDNAScoreDict())

Determine the average pairwise error rate for a list of sequences

Parameters:
  • seq_list – List of SeqRecord objects to score
  • score_dict – Optional dictionary of alignment scores as {(char1, char2): score}
Returns:

Average pairwise error rate for the list of sequences

Return type:

float

presto.Sequence.calculateSetError(seq_list, ref_seq, ignore_chars=['n', 'N'], score_dict=getDNAScoreDict())

Counts the occurrence of nucleotide mismatches from a reference in a set of sequences

Parameters:
  • seq_list – list of SeqRecord objects with aligned sequences.
  • ref_seq – SeqRecord object containing the reference sequence to match against.
  • ignore_chars – list of characters to exclude from mismatch counts.
  • score_dict – optional dictionary of alignment scores as {(char1, char2): score}.
Returns:

error rate for the set.

Return type:

float

presto.Sequence.checkSeqEqual(seq1, seq2, ignore_chars={'-', '.', 'N', 'n'})

Determine if two sequences are equal, excluding missing positions

Parameters:
  • seq1 – SeqRecord object
  • seq2 – SeqRecord object
  • ignore_chars – Set of characters to ignore
Returns:

True if the sequences are equal

Return type:

bool

presto.Sequence.compilePrimers(primers)

Translates IUPAC Ambiguous Nucleotide characters to regular expressions and compiles them

Parameters:key – Dictionary of sequences to translate
Returns:Dictionary of compiled regular expressions
Return type:dict
presto.Sequence.consensusUnify(data, field, delimiter=('|', '=', ', '))

Reassigns all annotations to the consensus annotation in group

Parameters:
  • data – SeqData object contain sequences to process.
  • field – field containing annotations to collapse.
  • delimiter – a tuple of delimiters for (annotations, field/values, value lists).
Returns:

modified sequences.

Return type:

SeqResult

presto.Sequence.deleteSeqPositions(seq, positions)

Deletes a list of positions from a SeqRecord

Parameters:
  • seq – SeqRecord objects
  • positions – Set of positions (indices) to delete
Returns:

Modified SeqRecord with the specified positions removed

Return type:

SeqRecord

presto.Sequence.deletionUnify(data, field, delimiter=('|', '=', ', '))

Removes all sequences with differing annotations in a group

Parameters:
  • data – SeqData object contain sequences to process.
  • field – field containing annotations to collapse.
  • delimiter – a tuple of delimiters for (annotations, field/values, value lists).
Returns:

modified sequences.

Return type:

SeqResult

presto.Sequence.extractAlignment(seq_record, start, length, rev_primer=False)

Extracts a subsequence from sequence

Parameters:
  • data – SeqRecord to process.
  • start – position where subsequence starts.
  • length – the length of the subsequence to extract.
  • rev_primer – if True extract relative to the tail end of the sequence.
Returns:

extraction results as an alignment object

Return type:

presto.Sequence.PrimerAlignment

presto.Sequence.filterLength(data, min_length=250, inner=True, missing_chars='N.n-')

Filters sequences by length

Parameters:
  • data (SeqData) – a SeqData object with a single SeqRecord to process.
  • min_length (int) – the minimum length allowed.
  • inner (bool) – if True exclude outer missing characters from calculation.
  • missing_chars (str) – a string of missing character values.
Returns:

SeqResult object.

Return type:

SeqResult

presto.Sequence.filterMissing(data, max_missing=10, inner=True, missing_chars='N.n-')

Filters sequences by number of missing nucleotides

Parameters:
  • data (SeqData) – SeqData object with a single SeqRecord to process.
  • max_missing (int) – the maximum number of allowed ambiguous characters.
  • inner (bool) – if True exclude outer missing characters from calculation.
  • missing_chars (str) – a string of missing character values.
Returns:

SeqResult object.

Return type:

SeqResult

presto.Sequence.filterQuality(data, min_qual=0, inner=True, missing_chars='N.n-')

Filters sequences by quality score

Parameters:
  • data (SeqData) – a SeqData object with a single SeqRecord to process.
  • min_qual (int) – minimum mean quality score for retained sequences.
  • inner (bool) – if True exclude outer missing characters from calculation.
  • missing_chars (str) – a string of missing character values.
Returns:

SeqResult object.

Return type:

SeqResult

presto.Sequence.filterRepeats(data, max_repeat=15, include_missing=False, inner=True, missing_chars='N.n-')

Filters sequences by fraction of ambiguous nucleotides

Parameters:
  • data (SeqData) – a SeqData object with a single SeqRecord to process.
  • max_repeat (int) – the maximum number of allowed repeating characters.
  • include_missing (int) – if True count ambiguous character repeats; if False do not consider ambiguous character repeats.
  • inner (int) – if True exclude outer missing characters from calculation.
  • missing_chars (str) – a string of missing character values.
Returns:

SeqResult object.

Return type:

SeqResult

presto.Sequence.findGapPositions(seq_list, max_gap, gap_chars={'-', '.'})

Finds positions in a set of aligned sequences with a high number of gap characters.

Parameters:
  • seq_list – List of SeqRecord objects with aligned sequences
  • max_gap – Float of the maximum gap frequency to consider a position as non-gapped
  • gap_chars – Set of characters to consider as gaps
Returns:

Positions (indices) with gap frequency greater than max_gap

Return type:

list

presto.Sequence.frequencyConsensus(seq_list, min_freq=0.6, ignore_chars={'-', '.', 'N', 'n'})

Builds a consensus sequence from a set of sequences

Parameters:
  • set_seq – List of SeqRecord objects
  • min_freq – Frequency cutoff to assign a base
  • ignore_chars – Set of characters to exclude when building a consensus sequence
Returns:

Consensus SeqRecord object

Return type:

SeqRecord

presto.Sequence.getAAScoreDict(mask_score=None, gap_score=None)

Generates a score dictionary

Parameters:
  • mask_score – Tuple of length two defining scores for all matches against an X character for (a, b), with the score for character (a) taking precedence; if None score symmetrically according to IUPAC character identity
  • gap_score – Tuple of length two defining score for all matches against a [-, .] character for (a, b), with the score for character (a) taking precedence; if None score symmetrically according to IUPAC character identity
Returns:

Score dictionary with keys (char1, char2) mapping to scores

Return type:

dict

presto.Sequence.getDNAScoreDict(mask_score=None, gap_score=None)

Generates a score dictionary

Parameters:
  • mask_score – Tuple of length two defining scores for all matches against an N character for (a, b), with the score for character (a) taking precedence; if None score symmetrically according to IUPAC character identity
  • gap_score – Tuple of length two defining score for all matches against a [-, .] character for (a, b), with the score for character (a) taking precedence; if None score symmetrically according to IUPAC character identity
Returns:

Score dictionary with keys (char1, char2) mapping to scores

Return type:

dict

presto.Sequence.indexSeqSets(seq_dict, field='BARCODE', delimiter=('|', '=', ', '))

Identifies sets of sequences with the same ID field

Parameters:
  • seq_dict – a dictionary index of sequences returned from SeqIO.index()
  • field – the annotation field containing set IDs
  • delimiter – a tuple of delimiters for (fields, values, value lists)
Returns:

Dictionary mapping set name to a list of record names

Return type:

dict

presto.Sequence.joinAssembly(head_seq, tail_seq, gap=0, insert_seq=None)

Concatenates two sequences

Parameters:
  • head_seq – the head SeqRecord.
  • tail_seq – the tail SeqRecord.
  • gap – number of gap characters to insert between head and tail ignored if insert_seq is not None.
  • insert_seq – a string or Bio.Seq.Seq object, to insert between the head and tail; if None insert with N characters.
Returns:

assembled sequence object.

Return type:

AssemblyRecord

presto.Sequence.localAlignment(seq_record, primers, primers_regex=None, max_error=0.3, max_len=1000, rev_primer=False, skip_rc=False, gap_penalty=(1, 1), score_dict={('-', '-'): 0, ('-', '.'): 0, ('-', 'A'): 0, ('-', 'B'): 0, ('-', 'C'): 0, ('-', 'D'): 0, ('-', 'G'): 0, ('-', 'H'): 0, ('-', 'K'): 0, ('-', 'M'): 0, ('-', 'N'): 0, ('-', 'R'): 0, ('-', 'S'): 0, ('-', 'T'): 0, ('-', 'V'): 0, ('-', 'W'): 0, ('-', 'Y'): 0, ('.', '-'): 0, ('.', '.'): 0, ('.', 'A'): 0, ('.', 'B'): 0, ('.', 'C'): 0, ('.', 'D'): 0, ('.', 'G'): 0, ('.', 'H'): 0, ('.', 'K'): 0, ('.', 'M'): 0, ('.', 'N'): 0, ('.', 'R'): 0, ('.', 'S'): 0, ('.', 'T'): 0, ('.', 'V'): 0, ('.', 'W'): 0, ('.', 'Y'): 0, ('A', '-'): 0, ('A', '.'): 0, ('A', 'A'): 1, ('A', 'B'): 0, ('A', 'C'): 0, ('A', 'D'): 1, ('A', 'G'): 0, ('A', 'H'): 1, ('A', 'K'): 0, ('A', 'M'): 1, ('A', 'N'): 1, ('A', 'R'): 1, ('A', 'S'): 0, ('A', 'T'): 0, ('A', 'V'): 1, ('A', 'W'): 1, ('A', 'Y'): 0, ('B', '-'): 0, ('B', '.'): 0, ('B', 'A'): 0, ('B', 'B'): 1, ('B', 'C'): 1, ('B', 'D'): 1, ('B', 'G'): 1, ('B', 'H'): 1, ('B', 'K'): 1, ('B', 'M'): 1, ('B', 'N'): 1, ('B', 'R'): 1, ('B', 'S'): 1, ('B', 'T'): 1, ('B', 'V'): 1, ('B', 'W'): 1, ('B', 'Y'): 1, ('C', '-'): 0, ('C', '.'): 0, ('C', 'A'): 0, ('C', 'B'): 1, ('C', 'C'): 1, ('C', 'D'): 0, ('C', 'G'): 0, ('C', 'H'): 1, ('C', 'K'): 0, ('C', 'M'): 1, ('C', 'N'): 1, ('C', 'R'): 0, ('C', 'S'): 1, ('C', 'T'): 0, ('C', 'V'): 1, ('C', 'W'): 0, ('C', 'Y'): 1, ('D', '-'): 0, ('D', '.'): 0, ('D', 'A'): 1, ('D', 'B'): 1, ('D', 'C'): 0, ('D', 'D'): 1, ('D', 'G'): 1, ('D', 'H'): 1, ('D', 'K'): 1, ('D', 'M'): 1, ('D', 'N'): 1, ('D', 'R'): 1, ('D', 'S'): 1, ('D', 'T'): 1, ('D', 'V'): 1, ('D', 'W'): 1, ('D', 'Y'): 1, ('G', '-'): 0, ('G', '.'): 0, ('G', 'A'): 0, ('G', 'B'): 1, ('G', 'C'): 0, ('G', 'D'): 1, ('G', 'G'): 1, ('G', 'H'): 0, ('G', 'K'): 1, ('G', 'M'): 0, ('G', 'N'): 1, ('G', 'R'): 1, ('G', 'S'): 1, ('G', 'T'): 0, ('G', 'V'): 1, ('G', 'W'): 0, ('G', 'Y'): 0, ('H', '-'): 0, ('H', '.'): 0, ('H', 'A'): 1, ('H', 'B'): 1, ('H', 'C'): 1, ('H', 'D'): 1, ('H', 'G'): 0, ('H', 'H'): 1, ('H', 'K'): 1, ('H', 'M'): 1, ('H', 'N'): 1, ('H', 'R'): 1, ('H', 'S'): 1, ('H', 'T'): 1, ('H', 'V'): 1, ('H', 'W'): 1, ('H', 'Y'): 1, ('K', '-'): 0, ('K', '.'): 0, ('K', 'A'): 0, ('K', 'B'): 1, ('K', 'C'): 0, ('K', 'D'): 1, ('K', 'G'): 1, ('K', 'H'): 1, ('K', 'K'): 1, ('K', 'M'): 0, ('K', 'N'): 1, ('K', 'R'): 1, ('K', 'S'): 1, ('K', 'T'): 1, ('K', 'V'): 1, ('K', 'W'): 1, ('K', 'Y'): 1, ('M', '-'): 0, ('M', '.'): 0, ('M', 'A'): 1, ('M', 'B'): 1, ('M', 'C'): 1, ('M', 'D'): 1, ('M', 'G'): 0, ('M', 'H'): 1, ('M', 'K'): 0, ('M', 'M'): 1, ('M', 'N'): 1, ('M', 'R'): 1, ('M', 'S'): 1, ('M', 'T'): 0, ('M', 'V'): 1, ('M', 'W'): 1, ('M', 'Y'): 1, ('N', '-'): 0, ('N', '.'): 0, ('N', 'A'): 0, ('N', 'B'): 0, ('N', 'C'): 0, ('N', 'D'): 0, ('N', 'G'): 0, ('N', 'H'): 0, ('N', 'K'): 0, ('N', 'M'): 0, ('N', 'N'): 0, ('N', 'R'): 0, ('N', 'S'): 0, ('N', 'T'): 0, ('N', 'V'): 0, ('N', 'W'): 0, ('N', 'Y'): 0, ('R', '-'): 0, ('R', '.'): 0, ('R', 'A'): 1, ('R', 'B'): 1, ('R', 'C'): 0, ('R', 'D'): 1, ('R', 'G'): 1, ('R', 'H'): 1, ('R', 'K'): 1, ('R', 'M'): 1, ('R', 'N'): 1, ('R', 'R'): 1, ('R', 'S'): 1, ('R', 'T'): 0, ('R', 'V'): 1, ('R', 'W'): 1, ('R', 'Y'): 0, ('S', '-'): 0, ('S', '.'): 0, ('S', 'A'): 0, ('S', 'B'): 1, ('S', 'C'): 1, ('S', 'D'): 1, ('S', 'G'): 1, ('S', 'H'): 1, ('S', 'K'): 1, ('S', 'M'): 1, ('S', 'N'): 1, ('S', 'R'): 1, ('S', 'S'): 1, ('S', 'T'): 0, ('S', 'V'): 1, ('S', 'W'): 0, ('S', 'Y'): 1, ('T', '-'): 0, ('T', '.'): 0, ('T', 'A'): 0, ('T', 'B'): 1, ('T', 'C'): 0, ('T', 'D'): 1, ('T', 'G'): 0, ('T', 'H'): 1, ('T', 'K'): 1, ('T', 'M'): 0, ('T', 'N'): 1, ('T', 'R'): 0, ('T', 'S'): 0, ('T', 'T'): 1, ('T', 'V'): 0, ('T', 'W'): 1, ('T', 'Y'): 1, ('V', '-'): 0, ('V', '.'): 0, ('V', 'A'): 1, ('V', 'B'): 1, ('V', 'C'): 1, ('V', 'D'): 1, ('V', 'G'): 1, ('V', 'H'): 1, ('V', 'K'): 1, ('V', 'M'): 1, ('V', 'N'): 1, ('V', 'R'): 1, ('V', 'S'): 1, ('V', 'T'): 0, ('V', 'V'): 1, ('V', 'W'): 1, ('V', 'Y'): 1, ('W', '-'): 0, ('W', '.'): 0, ('W', 'A'): 1, ('W', 'B'): 1, ('W', 'C'): 0, ('W', 'D'): 1, ('W', 'G'): 0, ('W', 'H'): 1, ('W', 'K'): 1, ('W', 'M'): 1, ('W', 'N'): 1, ('W', 'R'): 1, ('W', 'S'): 0, ('W', 'T'): 1, ('W', 'V'): 1, ('W', 'W'): 1, ('W', 'Y'): 1, ('Y', '-'): 0, ('Y', '.'): 0, ('Y', 'A'): 0, ('Y', 'B'): 1, ('Y', 'C'): 1, ('Y', 'D'): 1, ('Y', 'G'): 0, ('Y', 'H'): 1, ('Y', 'K'): 1, ('Y', 'M'): 1, ('Y', 'N'): 1, ('Y', 'R'): 0, ('Y', 'S'): 1, ('Y', 'T'): 1, ('Y', 'V'): 1, ('Y', 'W'): 1, ('Y', 'Y'): 1})

Performs pairwise local alignment of a list of short sequences against a long sequence

Parameters:
  • seq_record – a SeqRecord object to align primers against
  • primers – dictionary of {names: short IUPAC ambiguous sequence strings}
  • primers_regex – optional dictionary of {names: compiled primer regular expressions}
  • max_error – maximum acceptable error rate before aligning reverse complement
  • max_len – maximum length of sample sequence to align
  • rev_primer – if True align with the tail end of the sequence
  • skip_rc – if True do not check reverse complement sequences
  • gap_penalty – a tuple of positive (gap open, gap extend) penalties
  • score_dict – optional dictionary of alignment scores as {(char1, char2): score}
Returns:

primer alignment result object

Return type:

presto.Sequence.PrimerAlignment

presto.Sequence.maskQuality(data, min_qual=0)

Masks characters by in sequence by quality score

Parameters:
  • data (SeqData) – a SeqData object with a single SeqRecord to process.
  • min_qual (int) – minimum quality for retained characters.
Returns:

SeqResult object.

Return type:

SeqResult

presto.Sequence.maskSeq(align, mode='mask', barcode=False, barcode_field='BARCODE', primer_field='PRIMER', delimiter=('|', '=', ', '))

Create an output sequence with primers masked or cut

Parameters:
  • align – a PrimerAlignment object.
  • mode – defines the action taken; one of ‘cut’, ‘mask’, ‘tag’ or ‘trim’.
  • barcode – if True add sequence preceding primer to description.
  • barcode_field – name of the output barcode annotation.
  • primer_field – name of the output primer annotation.
  • delimiter – a tuple of delimiters for (annotations, field/values, value lists).
Returns:

masked sequence.

Return type:

Bio.SeqRecord.SeqRecord

presto.Sequence.overlapConsensus(head_seq, tail_seq, ignore_chars={'-', '.', 'N', 'n'})

Creates a consensus overlap sequences from two segments

Parameters:
  • head_seq – the overlap head SeqRecord.
  • tail_seq – the overlap tail SeqRecord.
  • ignore_chars – list of characters which do not contribute to consensus.
Returns:

A SeqRecord object with consensus characters and quality scores.

Return type:

SeqRecord

presto.Sequence.qualityConsensus(seq_list, min_qual=0, min_freq=0.6, dependent=False, ignore_chars={'-', '.', 'N', 'n'})

Builds a consensus sequence from a set of sequences

Parameters:
  • seq_list – List of SeqRecord objects
  • min_qual – Quality cutoff to assign a base
  • min_freq – Frequency cutoff to assign a base
  • dependent – If False assume sequences are independent for quality calculation
  • ignore_chars – Set of characters to exclude when building a consensus sequence
Returns:

Consensus SeqRecord object

Return type:

SeqRecord

presto.Sequence.referenceAssembly(head_seq, tail_seq, ref_dict, ref_db, min_ident=0.5, evalue=1e-05, max_hits=100, fill=False, aligner='usearch', aligner_exec='usearch', score_dict={('-', '-'): 0, ('-', '.'): 0, ('-', 'A'): 0, ('-', 'B'): 0, ('-', 'C'): 0, ('-', 'D'): 0, ('-', 'G'): 0, ('-', 'H'): 0, ('-', 'K'): 0, ('-', 'M'): 0, ('-', 'N'): 0, ('-', 'R'): 0, ('-', 'S'): 0, ('-', 'T'): 0, ('-', 'V'): 0, ('-', 'W'): 0, ('-', 'Y'): 0, ('.', '-'): 0, ('.', '.'): 0, ('.', 'A'): 0, ('.', 'B'): 0, ('.', 'C'): 0, ('.', 'D'): 0, ('.', 'G'): 0, ('.', 'H'): 0, ('.', 'K'): 0, ('.', 'M'): 0, ('.', 'N'): 0, ('.', 'R'): 0, ('.', 'S'): 0, ('.', 'T'): 0, ('.', 'V'): 0, ('.', 'W'): 0, ('.', 'Y'): 0, ('A', '-'): 0, ('A', '.'): 0, ('A', 'A'): 1, ('A', 'B'): 0, ('A', 'C'): 0, ('A', 'D'): 1, ('A', 'G'): 0, ('A', 'H'): 1, ('A', 'K'): 0, ('A', 'M'): 1, ('A', 'N'): 1, ('A', 'R'): 1, ('A', 'S'): 0, ('A', 'T'): 0, ('A', 'V'): 1, ('A', 'W'): 1, ('A', 'Y'): 0, ('B', '-'): 0, ('B', '.'): 0, ('B', 'A'): 0, ('B', 'B'): 1, ('B', 'C'): 1, ('B', 'D'): 1, ('B', 'G'): 1, ('B', 'H'): 1, ('B', 'K'): 1, ('B', 'M'): 1, ('B', 'N'): 1, ('B', 'R'): 1, ('B', 'S'): 1, ('B', 'T'): 1, ('B', 'V'): 1, ('B', 'W'): 1, ('B', 'Y'): 1, ('C', '-'): 0, ('C', '.'): 0, ('C', 'A'): 0, ('C', 'B'): 1, ('C', 'C'): 1, ('C', 'D'): 0, ('C', 'G'): 0, ('C', 'H'): 1, ('C', 'K'): 0, ('C', 'M'): 1, ('C', 'N'): 1, ('C', 'R'): 0, ('C', 'S'): 1, ('C', 'T'): 0, ('C', 'V'): 1, ('C', 'W'): 0, ('C', 'Y'): 1, ('D', '-'): 0, ('D', '.'): 0, ('D', 'A'): 1, ('D', 'B'): 1, ('D', 'C'): 0, ('D', 'D'): 1, ('D', 'G'): 1, ('D', 'H'): 1, ('D', 'K'): 1, ('D', 'M'): 1, ('D', 'N'): 1, ('D', 'R'): 1, ('D', 'S'): 1, ('D', 'T'): 1, ('D', 'V'): 1, ('D', 'W'): 1, ('D', 'Y'): 1, ('G', '-'): 0, ('G', '.'): 0, ('G', 'A'): 0, ('G', 'B'): 1, ('G', 'C'): 0, ('G', 'D'): 1, ('G', 'G'): 1, ('G', 'H'): 0, ('G', 'K'): 1, ('G', 'M'): 0, ('G', 'N'): 1, ('G', 'R'): 1, ('G', 'S'): 1, ('G', 'T'): 0, ('G', 'V'): 1, ('G', 'W'): 0, ('G', 'Y'): 0, ('H', '-'): 0, ('H', '.'): 0, ('H', 'A'): 1, ('H', 'B'): 1, ('H', 'C'): 1, ('H', 'D'): 1, ('H', 'G'): 0, ('H', 'H'): 1, ('H', 'K'): 1, ('H', 'M'): 1, ('H', 'N'): 1, ('H', 'R'): 1, ('H', 'S'): 1, ('H', 'T'): 1, ('H', 'V'): 1, ('H', 'W'): 1, ('H', 'Y'): 1, ('K', '-'): 0, ('K', '.'): 0, ('K', 'A'): 0, ('K', 'B'): 1, ('K', 'C'): 0, ('K', 'D'): 1, ('K', 'G'): 1, ('K', 'H'): 1, ('K', 'K'): 1, ('K', 'M'): 0, ('K', 'N'): 1, ('K', 'R'): 1, ('K', 'S'): 1, ('K', 'T'): 1, ('K', 'V'): 1, ('K', 'W'): 1, ('K', 'Y'): 1, ('M', '-'): 0, ('M', '.'): 0, ('M', 'A'): 1, ('M', 'B'): 1, ('M', 'C'): 1, ('M', 'D'): 1, ('M', 'G'): 0, ('M', 'H'): 1, ('M', 'K'): 0, ('M', 'M'): 1, ('M', 'N'): 1, ('M', 'R'): 1, ('M', 'S'): 1, ('M', 'T'): 0, ('M', 'V'): 1, ('M', 'W'): 1, ('M', 'Y'): 1, ('N', '-'): 1, ('N', '.'): 1, ('N', 'A'): 1, ('N', 'B'): 1, ('N', 'C'): 1, ('N', 'D'): 1, ('N', 'G'): 1, ('N', 'H'): 1, ('N', 'K'): 1, ('N', 'M'): 1, ('N', 'N'): 1, ('N', 'R'): 1, ('N', 'S'): 1, ('N', 'T'): 1, ('N', 'V'): 1, ('N', 'W'): 1, ('N', 'Y'): 1, ('R', '-'): 0, ('R', '.'): 0, ('R', 'A'): 1, ('R', 'B'): 1, ('R', 'C'): 0, ('R', 'D'): 1, ('R', 'G'): 1, ('R', 'H'): 1, ('R', 'K'): 1, ('R', 'M'): 1, ('R', 'N'): 1, ('R', 'R'): 1, ('R', 'S'): 1, ('R', 'T'): 0, ('R', 'V'): 1, ('R', 'W'): 1, ('R', 'Y'): 0, ('S', '-'): 0, ('S', '.'): 0, ('S', 'A'): 0, ('S', 'B'): 1, ('S', 'C'): 1, ('S', 'D'): 1, ('S', 'G'): 1, ('S', 'H'): 1, ('S', 'K'): 1, ('S', 'M'): 1, ('S', 'N'): 1, ('S', 'R'): 1, ('S', 'S'): 1, ('S', 'T'): 0, ('S', 'V'): 1, ('S', 'W'): 0, ('S', 'Y'): 1, ('T', '-'): 0, ('T', '.'): 0, ('T', 'A'): 0, ('T', 'B'): 1, ('T', 'C'): 0, ('T', 'D'): 1, ('T', 'G'): 0, ('T', 'H'): 1, ('T', 'K'): 1, ('T', 'M'): 0, ('T', 'N'): 1, ('T', 'R'): 0, ('T', 'S'): 0, ('T', 'T'): 1, ('T', 'V'): 0, ('T', 'W'): 1, ('T', 'Y'): 1, ('V', '-'): 0, ('V', '.'): 0, ('V', 'A'): 1, ('V', 'B'): 1, ('V', 'C'): 1, ('V', 'D'): 1, ('V', 'G'): 1, ('V', 'H'): 1, ('V', 'K'): 1, ('V', 'M'): 1, ('V', 'N'): 1, ('V', 'R'): 1, ('V', 'S'): 1, ('V', 'T'): 0, ('V', 'V'): 1, ('V', 'W'): 1, ('V', 'Y'): 1, ('W', '-'): 0, ('W', '.'): 0, ('W', 'A'): 1, ('W', 'B'): 1, ('W', 'C'): 0, ('W', 'D'): 1, ('W', 'G'): 0, ('W', 'H'): 1, ('W', 'K'): 1, ('W', 'M'): 1, ('W', 'N'): 1, ('W', 'R'): 1, ('W', 'S'): 0, ('W', 'T'): 1, ('W', 'V'): 1, ('W', 'W'): 1, ('W', 'Y'): 1, ('Y', '-'): 0, ('Y', '.'): 0, ('Y', 'A'): 0, ('Y', 'B'): 1, ('Y', 'C'): 1, ('Y', 'D'): 1, ('Y', 'G'): 0, ('Y', 'H'): 1, ('Y', 'K'): 1, ('Y', 'M'): 1, ('Y', 'N'): 1, ('Y', 'R'): 0, ('Y', 'S'): 1, ('Y', 'T'): 1, ('Y', 'V'): 1, ('Y', 'W'): 1, ('Y', 'Y'): 1})

Stitches two sequences together by aligning against a reference database

Parameters:
  • head_seq – the head SeqRecord.
  • head_seq – the tail SeqRecord.
  • ref_dict – a dictionary of reference SeqRecord objects.
  • ref_db – the path and name of the reference database.
  • min_ident – the minimum identity for a valid assembly.
  • evalue – the E-value cut-off for ublast.
  • max_hits – the maxhits output limit for ublast.
  • fill – if False non-overlapping regions will be assigned Ns; if True non-overlapping regions will be filled with the reference sequence.
  • aligner – the alignment tool; one of ‘blastn’ or ‘usearch’.
  • aligner_exec – the path to the alignment tool executable.
  • score_dict – optional dictionary of character scores in the form {(char1, char2): score}.
Returns:

assembled sequence object.

Return type:

AssemblyRecord

presto.Sequence.reverseComplement(seq)

Takes the reverse complement of a sequence

Parameters:seq – a SeqRecord object, Seq object or string to reverse complement
Returns:Object of the same type as the input with the reverse complement sequence
Return type:Seq
presto.Sequence.scoreAA(a, b, mask_score=None, gap_score=None)

Returns the score for a pair of IUPAC Extended Protein characters

Parameters:
  • a – First character
  • b – Second character
  • mask_score – Tuple of length two defining scores for all matches against an X character for (a, b), with the score for character (a) taking precedence; if None score symmetrically according to IUPAC character identity
  • gap_score – Tuple of length two defining score for all matches against a gap (-, .) character for (a, b), with the score for character (a) taking precedence; if None score symmetrically according to IUPAC character identity
Returns:

Score for the character pair

Return type:

int

presto.Sequence.scoreAlignment(seq_record, primers, start=0, rev_primer=False, score_dict={('-', '-'): 0, ('-', '.'): 0, ('-', 'A'): 0, ('-', 'B'): 0, ('-', 'C'): 0, ('-', 'D'): 0, ('-', 'G'): 0, ('-', 'H'): 0, ('-', 'K'): 0, ('-', 'M'): 0, ('-', 'N'): 0, ('-', 'R'): 0, ('-', 'S'): 0, ('-', 'T'): 0, ('-', 'V'): 0, ('-', 'W'): 0, ('-', 'Y'): 0, ('.', '-'): 0, ('.', '.'): 0, ('.', 'A'): 0, ('.', 'B'): 0, ('.', 'C'): 0, ('.', 'D'): 0, ('.', 'G'): 0, ('.', 'H'): 0, ('.', 'K'): 0, ('.', 'M'): 0, ('.', 'N'): 0, ('.', 'R'): 0, ('.', 'S'): 0, ('.', 'T'): 0, ('.', 'V'): 0, ('.', 'W'): 0, ('.', 'Y'): 0, ('A', '-'): 0, ('A', '.'): 0, ('A', 'A'): 1, ('A', 'B'): 0, ('A', 'C'): 0, ('A', 'D'): 1, ('A', 'G'): 0, ('A', 'H'): 1, ('A', 'K'): 0, ('A', 'M'): 1, ('A', 'N'): 1, ('A', 'R'): 1, ('A', 'S'): 0, ('A', 'T'): 0, ('A', 'V'): 1, ('A', 'W'): 1, ('A', 'Y'): 0, ('B', '-'): 0, ('B', '.'): 0, ('B', 'A'): 0, ('B', 'B'): 1, ('B', 'C'): 1, ('B', 'D'): 1, ('B', 'G'): 1, ('B', 'H'): 1, ('B', 'K'): 1, ('B', 'M'): 1, ('B', 'N'): 1, ('B', 'R'): 1, ('B', 'S'): 1, ('B', 'T'): 1, ('B', 'V'): 1, ('B', 'W'): 1, ('B', 'Y'): 1, ('C', '-'): 0, ('C', '.'): 0, ('C', 'A'): 0, ('C', 'B'): 1, ('C', 'C'): 1, ('C', 'D'): 0, ('C', 'G'): 0, ('C', 'H'): 1, ('C', 'K'): 0, ('C', 'M'): 1, ('C', 'N'): 1, ('C', 'R'): 0, ('C', 'S'): 1, ('C', 'T'): 0, ('C', 'V'): 1, ('C', 'W'): 0, ('C', 'Y'): 1, ('D', '-'): 0, ('D', '.'): 0, ('D', 'A'): 1, ('D', 'B'): 1, ('D', 'C'): 0, ('D', 'D'): 1, ('D', 'G'): 1, ('D', 'H'): 1, ('D', 'K'): 1, ('D', 'M'): 1, ('D', 'N'): 1, ('D', 'R'): 1, ('D', 'S'): 1, ('D', 'T'): 1, ('D', 'V'): 1, ('D', 'W'): 1, ('D', 'Y'): 1, ('G', '-'): 0, ('G', '.'): 0, ('G', 'A'): 0, ('G', 'B'): 1, ('G', 'C'): 0, ('G', 'D'): 1, ('G', 'G'): 1, ('G', 'H'): 0, ('G', 'K'): 1, ('G', 'M'): 0, ('G', 'N'): 1, ('G', 'R'): 1, ('G', 'S'): 1, ('G', 'T'): 0, ('G', 'V'): 1, ('G', 'W'): 0, ('G', 'Y'): 0, ('H', '-'): 0, ('H', '.'): 0, ('H', 'A'): 1, ('H', 'B'): 1, ('H', 'C'): 1, ('H', 'D'): 1, ('H', 'G'): 0, ('H', 'H'): 1, ('H', 'K'): 1, ('H', 'M'): 1, ('H', 'N'): 1, ('H', 'R'): 1, ('H', 'S'): 1, ('H', 'T'): 1, ('H', 'V'): 1, ('H', 'W'): 1, ('H', 'Y'): 1, ('K', '-'): 0, ('K', '.'): 0, ('K', 'A'): 0, ('K', 'B'): 1, ('K', 'C'): 0, ('K', 'D'): 1, ('K', 'G'): 1, ('K', 'H'): 1, ('K', 'K'): 1, ('K', 'M'): 0, ('K', 'N'): 1, ('K', 'R'): 1, ('K', 'S'): 1, ('K', 'T'): 1, ('K', 'V'): 1, ('K', 'W'): 1, ('K', 'Y'): 1, ('M', '-'): 0, ('M', '.'): 0, ('M', 'A'): 1, ('M', 'B'): 1, ('M', 'C'): 1, ('M', 'D'): 1, ('M', 'G'): 0, ('M', 'H'): 1, ('M', 'K'): 0, ('M', 'M'): 1, ('M', 'N'): 1, ('M', 'R'): 1, ('M', 'S'): 1, ('M', 'T'): 0, ('M', 'V'): 1, ('M', 'W'): 1, ('M', 'Y'): 1, ('N', '-'): 0, ('N', '.'): 0, ('N', 'A'): 0, ('N', 'B'): 0, ('N', 'C'): 0, ('N', 'D'): 0, ('N', 'G'): 0, ('N', 'H'): 0, ('N', 'K'): 0, ('N', 'M'): 0, ('N', 'N'): 0, ('N', 'R'): 0, ('N', 'S'): 0, ('N', 'T'): 0, ('N', 'V'): 0, ('N', 'W'): 0, ('N', 'Y'): 0, ('R', '-'): 0, ('R', '.'): 0, ('R', 'A'): 1, ('R', 'B'): 1, ('R', 'C'): 0, ('R', 'D'): 1, ('R', 'G'): 1, ('R', 'H'): 1, ('R', 'K'): 1, ('R', 'M'): 1, ('R', 'N'): 1, ('R', 'R'): 1, ('R', 'S'): 1, ('R', 'T'): 0, ('R', 'V'): 1, ('R', 'W'): 1, ('R', 'Y'): 0, ('S', '-'): 0, ('S', '.'): 0, ('S', 'A'): 0, ('S', 'B'): 1, ('S', 'C'): 1, ('S', 'D'): 1, ('S', 'G'): 1, ('S', 'H'): 1, ('S', 'K'): 1, ('S', 'M'): 1, ('S', 'N'): 1, ('S', 'R'): 1, ('S', 'S'): 1, ('S', 'T'): 0, ('S', 'V'): 1, ('S', 'W'): 0, ('S', 'Y'): 1, ('T', '-'): 0, ('T', '.'): 0, ('T', 'A'): 0, ('T', 'B'): 1, ('T', 'C'): 0, ('T', 'D'): 1, ('T', 'G'): 0, ('T', 'H'): 1, ('T', 'K'): 1, ('T', 'M'): 0, ('T', 'N'): 1, ('T', 'R'): 0, ('T', 'S'): 0, ('T', 'T'): 1, ('T', 'V'): 0, ('T', 'W'): 1, ('T', 'Y'): 1, ('V', '-'): 0, ('V', '.'): 0, ('V', 'A'): 1, ('V', 'B'): 1, ('V', 'C'): 1, ('V', 'D'): 1, ('V', 'G'): 1, ('V', 'H'): 1, ('V', 'K'): 1, ('V', 'M'): 1, ('V', 'N'): 1, ('V', 'R'): 1, ('V', 'S'): 1, ('V', 'T'): 0, ('V', 'V'): 1, ('V', 'W'): 1, ('V', 'Y'): 1, ('W', '-'): 0, ('W', '.'): 0, ('W', 'A'): 1, ('W', 'B'): 1, ('W', 'C'): 0, ('W', 'D'): 1, ('W', 'G'): 0, ('W', 'H'): 1, ('W', 'K'): 1, ('W', 'M'): 1, ('W', 'N'): 1, ('W', 'R'): 1, ('W', 'S'): 0, ('W', 'T'): 1, ('W', 'V'): 1, ('W', 'W'): 1, ('W', 'Y'): 1, ('Y', '-'): 0, ('Y', '.'): 0, ('Y', 'A'): 0, ('Y', 'B'): 1, ('Y', 'C'): 1, ('Y', 'D'): 1, ('Y', 'G'): 0, ('Y', 'H'): 1, ('Y', 'K'): 1, ('Y', 'M'): 1, ('Y', 'N'): 1, ('Y', 'R'): 0, ('Y', 'S'): 1, ('Y', 'T'): 1, ('Y', 'V'): 1, ('Y', 'W'): 1, ('Y', 'Y'): 1})

Performs a simple fixed position alignment of primers

Parameters:
  • seq_record – a SeqRecord object to align primers against
  • primers – dictionary of {names: short IUPAC ambiguous sequence strings}
  • start – position where primer alignment starts
  • rev_primer – if True align with the tail end of the sequence
  • score_dict – optional dictionary of {(char1, char2): score} alignment scores
Returns:

primer alignment result object

Return type:

presto.Sequence.PrimerAlignment

presto.Sequence.scoreDNA(a, b, mask_score=None, gap_score=None)

Returns the score for a pair of IUPAC Ambiguous Nucleotide characters

Parameters:
  • a – First characters
  • b – Second character
  • n_score – Tuple of length two defining scores for all matches against an N character for (a, b), with the score for character (a) taking precedence; if None score symmetrically according to IUPAC character identity
  • gap_score – Tuple of length two defining score for all matches against a gap (-, .) character for (a, b), with the score for character (a) taking precedence; if None score symmetrically according to IUPAC character identity
Returns:

Score for the character pair

Return type:

int

presto.Sequence.scoreSeqPair(seq1, seq2, ignore_chars={}, score_dict={('-', '-'): 1, ('-', '.'): 1, ('-', 'A'): 0, ('-', 'B'): 0, ('-', 'C'): 0, ('-', 'D'): 0, ('-', 'G'): 0, ('-', 'H'): 0, ('-', 'K'): 0, ('-', 'M'): 0, ('-', 'N'): 0, ('-', 'R'): 0, ('-', 'S'): 0, ('-', 'T'): 0, ('-', 'V'): 0, ('-', 'W'): 0, ('-', 'Y'): 0, ('.', '-'): 1, ('.', '.'): 1, ('.', 'A'): 0, ('.', 'B'): 0, ('.', 'C'): 0, ('.', 'D'): 0, ('.', 'G'): 0, ('.', 'H'): 0, ('.', 'K'): 0, ('.', 'M'): 0, ('.', 'N'): 0, ('.', 'R'): 0, ('.', 'S'): 0, ('.', 'T'): 0, ('.', 'V'): 0, ('.', 'W'): 0, ('.', 'Y'): 0, ('A', '-'): 0, ('A', '.'): 0, ('A', 'A'): 1, ('A', 'B'): 0, ('A', 'C'): 0, ('A', 'D'): 1, ('A', 'G'): 0, ('A', 'H'): 1, ('A', 'K'): 0, ('A', 'M'): 1, ('A', 'N'): 1, ('A', 'R'): 1, ('A', 'S'): 0, ('A', 'T'): 0, ('A', 'V'): 1, ('A', 'W'): 1, ('A', 'Y'): 0, ('B', '-'): 0, ('B', '.'): 0, ('B', 'A'): 0, ('B', 'B'): 1, ('B', 'C'): 1, ('B', 'D'): 1, ('B', 'G'): 1, ('B', 'H'): 1, ('B', 'K'): 1, ('B', 'M'): 1, ('B', 'N'): 1, ('B', 'R'): 1, ('B', 'S'): 1, ('B', 'T'): 1, ('B', 'V'): 1, ('B', 'W'): 1, ('B', 'Y'): 1, ('C', '-'): 0, ('C', '.'): 0, ('C', 'A'): 0, ('C', 'B'): 1, ('C', 'C'): 1, ('C', 'D'): 0, ('C', 'G'): 0, ('C', 'H'): 1, ('C', 'K'): 0, ('C', 'M'): 1, ('C', 'N'): 1, ('C', 'R'): 0, ('C', 'S'): 1, ('C', 'T'): 0, ('C', 'V'): 1, ('C', 'W'): 0, ('C', 'Y'): 1, ('D', '-'): 0, ('D', '.'): 0, ('D', 'A'): 1, ('D', 'B'): 1, ('D', 'C'): 0, ('D', 'D'): 1, ('D', 'G'): 1, ('D', 'H'): 1, ('D', 'K'): 1, ('D', 'M'): 1, ('D', 'N'): 1, ('D', 'R'): 1, ('D', 'S'): 1, ('D', 'T'): 1, ('D', 'V'): 1, ('D', 'W'): 1, ('D', 'Y'): 1, ('G', '-'): 0, ('G', '.'): 0, ('G', 'A'): 0, ('G', 'B'): 1, ('G', 'C'): 0, ('G', 'D'): 1, ('G', 'G'): 1, ('G', 'H'): 0, ('G', 'K'): 1, ('G', 'M'): 0, ('G', 'N'): 1, ('G', 'R'): 1, ('G', 'S'): 1, ('G', 'T'): 0, ('G', 'V'): 1, ('G', 'W'): 0, ('G', 'Y'): 0, ('H', '-'): 0, ('H', '.'): 0, ('H', 'A'): 1, ('H', 'B'): 1, ('H', 'C'): 1, ('H', 'D'): 1, ('H', 'G'): 0, ('H', 'H'): 1, ('H', 'K'): 1, ('H', 'M'): 1, ('H', 'N'): 1, ('H', 'R'): 1, ('H', 'S'): 1, ('H', 'T'): 1, ('H', 'V'): 1, ('H', 'W'): 1, ('H', 'Y'): 1, ('K', '-'): 0, ('K', '.'): 0, ('K', 'A'): 0, ('K', 'B'): 1, ('K', 'C'): 0, ('K', 'D'): 1, ('K', 'G'): 1, ('K', 'H'): 1, ('K', 'K'): 1, ('K', 'M'): 0, ('K', 'N'): 1, ('K', 'R'): 1, ('K', 'S'): 1, ('K', 'T'): 1, ('K', 'V'): 1, ('K', 'W'): 1, ('K', 'Y'): 1, ('M', '-'): 0, ('M', '.'): 0, ('M', 'A'): 1, ('M', 'B'): 1, ('M', 'C'): 1, ('M', 'D'): 1, ('M', 'G'): 0, ('M', 'H'): 1, ('M', 'K'): 0, ('M', 'M'): 1, ('M', 'N'): 1, ('M', 'R'): 1, ('M', 'S'): 1, ('M', 'T'): 0, ('M', 'V'): 1, ('M', 'W'): 1, ('M', 'Y'): 1, ('N', '-'): 0, ('N', '.'): 0, ('N', 'A'): 1, ('N', 'B'): 1, ('N', 'C'): 1, ('N', 'D'): 1, ('N', 'G'): 1, ('N', 'H'): 1, ('N', 'K'): 1, ('N', 'M'): 1, ('N', 'N'): 1, ('N', 'R'): 1, ('N', 'S'): 1, ('N', 'T'): 1, ('N', 'V'): 1, ('N', 'W'): 1, ('N', 'Y'): 1, ('R', '-'): 0, ('R', '.'): 0, ('R', 'A'): 1, ('R', 'B'): 1, ('R', 'C'): 0, ('R', 'D'): 1, ('R', 'G'): 1, ('R', 'H'): 1, ('R', 'K'): 1, ('R', 'M'): 1, ('R', 'N'): 1, ('R', 'R'): 1, ('R', 'S'): 1, ('R', 'T'): 0, ('R', 'V'): 1, ('R', 'W'): 1, ('R', 'Y'): 0, ('S', '-'): 0, ('S', '.'): 0, ('S', 'A'): 0, ('S', 'B'): 1, ('S', 'C'): 1, ('S', 'D'): 1, ('S', 'G'): 1, ('S', 'H'): 1, ('S', 'K'): 1, ('S', 'M'): 1, ('S', 'N'): 1, ('S', 'R'): 1, ('S', 'S'): 1, ('S', 'T'): 0, ('S', 'V'): 1, ('S', 'W'): 0, ('S', 'Y'): 1, ('T', '-'): 0, ('T', '.'): 0, ('T', 'A'): 0, ('T', 'B'): 1, ('T', 'C'): 0, ('T', 'D'): 1, ('T', 'G'): 0, ('T', 'H'): 1, ('T', 'K'): 1, ('T', 'M'): 0, ('T', 'N'): 1, ('T', 'R'): 0, ('T', 'S'): 0, ('T', 'T'): 1, ('T', 'V'): 0, ('T', 'W'): 1, ('T', 'Y'): 1, ('V', '-'): 0, ('V', '.'): 0, ('V', 'A'): 1, ('V', 'B'): 1, ('V', 'C'): 1, ('V', 'D'): 1, ('V', 'G'): 1, ('V', 'H'): 1, ('V', 'K'): 1, ('V', 'M'): 1, ('V', 'N'): 1, ('V', 'R'): 1, ('V', 'S'): 1, ('V', 'T'): 0, ('V', 'V'): 1, ('V', 'W'): 1, ('V', 'Y'): 1, ('W', '-'): 0, ('W', '.'): 0, ('W', 'A'): 1, ('W', 'B'): 1, ('W', 'C'): 0, ('W', 'D'): 1, ('W', 'G'): 0, ('W', 'H'): 1, ('W', 'K'): 1, ('W', 'M'): 1, ('W', 'N'): 1, ('W', 'R'): 1, ('W', 'S'): 0, ('W', 'T'): 1, ('W', 'V'): 1, ('W', 'W'): 1, ('W', 'Y'): 1, ('Y', '-'): 0, ('Y', '.'): 0, ('Y', 'A'): 0, ('Y', 'B'): 1, ('Y', 'C'): 1, ('Y', 'D'): 1, ('Y', 'G'): 0, ('Y', 'H'): 1, ('Y', 'K'): 1, ('Y', 'M'): 1, ('Y', 'N'): 1, ('Y', 'R'): 0, ('Y', 'S'): 1, ('Y', 'T'): 1, ('Y', 'V'): 1, ('Y', 'W'): 1, ('Y', 'Y'): 1})

Determine the error rate for a pair of sequences

Parameters:
  • seq1 – SeqRecord object
  • seq2 – SeqRecord object
  • ignore_chars – Set of characters to ignore when scoring and counting the weight
  • score_dict – Optional dictionary of alignment scores
Returns:

Tuple of the (score, minimum weight, error rate) for the pair of sequences

Return type:

Tuple

presto.Sequence.sequentialAssembly(head_seq, tail_seq, ref_dict, ref_db, alpha=1e-05, max_error=0.3, min_len=8, max_len=1000, scan_reverse=False, min_ident=0.5, evalue=1e-05, max_hits=100, fill=False, aligner='usearch', aligner_exec='usearch', assembly_stats=None, score_dict={('-', '-'): 0, ('-', '.'): 0, ('-', 'A'): 0, ('-', 'B'): 0, ('-', 'C'): 0, ('-', 'D'): 0, ('-', 'G'): 0, ('-', 'H'): 0, ('-', 'K'): 0, ('-', 'M'): 0, ('-', 'N'): 0, ('-', 'R'): 0, ('-', 'S'): 0, ('-', 'T'): 0, ('-', 'V'): 0, ('-', 'W'): 0, ('-', 'Y'): 0, ('.', '-'): 0, ('.', '.'): 0, ('.', 'A'): 0, ('.', 'B'): 0, ('.', 'C'): 0, ('.', 'D'): 0, ('.', 'G'): 0, ('.', 'H'): 0, ('.', 'K'): 0, ('.', 'M'): 0, ('.', 'N'): 0, ('.', 'R'): 0, ('.', 'S'): 0, ('.', 'T'): 0, ('.', 'V'): 0, ('.', 'W'): 0, ('.', 'Y'): 0, ('A', '-'): 0, ('A', '.'): 0, ('A', 'A'): 1, ('A', 'B'): 0, ('A', 'C'): 0, ('A', 'D'): 1, ('A', 'G'): 0, ('A', 'H'): 1, ('A', 'K'): 0, ('A', 'M'): 1, ('A', 'N'): 1, ('A', 'R'): 1, ('A', 'S'): 0, ('A', 'T'): 0, ('A', 'V'): 1, ('A', 'W'): 1, ('A', 'Y'): 0, ('B', '-'): 0, ('B', '.'): 0, ('B', 'A'): 0, ('B', 'B'): 1, ('B', 'C'): 1, ('B', 'D'): 1, ('B', 'G'): 1, ('B', 'H'): 1, ('B', 'K'): 1, ('B', 'M'): 1, ('B', 'N'): 1, ('B', 'R'): 1, ('B', 'S'): 1, ('B', 'T'): 1, ('B', 'V'): 1, ('B', 'W'): 1, ('B', 'Y'): 1, ('C', '-'): 0, ('C', '.'): 0, ('C', 'A'): 0, ('C', 'B'): 1, ('C', 'C'): 1, ('C', 'D'): 0, ('C', 'G'): 0, ('C', 'H'): 1, ('C', 'K'): 0, ('C', 'M'): 1, ('C', 'N'): 1, ('C', 'R'): 0, ('C', 'S'): 1, ('C', 'T'): 0, ('C', 'V'): 1, ('C', 'W'): 0, ('C', 'Y'): 1, ('D', '-'): 0, ('D', '.'): 0, ('D', 'A'): 1, ('D', 'B'): 1, ('D', 'C'): 0, ('D', 'D'): 1, ('D', 'G'): 1, ('D', 'H'): 1, ('D', 'K'): 1, ('D', 'M'): 1, ('D', 'N'): 1, ('D', 'R'): 1, ('D', 'S'): 1, ('D', 'T'): 1, ('D', 'V'): 1, ('D', 'W'): 1, ('D', 'Y'): 1, ('G', '-'): 0, ('G', '.'): 0, ('G', 'A'): 0, ('G', 'B'): 1, ('G', 'C'): 0, ('G', 'D'): 1, ('G', 'G'): 1, ('G', 'H'): 0, ('G', 'K'): 1, ('G', 'M'): 0, ('G', 'N'): 1, ('G', 'R'): 1, ('G', 'S'): 1, ('G', 'T'): 0, ('G', 'V'): 1, ('G', 'W'): 0, ('G', 'Y'): 0, ('H', '-'): 0, ('H', '.'): 0, ('H', 'A'): 1, ('H', 'B'): 1, ('H', 'C'): 1, ('H', 'D'): 1, ('H', 'G'): 0, ('H', 'H'): 1, ('H', 'K'): 1, ('H', 'M'): 1, ('H', 'N'): 1, ('H', 'R'): 1, ('H', 'S'): 1, ('H', 'T'): 1, ('H', 'V'): 1, ('H', 'W'): 1, ('H', 'Y'): 1, ('K', '-'): 0, ('K', '.'): 0, ('K', 'A'): 0, ('K', 'B'): 1, ('K', 'C'): 0, ('K', 'D'): 1, ('K', 'G'): 1, ('K', 'H'): 1, ('K', 'K'): 1, ('K', 'M'): 0, ('K', 'N'): 1, ('K', 'R'): 1, ('K', 'S'): 1, ('K', 'T'): 1, ('K', 'V'): 1, ('K', 'W'): 1, ('K', 'Y'): 1, ('M', '-'): 0, ('M', '.'): 0, ('M', 'A'): 1, ('M', 'B'): 1, ('M', 'C'): 1, ('M', 'D'): 1, ('M', 'G'): 0, ('M', 'H'): 1, ('M', 'K'): 0, ('M', 'M'): 1, ('M', 'N'): 1, ('M', 'R'): 1, ('M', 'S'): 1, ('M', 'T'): 0, ('M', 'V'): 1, ('M', 'W'): 1, ('M', 'Y'): 1, ('N', '-'): 1, ('N', '.'): 1, ('N', 'A'): 1, ('N', 'B'): 1, ('N', 'C'): 1, ('N', 'D'): 1, ('N', 'G'): 1, ('N', 'H'): 1, ('N', 'K'): 1, ('N', 'M'): 1, ('N', 'N'): 1, ('N', 'R'): 1, ('N', 'S'): 1, ('N', 'T'): 1, ('N', 'V'): 1, ('N', 'W'): 1, ('N', 'Y'): 1, ('R', '-'): 0, ('R', '.'): 0, ('R', 'A'): 1, ('R', 'B'): 1, ('R', 'C'): 0, ('R', 'D'): 1, ('R', 'G'): 1, ('R', 'H'): 1, ('R', 'K'): 1, ('R', 'M'): 1, ('R', 'N'): 1, ('R', 'R'): 1, ('R', 'S'): 1, ('R', 'T'): 0, ('R', 'V'): 1, ('R', 'W'): 1, ('R', 'Y'): 0, ('S', '-'): 0, ('S', '.'): 0, ('S', 'A'): 0, ('S', 'B'): 1, ('S', 'C'): 1, ('S', 'D'): 1, ('S', 'G'): 1, ('S', 'H'): 1, ('S', 'K'): 1, ('S', 'M'): 1, ('S', 'N'): 1, ('S', 'R'): 1, ('S', 'S'): 1, ('S', 'T'): 0, ('S', 'V'): 1, ('S', 'W'): 0, ('S', 'Y'): 1, ('T', '-'): 0, ('T', '.'): 0, ('T', 'A'): 0, ('T', 'B'): 1, ('T', 'C'): 0, ('T', 'D'): 1, ('T', 'G'): 0, ('T', 'H'): 1, ('T', 'K'): 1, ('T', 'M'): 0, ('T', 'N'): 1, ('T', 'R'): 0, ('T', 'S'): 0, ('T', 'T'): 1, ('T', 'V'): 0, ('T', 'W'): 1, ('T', 'Y'): 1, ('V', '-'): 0, ('V', '.'): 0, ('V', 'A'): 1, ('V', 'B'): 1, ('V', 'C'): 1, ('V', 'D'): 1, ('V', 'G'): 1, ('V', 'H'): 1, ('V', 'K'): 1, ('V', 'M'): 1, ('V', 'N'): 1, ('V', 'R'): 1, ('V', 'S'): 1, ('V', 'T'): 0, ('V', 'V'): 1, ('V', 'W'): 1, ('V', 'Y'): 1, ('W', '-'): 0, ('W', '.'): 0, ('W', 'A'): 1, ('W', 'B'): 1, ('W', 'C'): 0, ('W', 'D'): 1, ('W', 'G'): 0, ('W', 'H'): 1, ('W', 'K'): 1, ('W', 'M'): 1, ('W', 'N'): 1, ('W', 'R'): 1, ('W', 'S'): 0, ('W', 'T'): 1, ('W', 'V'): 1, ('W', 'W'): 1, ('W', 'Y'): 1, ('Y', '-'): 0, ('Y', '.'): 0, ('Y', 'A'): 0, ('Y', 'B'): 1, ('Y', 'C'): 1, ('Y', 'D'): 1, ('Y', 'G'): 0, ('Y', 'H'): 1, ('Y', 'K'): 1, ('Y', 'M'): 1, ('Y', 'N'): 1, ('Y', 'R'): 0, ('Y', 'S'): 1, ('Y', 'T'): 1, ('Y', 'V'): 1, ('Y', 'W'): 1, ('Y', 'Y'): 1})

Stitches sequences together by first attempting de novo assembly then falling back to reference guided assembly

Parameters:
  • head_seq – the head SeqRecord
  • head_seq – the tail SeqRecord
  • ref_dict – a dictionary of reference SeqRecord objects
  • ref_db – the path and name of the reference database
  • alpha – the minimum p-value for a valid de novo assembly
  • max_error – the maximum error rate for a valid de novo assembly
  • min_len – minimum length of overlap to test for de novo assembly
  • max_len – maximum length of overlap to test for de novo assembly
  • scan_reverse – if True allow the head sequence to overhang the end of the tail sequence in de novo assembly if False end alignment scan at end of tail sequence or start of head sequence
  • min_ident – the minimum identity for a valid reference guided assembly
  • evalue – the E-value cut-off for reference guided assembly
  • max_hits – the maxhits output limit for reference guided assembly
  • fill – if False non-overlapping regions will be assigned Ns in reference guided assembly; if True non-overlapping regions will be filled with the reference sequence.
  • aligner – the alignment tool; one of ‘blastn’ or ‘usearch’
  • aligner_exec – the path to the alignment tool executable
  • assembly_stats – optional successes by trials numpy.array of p-values
  • score_dict – optional dictionary of character scores in the form {(char1, char2): score}.
Returns:

assembled sequence object.

Return type:

AssemblyRecord

presto.Sequence.subsetSeqIndex(seq_dict, field, values, delimiter=('|', '=', ', '))

Subsets a sequence set by annotation value

Parameters:
  • seq_dict – Dictionary index of sequences returned from SeqIO.index()
  • field – Annotation field to select keys by
  • values – List of annotation values that define the retained keys
  • delimiter – Tuple of delimiters for (annotations, field/values, value lists)
Returns:

List of keys

Return type:

list

presto.Sequence.subsetSeqSet(seq_iter, field, values, delimiter=('|', '=', ', '))

Subsets a sequence set by annotation value

Parameters:
  • seq_iter – Iterator or list of SeqRecord objects
  • field – Annotation field to select by
  • values – List of annotation values that define the retained sequences
  • delimiter – Tuple of delimiters for (annotations, field/values, value lists)
Returns:

Modified list of SeqRecord objects

Return type:

list

presto.Sequence.translateAmbigDNA(key)

Translates IUPAC Ambiguous Nucleotide characters to or from character sets

Parameters:key – String or re.search object containing the character set to translate
Returns:Character translation
Return type:str
presto.Sequence.trimQuality(data, min_qual=0, window=10, reverse=False)

Cuts sequences using a moving mean quality score

Parameters:
  • data (SeqData) – a SeqData object with a single SeqRecord to process.
  • min_qual (int) – minimum mean quality to define a cut point.
  • window (int) – nucleotide window size.
  • reverse (bool) – if True cut the head of the sequence; if False cut the tail of the sequence
Returns:

SeqResult object.

Return type:

SeqResult

presto.Sequence.weightSeq(seq, ignore_chars={})

Returns the length of a sequencing excluding ignored characters

Parameters:
  • seq – SeqRecord or Seq object
  • ignore_chars – Set of characters to ignore when counting sequence length
Returns:

Sum of the character scores for the sequence

Return type:

int

Release Notes

Version 0.5.13: August 29, 2019

  • Fixed .ix pandas deprecation warning.

Version 0.5.12: August 5, 2019

  • Increased pandas version requirement to v0.24+ with compatibility fixes for pandas v0.24.0.

Version 0.5.11: January 30, 2019

  • Slightly changed version number display in commandline help.

ClusterSets:

  • Removed the --log and --failed arguments from the all and barcode subcommands because they do nothing.
  • Added the --length argument to all subcommands which defines the ratio of minimum sequence lengths allowed in a cluster.
  • Added error handling for when --ident is less than recommended value.
  • Increased maximum memory allocation for cd-hit-est to 3GB.

Version 0.5.10: October 19, 2018

  • Documentation added for UMI error correction.
  • Added IO.printDebug to the API.

EstimateError:

  • Added the subcommands set and barcode where set is the previous error estimation method (using UMI read groups). The new barcode method generates pairwise Hamming distance distributions for sequences containied in header annotations (typically UMI barcode sequences).
  • Added distance and threshold output files containing pairwise distance histograms and clustering thresholds for use with the corresponding subcommands in ClusterSets.
  • Increased default minimum sequence count in the set subcommand to 20.

Version 0.5.9: September 2, 2018

  • Add the -o argument to most tools, which allows explicit declaration of the output file name.
  • Added IO.printWarning and IO.printError API functions for handling standard error messaging.
  • Split IO.printProgress API function into IO.printProgress (percentage) and IO.printCount (raw counts).
  • Moved a significant number of functions and classes from the executable scripts into the API.

MaskPrimers:

  • Removed support for the regex primer file format.

AssemblePairs:

  • Changed default of the --rc argument to tail and added none option for previous default.

Version 0.5.8: July 13, 2018

  • Fixed installation incompatibility with pip 10.

ClusterSets:

  • Added support for CD-HIT.

EstimateError:

  • Significant performance improvements.

Version 0.5.7: March 19, 2018

BuildConsensus:

  • Fixed an error wherein the program would exit if all sequences in an UMI read group had a Phred quality score of 0 in a given position.

ConvertHeaders:

  • Added support for EMBL-EBI ENA header format to the sra subcommand.

MaskPrimers:

  • Added extract subcommand which will remove/annotation subsequences in fixed position without requiring a primer sequence match.
  • Added --pf and --bf arguments to all subcommands allowing renaming of output PRIMER and BARCODE fields, respectively.
  • Removed SEQORIENT output field from score subcommand as the mode does not check the reverse complement.

PairSeq:

  • Added the --act argument to provide a mechanism for collapsing values of duplicate fields copied across files.

ParseHeaders:

  • Added merge subcommand to combined separate annotations into a single entry.

Version 0.5.6: January 17, 2018

CollapseSeq:

  • Fixed a bug causing copy fields (--cf argument) to be processed incorrectly.

UnifyHeaders:

  • Improved reporting in log file.

Version 0.5.5: December 26, 2017

AssemblePairs:

  • Fixed a bug that caused the align subcommand to error if input sequences where shorter than the minimum specified by the –minlen argument. It will now simply fail such sequences.

ClusterSets:

  • Moved functionality of previous ClusterSets command into the set subcommand.
  • Added the all subcommand to cluster all sequences without considering annotation groups.
  • Added the barcode subcommand which allows for clustering of reads based on a barcode sequence instead of the read data.
  • Renamed -id argument to --ident for consistency with AssemblePairs.

CollapseSeq:

  • Fixed a bug wherein CollapseSeq would match partial sequences against longer sequences that were otherwise identical up until the missing end characters.
  • Added detailed log output.

EstimateError:

  • Fixed a division by zero warning in the log output when there were no observed mismatches.

UnifyHeaders:

  • New tool to generate consensus annotations or filter reads based on annotation groupings.

Version 0.5.4: July 1, 2017

  • All tools will now print detailed help if no arguments are provided.

AlignSets:

  • Fixed a typo in the console log of the muscle subcommand.

ConvertHeaders:

  • Added the migec subcommand to convert headers from the MIGEC tool.

EstimateError:

  • Fixed a division by zero error when there were no observed mismatches.
  • Bounded error rate to a minimum of 10^-9 (Q=90).

Version 0.5.3: February 14, 2017

License changed to Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).

AssemblePairs:

  • Changed the behavior of the --failed argument so that failed output are in the same orientation as the input sequences. Meaning, the --rc argument is ignored for failed output.
  • Added the sequential subcommand which will first attempt de novo assembly (align subcommand) following by reference guided assembly (reference subcommand) if de novo assembly fails.
  • Added blastn compatibility to reference subcommand.
  • Added the option --aligner to the reference subcommand to allow use of either blastn or usearch for performing the local alignment. Defaults to the usearch algorithm used in previous releases.
  • Added the option --dbexec to the reference subcommand to allow specification of the reference database build tool (eg, makeblastdb).
  • Changed masking behavior to none and word length to 9 in reference subcommand when using usearch as the aligner.
  • Internal modifications to the reference subcommand to rebuild the database before alignments for performance reasons.
  • Fixed a deprecation warning appearing with newer versions of numpy.

BuildConsensus:

  • Fixed a bug in the read group error rate calculation wherein either a consensus sequence or read group that was completely N characters would cause the program to exit with a division by zero error. Now, such non-informative read groups will be assigned an error rate of 1.0.

ClusterSets:

  • Added vsearch compatibility.
  • Fixed a bug wherein sets containing empty sequences were being fed to usearch, rather than automatically failed, which would cause usearch v8 to hang indefinitely.
  • Fixed an incompatibility with usearch v9 due to changes in the way usearch outputs sequence labels.
  • Changed masking behavior of usearch to none.
  • Changed how gaps are handling before passing sequences to usearch. Gaps are now masked (with Ns) for clustering, instead of removed.

EstimateError:

  • Fixed a fatal error with newer versions of pandas.

SplitSeq:

  • Added the select subcommand, which allows filtering of sequences based on annotation value matches or mismatches.
  • Altered the behavior of the -u argument for both the sample and samplepair subcommands. If -u is specified, sampling is performed as in previous versions wherein samples will be drawn from only fields with the specified annotation values up to n total reads. However, if -u is not specified with -f repeated sampling will now be performed for each unique annotation value in the specified field, generating output with up to n reads per unique annotation value.

Version 0.5.2: March 8, 2016

Fixed a bug with installation on Windows due to old file paths lingering in presto.egg-info/SOURCES.txt.

Improvements to commandline usage help messages.

Updated license from CC BY-NC-SA 3.0 to CC BY-NC-SA 4.0.

AssemblePairs:

  • Added the flag --fill to the reference subcommand to allow insertion of the reference sequence into the non-overlapping region of assembled sequences. Use caution when using this flag, as this may lead to chimeric sequences.
  • Changed default --minlen to 8 in align subcommand.

Version 0.5.1: December 4, 2015

ClusterSets:

  • Fixed bug wherein --failed flag did not work.

Version 0.5.0: September 7, 2015

Conversion to a proper Python package which uses pip and setuptools for installation.

The package now requires Python 3.4. Python 2.7 is not longer supported.

The required dependency versions have been bumped to numpy 1.8, scipy 0.14, pandas 0.15, and biopython 1.65.

IgCore:

  • Divided IgCore functionality into the separate modules: Annotation, Commandline, Defaults, IO, Multiprocessing and Sequence.

Version 0.4.8: September 7, 2015

Added support for additional input FASTA (.fna, .fa), FASTQ (.fq) and tab-delimited (.tsv) file extensions.

ParseHeaders:

  • Fixed a bug in the rename subcommand wherein renaming to an existing field deleted the old annotation, but did not merge the renamed annotation into the existing field.
  • Added the copy subcommand which will copy annotations into new field names or merge the annotations of existing fields.
  • Added the --act argument to the copy and rename subcommands allowing collapse following the copy or rename operation.
  • Added a commandline check to ensure that the -f, -k and --act arguments contain the same number of fields for both the rename and copy subcommands.

Version 0.4.7: June 5, 2015

IgCore:

  • Modified scoring functions to permit asymmetrical scores for N and gap characters.

AssemblePairs:

  • Added support for SRA style coordinate information where the where the read number has been appended to the spot number.
  • Altered scoring so gap characters are counted as mismatches in the error rate and identity calculations.

BuildConsensus:

  • Altered scoring so gap characters are counted as mismatches in the diversity and error rate calculations.

ConvertHeaders:

  • Added support for SRA style sequence headers where the read number has been appended to the spot number; eg, output from fastq-dump -I --split-files file.sra.

ClusterSets:

  • Added missing OUTPUT console log field.
  • Changed --bf and --cf arguments to -f and -k, respectively.

MaskPrimers:

  • Altering scoring behavior for N characters such that Ns in the input sequence are always counted as a mismatch, while Ns in the primer sequence are counted as a match, with priority given to the input sequence score.
  • Added --gap argument to the align subcommand which allows users to specify the gap open and gap extension penalties for aligning primers. Note: gap penalties reduce the match count for purposes of calculating ERROR.

PairSeq:

  • Added support for SRA style coordinate information where the where the read number has been appended to the spot number.

Version 0.4.6: May 13, 2015

BuildConsensus:

  • Changed --maxmiss argument to --maxgap and altered the behavior to only perform deletion of positions based on gap characters (only “-” or “.” and not “N” characters).
  • Added an error rate (--maxerror) calculation based on mismatches from consensus. The --maxerror argument is mutually exclusive with the --maxdiv argument and provides similar functionality. However, the calculations are not equivalent, and --maxerror should be considerably faster than --maxdiv.
  • Added exclusion of positions from the error rate calculation that are deleted due to exceeding the --maxgap threshold .
  • Fixed misalignment of consensus sequence against input sequences when positions are deleted due to exceeding the --maxgap threshold.

ClusterSets:

  • New script to cluster read groups by barcode field (eg, UID barcodes) into clustering within the read group.

ConvertHeaders:

  • New script to handle conversion of different sequence description formats to the pRESTO format.

FilterSeq:

  • Added count of masked characters to log output of maskqual subcommand.
  • Changed repeats subcommand log field REPEAT to REPEATS.

PairSeq:

  • Changed -f argument to --1f argument.
  • Added --2f argument to copy file 2 annotations to file 1.

ParseHeaders:

  • Moved convert subcommand to the generic subcommand of the new ConvertHeaders script and modified the conversion behavior.

Version 0.4.5: March 20, 2015

Added details to the usage documentation for each tool which describes both the output files and annotation fields.

Renamed --clean argument to --failed argument with opposite behavior, such that the default behavior of all scripts is now clean output.

IgCore:

  • Features added for Change-O compatibility.
  • Features added for PairSeq performance improvements.
  • Added custom help formatter.
  • Modifications to internals of multiprocessing code.
  • Fixed a few typos in error messages.

AssemblePairs:

  • Added reference subcommand which uses V-region germline alignments from ublast to assemble paired-ends.
  • Removed mate-pair matching operation to increase performance. Now requires both input files to contain matched and uniformly ordered reads. If files are not synchronized then PairSeq must be run first. AssemblePairs will check that coordinate info matches and error if the files are not synchronized. Unpaired reads are no longer output.
  • Added support for cases where one mate pair is the subsequence of the other.
  • Added --scanrev argument to allow for head sequence to overhang end of tail.
  • Removed truncated (quick) error calculation in align subcommand.
  • Changed default values of the --maxerror and --alpha arguments of the align subcommand to better tuned parameters.
  • Changed internal selection of top scoring alignment to use Z-score approximation rather than a combination of error rate and binomial mid-p value.
  • Internal changes to multiprocessing structure.
  • Changed inserted gap character from - to N in join subcommand for better compatibility with the behavior of IMGT/HighV-QUEST.
  • Changed PVAL log field to PVALUE.
  • Changed HEADSEQ and TAILSEQ log fields to SEQ1 and SEQ2.
  • Changed HEADFIELDS and TAILFIELDS log fields to FIELDS1 and FIELDS2.
  • Changed precision of ERROR and PVALUE log fields.
  • Added more verbose logging.

BuildConsensus:

  • Fixed bug where low quality positions where not being masked in single sequence barcode groups.
  • Added copy field (--cf) and copy action (--act) arguments to generate consensus annotations for barcode read groups.
  • Changed maximum consensus quality score from 93 to 90.

CollapseSeq:

  • Added --keep argument to allow retention of sequences with high missing character counts in unique sequence output file.
  • Removed case insensitivity for performance reasons. Now requires all sequences to have matching case.
  • Removed first and last from --act choices to avoid unexpected behavior.

MaskPrimers:

  • Changed behavior of N characters in primer identification. Ns now count as a match against any character, rather than a mismatch.
  • Changed behavior of mask mode such that positions masked with Ns are now assigned quality scores of 0, rather than retaining their previous scores.
  • Fixed a bug with the align subcommand where deletions within the input sequence (gaps in the alignment) were causing an incorrect barcode start position.

PairSeq:

  • Performance improvements. The tool should now be considerably faster on very large files.
  • Specifying the --failed argument to request output of sequences which do not have a mate pair will increase run time and memory usage.

ParseHeaders:

  • Add ‘cat’ action to collapse subcommand which concatenates strings into a single annotation.

SplitSeq:

  • Removed --clean (and --failed) flag from all subcommands.
  • Added progress updates to sample and samplepair subcommands.
  • Performance improvements to samplepair subcommand.

Version 0.4.4: June 10, 2014

SplitSeq:

  • Removed a linux-specific dependency, allowing SplitSeq to work on Windows.

Version 0.4.3: April 7, 2014

CollapseSeq:

  • Fixed bug that occurs with Python 2.7.5 on OS X.

SplitSeq:

  • Fixed bug in samplepairs subcommand that occurs with Python 2.7.5 on OS X.

Version 0.4.2: March 20, 2014

Increased verbosity of exception reporting.

IgCore:

  • Updates to consensus functions to support changes to BuildConsensus.

AssemblePairs:

  • Set default alpha to 0.01.

BuildConsensus:

  • Added support for --freq value parameter to quality consensus method and set default value to 0.6.
  • Fixed a bug in the frequency consensus method where missing values were contributing to the total character count at each position.
  • Added the parameter --maxmiss value which provides a cut-off for removal of positions with too many N or gap characters .

MaskPrimers:

  • Renamed the --reverse parameter to --revpr.

SplitSeq:

  • Removed convert subcommand.

Version 0.4.1: January 27, 2014

Changes to the internals of multiple tools to provide support for multiprocessing in Windows environments.

Changes to the internals of multiple tools to provide clean exit of child processes upon kill signal or exception in sibling process.

Fixed unexpected behavior of --outname and --log arguments with multiple input files.

IgCore:

  • Added reporting of unknown exceptions when reading sequence files
  • Fixed scoring of lowercase sequences.

AlignSets:

  • Fixed a typo in the log output.

BuildConsensus:

  • Fixed a typo in the log output.

EstimateError:

  • Fixed bug where tool would improperly exit if no sets passed threshold criteria.
  • Fixed typo in console output.

MaskPrimers:

  • Added trim mode which will cut the region before primer alignment, but leave primer region unmodified.
  • Fixed a bug with lowercase sequence data.
  • Fixed bug in the console and log output.
  • Added support for primer matching when setting --maxerr 1.0.

ParseHeaders:

  • Added count of sequences without any valid fields (FAIL) to console output.

ParseLog:

  • Added count of records without any valid fields (FAIL) to console output.

SplitSeq:

  • Fixed typo in console output of samplepair subcommand.
  • Added increase of the open file limit to the group subcommand to allow for a large number of groups.

Version 0.4.0: September 30, 2013

Minor name changes were made to multiple scripts, functions, parameters, and output files.

AlignSets, AssemblePairs, BuildConsensus, EstimateError, FilterSeq, and MaskPrimers are now multithreaded. The number of simultaneous processes may be specified using --nproc value. Note this means file ordering is no longer preserved between the input and output sequence files.

Performance improvements were made to several tools.

The universal --verbose parameter was replaced with --log file_name which specifies a log file for verbose output, and disables verbose logging if not specified.

The report of input parameters and sequence counts is now separate from the log and is always printed to standard output.

Added a progress bar to the standard output of most tools.

Added a universal --outname file_prefix parameter which changes the leading portion of the output file name. If not specified, the current file name is used (excluding the file extension, as per the previous behavior).

Added a universal --clean parameter which if specified forces the tool not to create an output file of sequences which failed processing.

IgCore:

  • Changes to parameters and internals of multiple functions.
  • Added functions to support multithreading for single-end reads, paired-end reads, and barcode sets.
  • Added safe annotation field renaming.
  • Added progress bar, logging and output file name conversion support.
  • Moved reusable AssemblePairs, BuildConsensus, PairSeq, and SplitSeq. operations into IgCore.

AssemblePairs:

  • Coordinate information is now specified by a coordinate type, rather than a delimiter, using the --coord header_type parameter, where the header type may be one of illumina, solexa, sra, 454, presto.

CollapseSeq:

  • Sequences with a missing character count exceeding the user limit defined by -n maximum_missing_count are now exported to a separate collapse-undetermined output file, rather than included in the collapse-unique sequence output.

EstimateError:

  • Now outputs error estimations for positions, quality scores, nucleotide pairs, and annotation sets.
  • Machine reported quality scores and empirical quality scores have been added to all output tables.

FilterSeq:

  • Added length subcommand to filter sequences by minimum length.

PairSeq:

  • Coordinate information has been redefined as per AssemblePairs.

ParseHeaders:

  • Added new subcommand convert which attempts to reformat sequence headers into the pRESTO format.
  • The rename subcommand will now append entries if the new field name already exists in the sequence header, rather than replace the entry.

Version 0.3 (prerelease 6): August 13, 2013

Toolkit is now dependent upon pandas 0.12 for the estimateError tool.

alignSets:

  • Changed MUSCLE execution to faster settings (-diags, -maxiters 2).

filterQuality:

  • Added repeat subcommand to filter sequences with -n (value) repetitions of a single character and.
  • Changed -n parameter of ambig subcommand from fractional value to a raw count.

estimateError:

  • New tool which estimates error of sequence sets by comparison to a consensus.

maskPrimers:

  • Bug fixes to alignment position calculation of align subcommand when primer alignment begins before start of sequence.
  • Removed --ann parameter.

Version 0.3 (prerelease 5): August 7, 2013

License changed to Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

IgPipeline Core:

  • Bug fixes to diversity calculation.
  • Added support for files where all sequences do not share the same annotation fields.
  • Added support for alternate scoring of gap and N-valued nucleotides.

alignSets:

  • Added --mode parameter with options of pad and cut to specify whether to extend or trim read groups to the same start position.
  • Fixed intermittent ‘muscle’ subcommand stdout pipe deadlock when executing MUSCLE.

assemblePairs:

  • Added join subcommand to support library preps where paired-end reads do not overlap.
  • Speed improvements to p-value calculations.

buildConsensus:

  • --div parameter converted to --maxdiv value to allow filtering of read groups by diversity.
  • Bug fixes to nucleotide frequency consensus method.
  • -q parameter renamed to --qual.

collapseSequences:

  • Added support for files where all sequences do not share the same annotation fields.

splitSeqFile:

  • samplepair subcommand added to allow random sampling from paired-end file sets.
  • The behavior of the -c parameter of the sample and samplepair subcommands changed to allow multiple samplings with the same command.

Version 0.3 (prerelease 4): May 18, 2013

Initial public prerelease

Contact

If you have questions you can email the Immcantation Group.

If you’ve discovered a bug or have a feature request, you can create an issue on Bitbucket using the Issue Tracker.

Citation

To cite pRESTO in publications please use:

Vander Heiden JA*, Yaari G*, Uduman M, Stern JNH, O’Connor KC, Hafler DA, Vigneault F, Kleinstein SH. pRESTO: a toolkit for processing high-throughput sequencing raw reads of lymphocyte receptor repertoires. Bioinformatics 2014; doi: 10.1093/bioinformatics/btu138

License

This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.