https://img.shields.io/pypi/dm/presto https://img.shields.io/static/v1?label=AIRR-C%20sw-tools%20v1&message=compliant&color=008AFF&labelColor=000000&style=plastic

pRESTO - The REpertoire Sequencing TOolkit

pRESTO is a toolkit for processing raw reads from high-throughput sequencing of B cell and T cell repertoires.

Dramatic improvements in high-throughput sequencing technologies now enable large-scale characterization of lymphocyte repertoires, defined as the collection of trans-membrane antigen-receptor proteins located on the surface of B cells and T cells. The REpertoire Sequencing TOolkit (pRESTO) is composed of a suite of utilities to handle all stages of sequence processing prior to germline segment assignment. pRESTO is designed to handle either single reads or paired-end reads. It includes features for quality control, primer masking, annotation of reads with sequence embedded barcodes, generation of unique molecular identifier (UMI) consensus sequences, assembly of paired-end reads and identification of duplicate sequences. Numerous options for sequence sorting, sampling and conversion operations are also included.

Overview

Scope and Features

pRESTO performs all stages of raw sequence processing prior to alignment against reference germline sequences. The toolkit is intended to be easy to use, but some familiarity with commandline applications is expected. Rather than providing a fixed solution to a small number of common workflows, we have designed pRESTO to be as flexible as possible. This design philosophy makes pRESTO suitable for many existing protocols and adaptable to future technologies, but requires users to construct a sequence of commands and options specific to their experimental protocol.

pRESTO is composed of a set of standalone tools to perform specific tasks, often with a series of subcommands providing different behaviors. A brief description of each tool is shown in the table below.

Tool

Subcommand

Description

AlignSets.py

Multiple aligns sets of sequences sharing the same annotation

muscle

Uses the program MUSCLE to align reads

offset

Uses a table of primer alignments to align the 5’ region

table

Creates a table of primer alignments for the offset subcommand

AssemblePairs.py

Assembles paired-end reads into a complete sequence

align

Assembles paired-end reads by aligning the sequence ends

join

Concatenates pair-end reads with intervening gaps

reference

Assembles paired-end reads using V-segment references

sequential

Attempt alignment assembly followed by reference assembly

BuildConsensus.py

Constructs UMI consensus sequences

ClusterSets.py

Clusters read groups

all

Cluster all sequences regardless of annotation

barcode

Cluster reads by clustering barcode sequences

set

Cluster reads by sequence data within barcode groups

CollapseSeq.py

Removes duplicate sequences

ConvertHeaders.py

Converts sequence headers to the pRESTO format

454

Converts Roche 454 sequence headers

genbank

Converts NCBI GenBank and RefSeq sequence headers

generic

Converts sequence headers with an unknown annotation system

illumina

Converts Illumina sequence headers

imgt

Converts sequence headers output by IMGT/GENE-DB

migec

Converts sequence headers output by MIGEC

sra

Converts NCBI SRA or EMBL-EBI ENA sequence headers

EstimateError.py

Estimates error rates for UMI data

barcode

Calculates pairwise distance metrics of barcode sequences

set

Estimates error statistics within annotation sets

FilterSeq.py

Removes or modifies low quality reads

length

Removes sequences under a defined length

maskqual

Masks low Phred quality score positions with Ns

missing

Removes sequences with a high number of Ns

quality

Removes sequences with low Phred quality scores

repeats

Removes sequences with long repeats of a single nucleotide

trimqual

Trims sequences to segments with high Phred quality scores

MaskPrimers.py

Identifies and removes primer regions, MIDs and UMI barcodes

align

Matches primers by local alignment and reorients sequences

extract

Removes and annotates a fixed sequence region

score

Matches primers at a fixed user-defined start position

PairSeq.py

Sorts paired-end reads and copies annotations between them

ParseHeaders.py

Manipulates sequence annotations

add

Adds a field and value annotation pair to all reads

collapse

Compresses a set of annotation fields into a single field

copy

Copies values between annotations fields

delete

Deletes an annotation from all reads

expand

Expands an field with multiple values into separate annotations

merge

Merge multiple annotations fields into a single field

rename

Rename annotation fields

table

Outputs sequence annotations as a data table

ParseLog.py

Converts the log output of pRESTO scripts into data tables

SplitSeq.py

Performs conversion, sorting, and subsetting of sequence files

count

Splits files into smaller files

group

Splits files based on numerical or categorical annotation

sample

Randomly samples sequences from a file

samplepair

Randomly samples paired-end reads from two files

select

Filters sequences based on annotations

sort

Sorts sequences based on annotations

UnifyHeaders

Unifies annotation fields based on grouping scheme

consensus

Reassign fields to consensus values

delete

Delete sequences with differing field values.

Input and Output

All tools take as input standard FASTA or FASTQ formatted files and output files in the same formats. This allows pRESTO to work seamlessly with other sequence processing tools that use either of these data formats; any steps within a pRESTO workflow can be exchanged for an alternate tool, if desired.

Each tool appends a specific suffix to its output files describing the step and output. For example, MaskPrimers will append _primers-pass to the output file containing successfully aligned sequences and _primers-fail to the file containing unaligned sequences.

See also

Details regarding the suffixes used by pRESTO tools can be found in the Commandline Usage documentation for each tool.

Annotation Scheme

The majority of pRESTO tools manipulate and add sequences-specific annotations as part of their processing functions using the scheme shown below. Each annotation is delimited using a reserved character (| by default), with the annotation field name and values separated by a second reserved character (= by default), and each value within a field is separated by a third reserved character (, by default). These annotations follow the sequence identifier, which itself immediately follows the > (FASTA) or @ (FASTQ) symbol denoting the beginning of a new sequence entry. The sequence identifier is given the reserved field name ID. To mitigate potential analysis errors, each tool in pRESTO annotates sequences by appending values to existing annotation fields when they exist, and will not overwrite or delete annotations unless explicitly performed using the ParseHeaders tool. All reserved characters can be redefined using the command line options.

FASTA Annotation
>SEQUENCE_ID|PRIMER=IgHV-6,IgHC-M|BARCODE=DAY7|DUPCOUNT=8
NNNNCCACGATTGGTGAAGCCCTCGCAGACCCTCTCACTCACCTGTGCCATCTCCGGGGACAGTGTTTCTACCAAAA
FASTQ Annotation
@SEQUENCE_ID|PRIMER=IgHV-6,IgHC-M|BARCODE=DAY7|DUPCOUNT=8
NNNNCCACGATTGGTGAAGCCCTCGCAGACCCTCTCACTCACCTGTGCCATCTCCGGGGACAGTGTTTCTACCAAAA
+
!!!!nmoomllmlooj\Xlnngookkikloommononnoonnomnnlomononoojlmmkiklonooooooooomoo

See also

  • Details regarding the annotations added by pRESTO tools can be found in the Commandline Usage documentation for each tool.

  • The ParseHeaders.py tool provides a number of options for manipulating annotations in the pRESTO format.

  • The ConvertHeaders.py tool allows you convert several common annotation schemes into the pRESTO annotation format.

Download

The latest stable release of pRESTO may be downloaded from PyPI or Bitbucket.

Development versions and source code are available on Bitbucket.

Installation

The simplest way to install the latest stable release of pRESTO is via pip:

> pip3 install presto --user

The current development build can be installed using pip and git in a similar fashion:

> pip3 install git+https://bitbucket.org/kleinstein/presto@master --user

If you currently have a development version installed, then you will likely need to add the arguments --upgrade --no-deps --force-reinstall to the pip3 command.

Requirements

Linux

  1. The simplest way to install all Python dependencies is to install the full SciPy stack using the instructions, then install Biopython according to its instructions.

  2. Download the pRESTO bundle and run:

    > pip3 install presto-x.y.z.tar.gz --user
    

Mac OS X

  1. Install Xcode, which is available from the Apple store or developer downloads.

  2. Older versions Mac OS X will require you to install XQuartz 2.7.5, which is available from the XQuartz project.

  3. Install Homebrew following the installation and post-installation instructions.

  4. Install Python 3.4.0+ and set the path to the python3 executable:

    > brew install python3
    > echo 'export PATH=/usr/local/bin:$PATH' >> ~/.profile
    
  5. Exit and reopen the terminal application so the PATH setting takes effect.

  6. You may, or may not, need to install gfortran (required for SciPy). Try without first, as this can take an hour to install and is not needed on newer releases. If you do need gfortran to install SciPy, you can install it using Homebrew:

    > brew install gfortran
    

    If the above fails run this instead:

    > brew install --env=std gfortran
    
  7. Install NumPy, SciPy, pandas and Biopython using the Python package manager:

    > pip3 install numpy scipy pandas biopython
    
  8. Download the pRESTO bundle, open a terminal window, change directories to download location, and run:

    > pip3 install presto-x.y.z.tar.gz
    

Windows

  1. Install Python 3.4.0+ from Python, selecting both the options ‘pip’ and ‘Add python.exe to Path’.

  2. Install NumPy, SciPy, pandas and Biopython using the packages available from the Unofficial Windows binary collection.

  3. Download the pRESTO bundle, open a Command Prompt, change directories to the download folder, and run:

    > pip install presto-x.y.z.tar.gz
    
  4. For a default installation of Python 3.4, the pRESTO scripts will be installed into C:\Python34\Scripts and should be directly executable from the Command Prompt. If this is not the case, then follow step 5 below.

  5. Add both the C:\Python34 and C:\Python34\Scripts directories to your %Path%. On Windows 7 the %Path% setting is located under Control Panel -> System and Security -> System -> Advanced System Settings -> Environment variables -> System variables -> Path.

  6. If you have trouble with the .py file associations, try adding .PY to your PATHEXT environment variable. Also, opening a command prompt as Administrator and run:

    > assoc .py=Python.File
    > ftype Python.File="C:\Python34\python.exe" "%1" %*
    

Release Notes

Version 0.7.1: October 2, 2022

  • Added allowance of U characters in nucleotide distance calculations with equivalence to T.

FilterSeq:

  • Changed average quality score calculation to use the probability calculation instead of a simple mean on the scores.

Version 0.7.0: October 28, 2021

  • Updated dependencies to biopython >= v1.77.

AssemblePairs:

  • Fixed a bug causing execution failures in the sequential and reference commands when passing multiple input file sets.

ClusterSets:

  • Added the --mem argument to control the memory allocation for cd-hit-est.

  • Updated the default usearch/vsearch clustering parameters for the barcode subcommand to only require 1 word match (see the minwordmatches argument to usearch/vsearch), as the default value of 12 was automatically excluding short sequence matches. The set subcommand retains the default minwordmatches value of 12.

EstimateError:

  • Fixed a bug in the set command causing an NaN to integer conversion error with some data sets.

  • Added the the --pad argument to the barcode command to control how to deal with truncated barcode sequences. Barcode sequences have to be identical in length, so --pad provides a mechanism to either pad the ends of short barcodes with N characters or exclude barcodes shorter than the maximum length barcode from the distance calculations.

MaskPrimers:

  • Will now modify the identifiers for primers specified via -p to avoid downstream issues parsing sequence headers. Whitespace and the default annotation delimiters (|=,) are now replaced with underscore characters (_).

Version 0.6.2: October 13, 2020

  • Updated to support Biopython v1.78.

Version 0.6.1: July 15, 2020

  • Fixed a bug in the blast wrappers caused by SeqIO.FastaIO.FastaWriter deprecation and increased the biopython dependency to v1.71.

Version 0.6.0: May 6, 2020

  • License changed to AGPL-3.

AssemblePairs:

  • Added support for older Illumina headers without an index read when specifying --coord illumina.

ConvertHeaders:

  • Added support for older Illumina headers without and index read to the illumina subcommand.

FilterSeq:

  • Fixed a bug in the quality subcommand that caused the program to exit when encountering sequences that are entirely N characters.

PairSeq:

  • Added support for older Illumina headers without an index read when specifying --coord illumina.

Version 0.5.13: August 29, 2019

  • Fixed .ix pandas deprecation warning.

Version 0.5.12: August 5, 2019

  • Increased pandas version requirement to v0.24+ with compatibility fixes for pandas v0.24.0.

Version 0.5.11: January 30, 2019

  • Slightly changed version number display in commandline help.

ClusterSets:

  • Removed the --log and --failed arguments from the all and barcode subcommands because they do nothing.

  • Added the --length argument to all subcommands which defines the ratio of minimum sequence lengths allowed in a cluster.

  • Added error handling for when --ident is less than recommended value.

  • Increased maximum memory allocation for cd-hit-est to 3GB.

Version 0.5.10: October 19, 2018

  • Documentation added for UMI error correction.

  • Added IO.printDebug to the API.

EstimateError:

  • Added the subcommands set and barcode where set is the previous error estimation method (using UMI read groups). The new barcode method generates pairwise Hamming distance distributions for sequences containied in header annotations (typically UMI barcode sequences).

  • Added distance and threshold output files containing pairwise distance histograms and clustering thresholds for use with the corresponding subcommands in ClusterSets.

  • Increased default minimum sequence count in the set subcommand to 20.

Version 0.5.9: September 2, 2018

  • Add the -o argument to most tools, which allows explicit declaration of the output file name.

  • Added IO.printWarning and IO.printError API functions for handling standard error messaging.

  • Split IO.printProgress API function into IO.printProgress (percentage) and IO.printCount (raw counts).

  • Moved a significant number of functions and classes from the executable scripts into the API.

MaskPrimers:

  • Removed support for the regex primer file format.

AssemblePairs:

  • Changed default of the --rc argument to tail and added none option for previous default.

Version 0.5.8: July 13, 2018

  • Fixed installation incompatibility with pip 10.

ClusterSets:

  • Added support for CD-HIT.

EstimateError:

  • Significant performance improvements.

Version 0.5.7: March 19, 2018

BuildConsensus:

  • Fixed an error wherein the program would exit if all sequences in an UMI read group had a Phred quality score of 0 in a given position.

ConvertHeaders:

  • Added support for EMBL-EBI ENA header format to the sra subcommand.

MaskPrimers:

  • Added extract subcommand which will remove/annotation subsequences in fixed position without requiring a primer sequence match.

  • Added --pf and --bf arguments to all subcommands allowing renaming of output PRIMER and BARCODE fields, respectively.

  • Removed SEQORIENT output field from score subcommand as the mode does not check the reverse complement.

PairSeq:

  • Added the --act argument to provide a mechanism for collapsing values of duplicate fields copied across files.

ParseHeaders:

  • Added merge subcommand to combined separate annotations into a single entry.

Version 0.5.6: January 17, 2018

CollapseSeq:

  • Fixed a bug causing copy fields (--cf argument) to be processed incorrectly.

UnifyHeaders:

  • Improved reporting in log file.

Version 0.5.5: December 26, 2017

AssemblePairs:

  • Fixed a bug that caused the align subcommand to error if input sequences where shorter than the minimum specified by the –minlen argument. It will now simply fail such sequences.

ClusterSets:

  • Moved functionality of previous ClusterSets command into the set subcommand.

  • Added the all subcommand to cluster all sequences without considering annotation groups.

  • Added the barcode subcommand which allows for clustering of reads based on a barcode sequence instead of the read data.

  • Renamed -id argument to --ident for consistency with AssemblePairs.

CollapseSeq:

  • Fixed a bug wherein CollapseSeq would match partial sequences against longer sequences that were otherwise identical up until the missing end characters.

  • Added detailed log output.

EstimateError:

  • Fixed a division by zero warning in the log output when there were no observed mismatches.

UnifyHeaders:

  • New tool to generate consensus annotations or filter reads based on annotation groupings.

Version 0.5.4: July 1, 2017

  • All tools will now print detailed help if no arguments are provided.

AlignSets:

  • Fixed a typo in the console log of the muscle subcommand.

ConvertHeaders:

  • Added the migec subcommand to convert headers from the MIGEC tool.

EstimateError:

  • Fixed a division by zero error when there were no observed mismatches.

  • Bounded error rate to a minimum of 10^-9 (Q=90).

Version 0.5.3: February 14, 2017

License changed to Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).

AssemblePairs:

  • Changed the behavior of the --failed argument so that failed output are in the same orientation as the input sequences. Meaning, the --rc argument is ignored for failed output.

  • Added the sequential subcommand which will first attempt de novo assembly (align subcommand) following by reference guided assembly (reference subcommand) if de novo assembly fails.

  • Added blastn compatibility to reference subcommand.

  • Added the option --aligner to the reference subcommand to allow use of either blastn or usearch for performing the local alignment. Defaults to the usearch algorithm used in previous releases.

  • Added the option --dbexec to the reference subcommand to allow specification of the reference database build tool (eg, makeblastdb).

  • Changed masking behavior to none and word length to 9 in reference subcommand when using usearch as the aligner.

  • Internal modifications to the reference subcommand to rebuild the database before alignments for performance reasons.

  • Fixed a deprecation warning appearing with newer versions of numpy.

BuildConsensus:

  • Fixed a bug in the read group error rate calculation wherein either a consensus sequence or read group that was completely N characters would cause the program to exit with a division by zero error. Now, such non-informative read groups will be assigned an error rate of 1.0.

ClusterSets:

  • Added vsearch compatibility.

  • Fixed a bug wherein sets containing empty sequences were being fed to usearch, rather than automatically failed, which would cause usearch v8 to hang indefinitely.

  • Fixed an incompatibility with usearch v9 due to changes in the way usearch outputs sequence labels.

  • Changed masking behavior of usearch to none.

  • Changed how gaps are handling before passing sequences to usearch. Gaps are now masked (with Ns) for clustering, instead of removed.

EstimateError:

  • Fixed a fatal error with newer versions of pandas.

SplitSeq:

  • Added the select subcommand, which allows filtering of sequences based on annotation value matches or mismatches.

  • Altered the behavior of the -u argument for both the sample and samplepair subcommands. If -u is specified, sampling is performed as in previous versions wherein samples will be drawn from only fields with the specified annotation values up to n total reads. However, if -u is not specified with -f repeated sampling will now be performed for each unique annotation value in the specified field, generating output with up to n reads per unique annotation value.

Version 0.5.2: March 8, 2016

Fixed a bug with installation on Windows due to old file paths lingering in presto.egg-info/SOURCES.txt.

Improvements to commandline usage help messages.

Updated license from CC BY-NC-SA 3.0 to CC BY-NC-SA 4.0.

AssemblePairs:

  • Added the flag --fill to the reference subcommand to allow insertion of the reference sequence into the non-overlapping region of assembled sequences. Use caution when using this flag, as this may lead to chimeric sequences.

  • Changed default --minlen to 8 in align subcommand.

Version 0.5.1: December 4, 2015

ClusterSets:

  • Fixed bug wherein --failed flag did not work.

Version 0.5.0: September 7, 2015

Conversion to a proper Python package which uses pip and setuptools for installation.

The package now requires Python 3.4. Python 2.7 is not longer supported.

The required dependency versions have been bumped to numpy 1.8, scipy 0.14, pandas 0.15, and biopython 1.65.

IgCore:

  • Divided IgCore functionality into the separate modules: Annotation, Commandline, Defaults, IO, Multiprocessing and Sequence.

Version 0.4.8: September 7, 2015

Added support for additional input FASTA (.fna, .fa), FASTQ (.fq) and tab-delimited (.tsv) file extensions.

ParseHeaders:

  • Fixed a bug in the rename subcommand wherein renaming to an existing field deleted the old annotation, but did not merge the renamed annotation into the existing field.

  • Added the copy subcommand which will copy annotations into new field names or merge the annotations of existing fields.

  • Added the --act argument to the copy and rename subcommands allowing collapse following the copy or rename operation.

  • Added a commandline check to ensure that the -f, -k and --act arguments contain the same number of fields for both the rename and copy subcommands.

Version 0.4.7: June 5, 2015

IgCore:

  • Modified scoring functions to permit asymmetrical scores for N and gap characters.

AssemblePairs:

  • Added support for SRA style coordinate information where the where the read number has been appended to the spot number.

  • Altered scoring so gap characters are counted as mismatches in the error rate and identity calculations.

BuildConsensus:

  • Altered scoring so gap characters are counted as mismatches in the diversity and error rate calculations.

ConvertHeaders:

  • Added support for SRA style sequence headers where the read number has been appended to the spot number; eg, output from fastq-dump -I --split-files file.sra.

ClusterSets:

  • Added missing OUTPUT console log field.

  • Changed --bf and --cf arguments to -f and -k, respectively.

MaskPrimers:

  • Altering scoring behavior for N characters such that Ns in the input sequence are always counted as a mismatch, while Ns in the primer sequence are counted as a match, with priority given to the input sequence score.

  • Added --gap argument to the align subcommand which allows users to specify the gap open and gap extension penalties for aligning primers. Note: gap penalties reduce the match count for purposes of calculating ERROR.

PairSeq:

  • Added support for SRA style coordinate information where the where the read number has been appended to the spot number.

Version 0.4.6: May 13, 2015

BuildConsensus:

  • Changed --maxmiss argument to --maxgap and altered the behavior to only perform deletion of positions based on gap characters (only “-” or “.” and not “N” characters).

  • Added an error rate (--maxerror) calculation based on mismatches from consensus. The --maxerror argument is mutually exclusive with the --maxdiv argument and provides similar functionality. However, the calculations are not equivalent, and --maxerror should be considerably faster than --maxdiv.

  • Added exclusion of positions from the error rate calculation that are deleted due to exceeding the --maxgap threshold .

  • Fixed misalignment of consensus sequence against input sequences when positions are deleted due to exceeding the --maxgap threshold.

ClusterSets:

  • New script to cluster read groups by barcode field (eg, UID barcodes) into clustering within the read group.

ConvertHeaders:

  • New script to handle conversion of different sequence description formats to the pRESTO format.

FilterSeq:

  • Added count of masked characters to log output of maskqual subcommand.

  • Changed repeats subcommand log field REPEAT to REPEATS.

PairSeq:

  • Changed -f argument to --1f argument.

  • Added --2f argument to copy file 2 annotations to file 1.

ParseHeaders:

  • Moved convert subcommand to the generic subcommand of the new ConvertHeaders script and modified the conversion behavior.

Version 0.4.5: March 20, 2015

Added details to the usage documentation for each tool which describes both the output files and annotation fields.

Renamed --clean argument to --failed argument with opposite behavior, such that the default behavior of all scripts is now clean output.

IgCore:

  • Features added for Change-O compatibility.

  • Features added for PairSeq performance improvements.

  • Added custom help formatter.

  • Modifications to internals of multiprocessing code.

  • Fixed a few typos in error messages.

AssemblePairs:

  • Added reference subcommand which uses V-region germline alignments from ublast to assemble paired-ends.

  • Removed mate-pair matching operation to increase performance. Now requires both input files to contain matched and uniformly ordered reads. If files are not synchronized then PairSeq must be run first. AssemblePairs will check that coordinate info matches and error if the files are not synchronized. Unpaired reads are no longer output.

  • Added support for cases where one mate pair is the subsequence of the other.

  • Added --scanrev argument to allow for head sequence to overhang end of tail.

  • Removed truncated (quick) error calculation in align subcommand.

  • Changed default values of the --maxerror and --alpha arguments of the align subcommand to better tuned parameters.

  • Changed internal selection of top scoring alignment to use Z-score approximation rather than a combination of error rate and binomial mid-p value.

  • Internal changes to multiprocessing structure.

  • Changed inserted gap character from - to N in join subcommand for better compatibility with the behavior of IMGT/HighV-QUEST.

  • Changed PVAL log field to PVALUE.

  • Changed HEADSEQ and TAILSEQ log fields to SEQ1 and SEQ2.

  • Changed HEADFIELDS and TAILFIELDS log fields to FIELDS1 and FIELDS2.

  • Changed precision of ERROR and PVALUE log fields.

  • Added more verbose logging.

BuildConsensus:

  • Fixed bug where low quality positions where not being masked in single sequence barcode groups.

  • Added copy field (--cf) and copy action (--act) arguments to generate consensus annotations for barcode read groups.

  • Changed maximum consensus quality score from 93 to 90.

CollapseSeq:

  • Added --keep argument to allow retention of sequences with high missing character counts in unique sequence output file.

  • Removed case insensitivity for performance reasons. Now requires all sequences to have matching case.

  • Removed first and last from --act choices to avoid unexpected behavior.

MaskPrimers:

  • Changed behavior of N characters in primer identification. Ns now count as a match against any character, rather than a mismatch.

  • Changed behavior of mask mode such that positions masked with Ns are now assigned quality scores of 0, rather than retaining their previous scores.

  • Fixed a bug with the align subcommand where deletions within the input sequence (gaps in the alignment) were causing an incorrect barcode start position.

PairSeq:

  • Performance improvements. The tool should now be considerably faster on very large files.

  • Specifying the --failed argument to request output of sequences which do not have a mate pair will increase run time and memory usage.

ParseHeaders:

  • Add ‘cat’ action to collapse subcommand which concatenates strings into a single annotation.

SplitSeq:

  • Removed --clean (and --failed) flag from all subcommands.

  • Added progress updates to sample and samplepair subcommands.

  • Performance improvements to samplepair subcommand.

Version 0.4.4: June 10, 2014

SplitSeq:

  • Removed a linux-specific dependency, allowing SplitSeq to work on Windows.

Version 0.4.3: April 7, 2014

CollapseSeq:

  • Fixed bug that occurs with Python 2.7.5 on OS X.

SplitSeq:

  • Fixed bug in samplepairs subcommand that occurs with Python 2.7.5 on OS X.

Version 0.4.2: March 20, 2014

Increased verbosity of exception reporting.

IgCore:

  • Updates to consensus functions to support changes to BuildConsensus.

AssemblePairs:

  • Set default alpha to 0.01.

BuildConsensus:

  • Added support for --freq value parameter to quality consensus method and set default value to 0.6.

  • Fixed a bug in the frequency consensus method where missing values were contributing to the total character count at each position.

  • Added the parameter --maxmiss value which provides a cut-off for removal of positions with too many N or gap characters .

MaskPrimers:

  • Renamed the --reverse parameter to --revpr.

SplitSeq:

  • Removed convert subcommand.

Version 0.4.1: January 27, 2014

Changes to the internals of multiple tools to provide support for multiprocessing in Windows environments.

Changes to the internals of multiple tools to provide clean exit of child processes upon kill signal or exception in sibling process.

Fixed unexpected behavior of --outname and --log arguments with multiple input files.

IgCore:

  • Added reporting of unknown exceptions when reading sequence files

  • Fixed scoring of lowercase sequences.

AlignSets:

  • Fixed a typo in the log output.

BuildConsensus:

  • Fixed a typo in the log output.

EstimateError:

  • Fixed bug where tool would improperly exit if no sets passed threshold criteria.

  • Fixed typo in console output.

MaskPrimers:

  • Added trim mode which will cut the region before primer alignment, but leave primer region unmodified.

  • Fixed a bug with lowercase sequence data.

  • Fixed bug in the console and log output.

  • Added support for primer matching when setting --maxerr 1.0.

ParseHeaders:

  • Added count of sequences without any valid fields (FAIL) to console output.

ParseLog:

  • Added count of records without any valid fields (FAIL) to console output.

SplitSeq:

  • Fixed typo in console output of samplepair subcommand.

  • Added increase of the open file limit to the group subcommand to allow for a large number of groups.

Version 0.4.0: September 30, 2013

Minor name changes were made to multiple scripts, functions, parameters, and output files.

AlignSets, AssemblePairs, BuildConsensus, EstimateError, FilterSeq, and MaskPrimers are now multithreaded. The number of simultaneous processes may be specified using --nproc value. Note this means file ordering is no longer preserved between the input and output sequence files.

Performance improvements were made to several tools.

The universal --verbose parameter was replaced with --log file_name which specifies a log file for verbose output, and disables verbose logging if not specified.

The report of input parameters and sequence counts is now separate from the log and is always printed to standard output.

Added a progress bar to the standard output of most tools.

Added a universal --outname file_prefix parameter which changes the leading portion of the output file name. If not specified, the current file name is used (excluding the file extension, as per the previous behavior).

Added a universal --clean parameter which if specified forces the tool not to create an output file of sequences which failed processing.

IgCore:

  • Changes to parameters and internals of multiple functions.

  • Added functions to support multithreading for single-end reads, paired-end reads, and barcode sets.

  • Added safe annotation field renaming.

  • Added progress bar, logging and output file name conversion support.

  • Moved reusable AssemblePairs, BuildConsensus, PairSeq, and SplitSeq. operations into IgCore.

AssemblePairs:

  • Coordinate information is now specified by a coordinate type, rather than a delimiter, using the --coord header_type parameter, where the header type may be one of illumina, solexa, sra, 454, presto.

CollapseSeq:

  • Sequences with a missing character count exceeding the user limit defined by -n maximum_missing_count are now exported to a separate collapse-undetermined output file, rather than included in the collapse-unique sequence output.

EstimateError:

  • Now outputs error estimations for positions, quality scores, nucleotide pairs, and annotation sets.

  • Machine reported quality scores and empirical quality scores have been added to all output tables.

FilterSeq:

  • Added length subcommand to filter sequences by minimum length.

PairSeq:

  • Coordinate information has been redefined as per AssemblePairs.

ParseHeaders:

  • Added new subcommand convert which attempts to reformat sequence headers into the pRESTO format.

  • The rename subcommand will now append entries if the new field name already exists in the sequence header, rather than replace the entry.

Version 0.3 (prerelease 6): August 13, 2013

Toolkit is now dependent upon pandas 0.12 for the estimateError tool.

alignSets:

  • Changed MUSCLE execution to faster settings (-diags, -maxiters 2).

filterQuality:

  • Added repeat subcommand to filter sequences with -n (value) repetitions of a single character and.

  • Changed -n parameter of ambig subcommand from fractional value to a raw count.

estimateError:

  • New tool which estimates error of sequence sets by comparison to a consensus.

maskPrimers:

  • Bug fixes to alignment position calculation of align subcommand when primer alignment begins before start of sequence.

  • Removed --ann parameter.

Version 0.3 (prerelease 5): August 7, 2013

License changed to Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

IgPipeline Core:

  • Bug fixes to diversity calculation.

  • Added support for files where all sequences do not share the same annotation fields.

  • Added support for alternate scoring of gap and N-valued nucleotides.

alignSets:

  • Added --mode parameter with options of pad and cut to specify whether to extend or trim read groups to the same start position.

  • Fixed intermittent ‘muscle’ subcommand stdout pipe deadlock when executing MUSCLE.

assemblePairs:

  • Added join subcommand to support library preps where paired-end reads do not overlap.

  • Speed improvements to p-value calculations.

buildConsensus:

  • --div parameter converted to --maxdiv value to allow filtering of read groups by diversity.

  • Bug fixes to nucleotide frequency consensus method.

  • -q parameter renamed to --qual.

collapseSequences:

  • Added support for files where all sequences do not share the same annotation fields.

splitSeqFile:

  • samplepair subcommand added to allow random sampling from paired-end file sets.

  • The behavior of the -c parameter of the sample and samplepair subcommands changed to allow multiple samplings with the same command.

Version 0.3 (prerelease 4): May 18, 2013

Initial public prerelease

Commandline Usage

AlignSets.py

Multiple aligns input sequences by group

usage: AlignSets.py [--version] [-h]  ...
--version

show program’s version number and exit

-h, --help

show this help message and exit

output files:
align-pass

multiple aligned reads.

align-fail

raw reads failing multiple alignment.

offsets-forward

5’ offset table for input into offset subcommand.

offsets-reverse

3’ offset table for input into offset subcommand.

output annotation fields:

None

AlignSets.py muscle

Align sequence sets using muscle.

usage: AlignSets.py muscle [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                           [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                           [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                           [--fasta] [--delim DELIMITER DELIMITER DELIMITER]
                           [--nproc NPROC] [--bf BARCODE_FIELD] [--div]
                           [--exec ALIGNER_EXEC]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

--bf <barcode_field>

The annotation field containing barcode labels for sequence grouping.

--div

Specify to calculate nucleotide diversity of each set (average pairwise error rate).

--exec <aligner_exec>

The name or location of the muscle executable.

AlignSets.py offset

Align sequence sets using predefined 5’ offset.

usage: AlignSets.py offset [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                           [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                           [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                           [--fasta] [--delim DELIMITER DELIMITER DELIMITER]
                           [--nproc NPROC] [-d OFFSET_TABLE]
                           [--bf BARCODE_FIELD] [--pf PRIMER_FIELD]
                           [--mode {pad,cut}] [--div]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

-d <offset_table>

The tab delimited file of offset tags and values.

--bf <barcode_field>

The annotation field containing barcode labels for sequence grouping.

--pf <primer_field>

The primer field to use for offset assignment.

--mode {pad,cut}

Specifies whether or align sequence by padding with gaps or by cutting the 5’ sequence to a common start position.

--div

Specify to calculate nucleotide diversity of each set (average pairwise error rate).

AlignSets.py table

Create a 5’ offset table by primer multiple alignment.

usage: AlignSets.py table [--version] [-h] [--outdir OUT_DIR]
                          [--outname OUT_NAME] [--failed]
                          [--delim DELIMITER DELIMITER DELIMITER] -p
                          PRIMER_FILE [-o OUT_FILE] [--reverse]
                          [--exec ALIGNER_EXEC]
--version

show program’s version number and exit

-h, --help

show this help message and exit

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-p <primer_file>

A FASTA file containing primer sequences.

-o <out_file>

Explicit output file name. Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--reverse

If specified create a 3’ offset table instead

--exec <aligner_exec>

The name or location of the muscle executable.

AssemblePairs.py

Assembles paired-end reads into a single sequence

usage: AssemblePairs.py [--version] [-h]  ...
--version

show program’s version number and exit

-h, --help

show this help message and exit

output files:
assemble-pass

successfully assembled reads.

assemble-fail

raw reads failing paired-end assembly.

output annotation fields:
<user defined>

annotation fields specified by the –1f or –2f arguments.

AssemblePairs.py align

Assemble pairs by aligning ends.

usage: AssemblePairs.py align [--version] [-h] -1 SEQ_FILES_1
                              [SEQ_FILES_1 ...] -2 SEQ_FILES_2
                              [SEQ_FILES_2 ...] [-o OUT_FILES [OUT_FILES ...]]
                              [--outdir OUT_DIR] [--outname OUT_NAME]
                              [--log LOG_FILE] [--failed] [--fasta]
                              [--delim DELIMITER DELIMITER DELIMITER]
                              [--nproc NPROC]
                              [--coord {illumina,solexa,sra,454,presto}]
                              [--rc {tail,head,both,none}]
                              [--1f HEAD_FIELDS [HEAD_FIELDS ...]]
                              [--2f TAIL_FIELDS [TAIL_FIELDS ...]]
                              [--alpha ALPHA] [--maxerror MAX_ERROR]
                              [--minlen MIN_LEN] [--maxlen MAX_LEN]
                              [--scanrev]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-1 <seq_files_1>

An ordered list of FASTA/FASTQ files containing head/primary sequences.

-2 <seq_files_2>

An ordered list of FASTA/FASTQ files containing tail/secondary sequences.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

--coord {illumina,solexa,sra,454,presto}

The format of the sequence identifier which defines shared coordinate information across paired ends.

--rc {tail,head,both,none}

Specify which read to reverse complement before stitching.

--1f <head_fields>

Specify annotation fields to copy from head records into assembled record.

--2f <tail_fields>

Specify annotation fields to copy from tail records into assembled record.

--alpha <alpha>

Significance threshold for de novo paired-end assembly.

--maxerror <max_error>

Maximum allowable error rate for de novo assembly.

--minlen <min_len>

Minimum sequence length to scan for overlap in de novo assembly.

--maxlen <max_len>

Maximum sequence length to scan for overlap in de novo assembly.

--scanrev

If specified, scan past the end of the tail sequence in de novo assembly to allow the head sequence to overhang the end of the tail sequence.

AssemblePairs.py join

Assemble pairs by concatenating ends.

usage: AssemblePairs.py join [--version] [-h] -1 SEQ_FILES_1 [SEQ_FILES_1 ...]
                             -2 SEQ_FILES_2 [SEQ_FILES_2 ...]
                             [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                             [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                             [--fasta] [--delim DELIMITER DELIMITER DELIMITER]
                             [--nproc NPROC]
                             [--coord {illumina,solexa,sra,454,presto}]
                             [--rc {tail,head,both,none}]
                             [--1f HEAD_FIELDS [HEAD_FIELDS ...]]
                             [--2f TAIL_FIELDS [TAIL_FIELDS ...]] [--gap GAP]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-1 <seq_files_1>

An ordered list of FASTA/FASTQ files containing head/primary sequences.

-2 <seq_files_2>

An ordered list of FASTA/FASTQ files containing tail/secondary sequences.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

--coord {illumina,solexa,sra,454,presto}

The format of the sequence identifier which defines shared coordinate information across paired ends.

--rc {tail,head,both,none}

Specify which read to reverse complement before stitching.

--1f <head_fields>

Specify annotation fields to copy from head records into assembled record.

--2f <tail_fields>

Specify annotation fields to copy from tail records into assembled record.

--gap <gap>

Number of N characters to place between ends.

AssemblePairs.py reference

Assemble pairs by aligning reads against a reference database.

usage: AssemblePairs.py reference [--version] [-h] -1 SEQ_FILES_1
                                  [SEQ_FILES_1 ...] -2 SEQ_FILES_2
                                  [SEQ_FILES_2 ...]
                                  [-o OUT_FILES [OUT_FILES ...]]
                                  [--outdir OUT_DIR] [--outname OUT_NAME]
                                  [--log LOG_FILE] [--failed] [--fasta]
                                  [--delim DELIMITER DELIMITER DELIMITER]
                                  [--nproc NPROC]
                                  [--coord {illumina,solexa,sra,454,presto}]
                                  [--rc {tail,head,both,none}]
                                  [--1f HEAD_FIELDS [HEAD_FIELDS ...]]
                                  [--2f TAIL_FIELDS [TAIL_FIELDS ...]] -r
                                  REF_FILE [--minident MIN_IDENT]
                                  [--evalue EVALUE] [--maxhits MAX_HITS]
                                  [--fill] [--aligner {blastn,usearch}]
                                  [--exec ALIGNER_EXEC] [--dbexec DB_EXEC]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-1 <seq_files_1>

An ordered list of FASTA/FASTQ files containing head/primary sequences.

-2 <seq_files_2>

An ordered list of FASTA/FASTQ files containing tail/secondary sequences.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

--coord {illumina,solexa,sra,454,presto}

The format of the sequence identifier which defines shared coordinate information across paired ends.

--rc {tail,head,both,none}

Specify which read to reverse complement before stitching.

--1f <head_fields>

Specify annotation fields to copy from head records into assembled record.

--2f <tail_fields>

Specify annotation fields to copy from tail records into assembled record.

-r <ref_file>

A FASTA file containing the reference sequence database.

--minident <min_ident>

Minimum identity of the assembled sequence required to call a valid reference guided assembly (between 0 and 1).

--evalue <evalue>

Minimum E-value for reference alignment for both the head and tail sequence.

--maxhits <max_hits>

Maximum number of hits from the reference alignment to check for matching head and tail sequence assignments.

--fill

Specify to change the behavior of inserted characters when the head and tail sequences do not overlap during reference guided assembly. If specified, this will result in inserted of the V region reference sequence instead of a sequence of Ns in the non-overlapping region. Warning: you could end up making chimeric sequences by using this option.

--aligner {blastn,usearch}

The local alignment tool to use. Must be one blastn (blast+ nucleotide) or usearch (ublast algorithm).

--exec <aligner_exec>

The name or location of the aligner executable file (blastn or usearch). Defaults to the name specified by the –aligner argument.

--dbexec <db_exec>

The name or location of the executable file that builds the reference database. This defaults to makeblastdb when blastn is specified to the –aligner argument, and usearch when usearch is specified.

AssemblePairs.py sequential

Assemble pairs by first attempting de novo assembly, then reference guided assembly.

usage: AssemblePairs.py sequential [--version] [-h] -1 SEQ_FILES_1
                                   [SEQ_FILES_1 ...] -2 SEQ_FILES_2
                                   [SEQ_FILES_2 ...]
                                   [-o OUT_FILES [OUT_FILES ...]]
                                   [--outdir OUT_DIR] [--outname OUT_NAME]
                                   [--log LOG_FILE] [--failed] [--fasta]
                                   [--delim DELIMITER DELIMITER DELIMITER]
                                   [--nproc NPROC]
                                   [--coord {illumina,solexa,sra,454,presto}]
                                   [--rc {tail,head,both,none}]
                                   [--1f HEAD_FIELDS [HEAD_FIELDS ...]]
                                   [--2f TAIL_FIELDS [TAIL_FIELDS ...]]
                                   [--alpha ALPHA] [--maxerror MAX_ERROR]
                                   [--minlen MIN_LEN] [--maxlen MAX_LEN]
                                   [--scanrev] -r REF_FILE
                                   [--minident MIN_IDENT] [--evalue EVALUE]
                                   [--maxhits MAX_HITS] [--fill]
                                   [--aligner {blastn,usearch}]
                                   [--exec ALIGNER_EXEC] [--dbexec DB_EXEC]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-1 <seq_files_1>

An ordered list of FASTA/FASTQ files containing head/primary sequences.

-2 <seq_files_2>

An ordered list of FASTA/FASTQ files containing tail/secondary sequences.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

--coord {illumina,solexa,sra,454,presto}

The format of the sequence identifier which defines shared coordinate information across paired ends.

--rc {tail,head,both,none}

Specify which read to reverse complement before stitching.

--1f <head_fields>

Specify annotation fields to copy from head records into assembled record.

--2f <tail_fields>

Specify annotation fields to copy from tail records into assembled record.

--alpha <alpha>

Significance threshold for de novo paired-end assembly.

--maxerror <max_error>

Maximum allowable error rate for de novo assembly.

--minlen <min_len>

Minimum sequence length to scan for overlap in de novo assembly.

--maxlen <max_len>

Maximum sequence length to scan for overlap in de novo assembly.

--scanrev

If specified, scan past the end of the tail sequence in de novo assembly to allow the head sequence to overhang the end of the tail sequence.

-r <ref_file>

A FASTA file containing the reference sequence database.

--minident <min_ident>

Minimum identity of the assembled sequence required to call a valid reference guided assembly (between 0 and 1).

--evalue <evalue>

Minimum E-value for reference alignment for both the head and tail sequence.

--maxhits <max_hits>

Maximum number of hits from the reference alignment to check for matching head and tail sequence assignments.

--fill

Specify to change the behavior of inserted characters when the head and tail sequences do not overlap during reference guided assembly. If specified, this will result in inserted of the V region reference sequence instead of a sequence of Ns in the non-overlapping region. Warning: you could end up making chimeric sequences by using this option.

--aligner {blastn,usearch}

The local alignment tool to use. Must be one blastn (blast+ nucleotide) or usearch (ublast algorithm).

--exec <aligner_exec>

The name or location of the aligner executable file (blastn or usearch). Defaults to the name specified by the –aligner argument.

--dbexec <db_exec>

The name or location of the executable file that builds the reference database. This defaults to makeblastdb when blastn is specified to the –aligner argument, and usearch when usearch is specified.

BuildConsensus.py

Builds a consensus sequence for each set of input sequences

usage: BuildConsensus.py [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                         [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                         [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                         [--fasta] [--delim DELIMITER DELIMITER DELIMITER]
                         [--nproc NPROC] [-n MIN_COUNT] [--bf BARCODE_FIELD]
                         [-q MIN_QUAL] [--freq MIN_FREQ] [--maxgap MAX_GAP]
                         [--pf PRIMER_FIELD] [--prcons PRIMER_FREQ]
                         [--cf COPY_FIELDS [COPY_FIELDS ...]]
                         [--act {min,max,sum,set,majority} [{min,max,sum,set,majority} ...]]
                         [--dep]
                         [--maxdiv MAX_DIVERSITY | --maxerror MAX_ERROR]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

-n <min_count>

The minimum number of sequences needed to define a valid consensus.

--bf <barcode_field>

Position of description barcode field to group sequences by.

-q <min_qual>

Consensus quality score cut-off under which an ambiguous character is assigned; does not apply when quality scores are unavailable.

--freq <min_freq>

Fraction of character occurrences under which an ambiguous character is assigned.

--maxgap <max_gap>

If specified, this defines a cut-off for the frequency of allowed gap values for each position. Positions exceeding the threshold are deleted from the consensus. If not defined, positions are always retained.

--pf <primer_field>

Specifies the field name of the primer annotations

--prcons <primer_freq>

Specify to define a minimum primer frequency required to assign a consensus primer, and filter out sequences with minority primers from the consensus building step.

--cf <copy_fields>

Specifies a set of additional annotation fields to copy into the consensus sequence annotations.

--act {min,max,sum,set,majority}

List of actions to take for each copy field which defines how each annotation will be combined into a single value. The actions “min”, “max”, “sum” perform the corresponding mathematical operation on numeric annotations. The action “set” combines annotations into a comma delimited list of unique values and adds an annotation named <FIELD>_COUNT specifying the count of each item in the set. The action “majority” assigns the most frequent annotation to the consensus annotation and adds an annotation named <FIELD>_FREQ specifying the frequency of the majority value.

--dep

Specify to calculate consensus quality with a non-independence assumption

--maxdiv <max_diversity>

Specify to calculate the nucleotide diversity of each read group (average pairwise error rate) and remove groups exceeding the given diversity threshold. Diversity is calculate for all positions within the read group, ignoring any character filtering imposed by the -q, –freq and –maxgap arguments. Mutually exclusive with –maxerror.

--maxerror <max_error>

Specify to calculate the error rate of each read group (rate of mismatches from consensus) and remove groups exceeding the given error threshold. The error rate is calculated against the final consensus sequence, which may include masked positions due to the -q and –freq arguments and may have deleted positions due to the –maxgap argument. Mutually exclusive with –maxdiv.

output files:
consensus-pass

consensus reads.

consensus-fail

raw reads failing consensus filtering criteria.

output annotation fields:
PRIMER

a comma delimited list of unique primer annotations found within the barcode read group.

PRCOUNT

a comma delimited list of the corresponding counts of unique primer annotations.

PRCONS

the majority primer within the barcode read group.

PRFREQ

the frequency of the majority primer.

CONSCOUNT

the count of reads within the barcode read group which contributed to the consensus sequence. This is the total size of the read group, minus sequence excluded due to user defined filtering criteria.

ClusterSets.py

Cluster sequences by group

usage: ClusterSets.py [--version] [-h]  ...
--version

show program’s version number and exit

-h, --help

show this help message and exit

output files:
cluster-pass

clustered reads.

cluster-fail

raw reads failing clustering.

output annotation fields:
CLUSTER

a numeric cluster identifier defining the within-group cluster.

ClusterSets.py all

Cluster all sequences regardless of annotation.

usage: ClusterSets.py all [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                          [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                          [--outname OUT_NAME] [--fasta]
                          [--delim DELIMITER DELIMITER DELIMITER]
                          [--nproc NPROC] [-k CLUSTER_FIELD] [--ident IDENT]
                          [--length LENGTH_RATIO] [--prefix CLUSTER_PREFIX]
                          [--cluster {usearch,vsearch,cd-hit-est}]
                          [--mem CLUSTER_MEMORY] [--exec CLUSTER_EXEC]
                          [--start SEQ_START] [--end SEQ_END]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

-k <cluster_field>

The name of the output annotation field to add with the cluster information for each sequence.

--ident <ident>

The sequence identity threshold to use for clustering. Note, how identity is calculated is specific to the clustering application used.

--length <length_ratio>

The minimum allowed shorter/longer sequence length ratio allowed within a cluster. Setting this value to 1.0 will require identical length matches within clusters. A value of 0.0 will allow clusters containing any length of substring.

--prefix <cluster_prefix>

A string to use as the prefix for each cluster identifier. By default, cluster identifiers will be numeric values only.

--cluster {usearch,vsearch,cd-hit-est}

The clustering tool to use for assigning clusters. Must be one of usearch, vsearch or cd-hit-est. Note, for cd-hit-est the default maximum memory limit is set to 3GB.

--mem <cluster_memory>

The maximum memory limit for cd-hit-est in MB. Ignored if using usearch or vsearch.

--exec <cluster_exec>

The name or path of the usearch, vsearch or cd-hit-est executable.

--start <seq_start>

The start of the region to be used for clustering. Together with –end, this parameter can be used to specify a subsequence of each read to use in the clustering algorithm.

--end <seq_end>

The end of the region to be used for clustering.

ClusterSets.py barcode

Cluster reads by clustering barcode sequences.

usage: ClusterSets.py barcode [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                              [-o OUT_FILES [OUT_FILES ...]]
                              [--outdir OUT_DIR] [--outname OUT_NAME]
                              [--fasta]
                              [--delim DELIMITER DELIMITER DELIMITER]
                              [--nproc NPROC] [-k CLUSTER_FIELD]
                              [--ident IDENT] [--length LENGTH_RATIO]
                              [--prefix CLUSTER_PREFIX]
                              [--cluster {usearch,vsearch,cd-hit-est}]
                              [--mem CLUSTER_MEMORY] [--exec CLUSTER_EXEC]
                              [-f BARCODE_FIELD]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

-k <cluster_field>

The name of the output annotation field to add with the cluster information for each sequence.

--ident <ident>

The sequence identity threshold to use for clustering. Note, how identity is calculated is specific to the clustering application used.

--length <length_ratio>

The minimum allowed shorter/longer sequence length ratio allowed within a cluster. Setting this value to 1.0 will require identical length matches within clusters. A value of 0.0 will allow clusters containing any length of substring.

--prefix <cluster_prefix>

A string to use as the prefix for each cluster identifier. By default, cluster identifiers will be numeric values only.

--cluster {usearch,vsearch,cd-hit-est}

The clustering tool to use for assigning clusters. Must be one of usearch, vsearch or cd-hit-est. Note, for cd-hit-est the default maximum memory limit is set to 3GB.

--mem <cluster_memory>

The maximum memory limit for cd-hit-est in MB. Ignored if using usearch or vsearch.

--exec <cluster_exec>

The name or path of the usearch, vsearch or cd-hit-est executable.

-f <barcode_field>

The annotation field containing barcode sequences.

ClusterSets.py set

Cluster sequences within annotation sets.

usage: ClusterSets.py set [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                          [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                          [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                          [--fasta] [--delim DELIMITER DELIMITER DELIMITER]
                          [--nproc NPROC] [-k CLUSTER_FIELD] [--ident IDENT]
                          [--length LENGTH_RATIO] [--prefix CLUSTER_PREFIX]
                          [--cluster {usearch,vsearch,cd-hit-est}]
                          [--mem CLUSTER_MEMORY] [--exec CLUSTER_EXEC]
                          [-f SET_FIELD] [--start SEQ_START] [--end SEQ_END]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

-k <cluster_field>

The name of the output annotation field to add with the cluster information for each sequence.

--ident <ident>

The sequence identity threshold to use for clustering. Note, how identity is calculated is specific to the clustering application used.

--length <length_ratio>

The minimum allowed shorter/longer sequence length ratio allowed within a cluster. Setting this value to 1.0 will require identical length matches within clusters. A value of 0.0 will allow clusters containing any length of substring.

--prefix <cluster_prefix>

A string to use as the prefix for each cluster identifier. By default, cluster identifiers will be numeric values only.

--cluster {usearch,vsearch,cd-hit-est}

The clustering tool to use for assigning clusters. Must be one of usearch, vsearch or cd-hit-est. Note, for cd-hit-est the default maximum memory limit is set to 3GB.

--mem <cluster_memory>

The maximum memory limit for cd-hit-est in MB. Ignored if using usearch or vsearch.

--exec <cluster_exec>

The name or path of the usearch, vsearch or cd-hit-est executable.

-f <set_field>

The annotation field containing annotations, such as UMI barcode, for sequence grouping.

--start <seq_start>

The start of the region to be used for clustering. Together with –end, this parameter can be used to specify a subsequence of each read to use in the clustering algorithm.

--end <seq_end>

The end of the region to be used for clustering.

CollapseSeq.py

Removes duplicate sequences from FASTA/FASTQ files

usage: CollapseSeq.py [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                      [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                      [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                      [--fasta] [--delim DELIMITER DELIMITER DELIMITER]
                      [-n MAX_MISSING] [--uf UNIQ_FIELDS [UNIQ_FIELDS ...]]
                      [--cf COPY_FIELDS [COPY_FIELDS ...]]
                      [--act {min,max,sum,set} [{min,max,sum,set} ...]]
                      [--inner] [--keepmiss]
                      [--maxf MAX_FIELD | --minf MIN_FIELD]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-n <max_missing>

Maximum number of missing nucleotides to consider for collapsing sequences. A sequence will be considered undetermined if it contains too many missing nucleotides. Note, setting a value above 0 will consider ambiguous/missing nucleotides via a distance calculation, but is considerably more computationally expensive, especially on large data sets.

--uf <uniq_fields>

Specifies a set of annotation fields that must match for sequences to be considered duplicates.

--cf <copy_fields>

Specifies a set of annotation fields to copy into the unique sequence output.

--act {min,max,sum,set}

List of actions to take for each copy field which defines how each annotation will be combined into a single value. The actions “min”, “max”, “sum” perform the corresponding mathematical operation on numeric annotations. The action “set” collapses annotations into a comma delimited list of unique values.

--inner

If specified, exclude consecutive missing characters at either end of the sequence.

--keepmiss

If specified, sequences with more missing characters than the threshold set by the -n parameter will be written to the unique sequence output file with a DUPCOUNT=1 annotation. If not specified, such sequences will be written to a separate file.

--maxf <max_field>

Specify the field whose maximum value determines the retained sequence; mutually exclusive with –minf.

--minf <min_field>

Specify the field whose minimum value determines the retained sequence; mutually exclusive with –minf.

output files:
collapse-unique

unique sequences. Contains one representative from each set of duplicate sequences. The retained representative is determined by user defined criteria.

collapse-duplicate

raw reads which are duplicates of the sequences retained in the collapse-unique file.

collapse-undetermined

raw reads which were excluded from consideration due to having too many N characters in the sequence.

output annotation fields:
DUPCOUNT

total number of sequences within the set of duplicates for each retained unique sequence. Meaning, the copy number of each unique sequence within the data file.

<user defined>

annotation fields specified by the –cf parameter.

ConvertHeaders.py

Converts sequence headers to the pRESTO format

usage: ConvertHeaders.py [--version] [-h]  ...
--version

show program’s version number and exit

-h, --help

show this help message and exit

output files:
convert-pass

reads passing header conversion.

convert-fail

raw reads failing header conversion.

output annotation fields:
<format defined>

the annotation fields added are specific to the header format of the input file.

ConvertHeaders.py 454

Converts Roche 454 sequence headers.

usage: ConvertHeaders.py 454 [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                             [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                             [--outname OUT_NAME] [--failed] [--fasta]
                             [--delim DELIMITER DELIMITER DELIMITER]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

ConvertHeaders.py genbank

Converts NCBI GenBank and RefSeq sequence headers.

usage: ConvertHeaders.py genbank [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                                 [-o OUT_FILES [OUT_FILES ...]]
                                 [--outdir OUT_DIR] [--outname OUT_NAME]
                                 [--failed] [--fasta]
                                 [--delim DELIMITER DELIMITER DELIMITER]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

ConvertHeaders.py generic

Converts sequence headers without a known annotation system.

usage: ConvertHeaders.py generic [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                                 [-o OUT_FILES [OUT_FILES ...]]
                                 [--outdir OUT_DIR] [--outname OUT_NAME]
                                 [--failed] [--fasta]
                                 [--delim DELIMITER DELIMITER DELIMITER]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

ConvertHeaders.py illumina

Converts Illumina sequence headers.

usage: ConvertHeaders.py illumina [--version] [-h] -s SEQ_FILES
                                  [SEQ_FILES ...]
                                  [-o OUT_FILES [OUT_FILES ...]]
                                  [--outdir OUT_DIR] [--outname OUT_NAME]
                                  [--failed] [--fasta]
                                  [--delim DELIMITER DELIMITER DELIMITER]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

ConvertHeaders.py imgt

Converts sequence headers output by IMGT/GENE-DB.

usage: ConvertHeaders.py imgt [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                              [-o OUT_FILES [OUT_FILES ...]]
                              [--outdir OUT_DIR] [--outname OUT_NAME]
                              [--failed] [--fasta]
                              [--delim DELIMITER DELIMITER DELIMITER]
                              [--simple]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--simple

If specified, only the allele name, and no other annotations, will appear in the converted sequence header.

ConvertHeaders.py migec

Converts headers for consensus sequence generated by the MIGEC tool.

usage: ConvertHeaders.py migec [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                               [-o OUT_FILES [OUT_FILES ...]]
                               [--outdir OUT_DIR] [--outname OUT_NAME]
                               [--failed] [--fasta]
                               [--delim DELIMITER DELIMITER DELIMITER]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

ConvertHeaders.py sra

Converts NCBI SRA or EMBL-EBI ENA sequence headers.

usage: ConvertHeaders.py sra [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                             [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                             [--outname OUT_NAME] [--failed] [--fasta]
                             [--delim DELIMITER DELIMITER DELIMITER]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

EstimateError.py

Calculates annotation set error rates

usage: EstimateError.py [--version] [-h]  ...
--version

show program’s version number and exit

-h, --help

show this help message and exit

output files:
error-position

estimated error by read position.

error-quality

estimated error by the quality score assigned within the input file.

error-nucleotide

estimated error by nucleotide.

error-set

estimated error by annotation set size.

distance-set

pairwise hamming distances by annotation set.

threshold-set

thresholds from pairwise hamming distances for annotation sets.

distance-barcode

estimated error by pairwise hamming distances

threshold-barcode

thresholds from pairwise hamming distances for clustering barcodes

output fields:
POSITION

read position with base zero indexing.

Q

Phred quality score.

OBSERVED

observed nucleotide value.

REFERENCE

consensus nucleotide for the barcode read group.

SET_COUNT

barcode read group size.

REPORTED_Q

mean Phred quality score reported within the input file for the given position, quality score, nucleotide or read group.

MISMATCHES

count of observed mismatches from consensus for the given position, quality score, nucleotide or read group.

OBSERVATIONS

total count of observed values for each position, quality score, nucleotide or read group size.

ERROR

estimated error rate.

EMPIRICAL_Q

estimated error rate converted to a Phred quality score.

ALL

histogram (count) of all pairwise distance distribution.

DTN

histogram (count) of distance to nearest distribution.

DISTANCE

length normalized hamming distance.

EstimateError.py barcode

Calculates pairwise distance metrics of barcode sequences.

usage: EstimateError.py barcode [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                                [--outdir OUT_DIR] [--outname OUT_NAME]
                                [--delim DELIMITER DELIMITER DELIMITER]
                                [-f BARCODE_FIELD] [--pad {none,head,tail}]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-f <barcode_field>

The name of the barcode field. Note, barcodes are expected to all be identical length. Barcode sequences shorter than the maximum barcode length will be excluded from the distance calculations.

--pad {none,head,tail}

Specifies the action to take for barcode sequences shorter than the maximum barcode length. The “none” action will exclude truncated barcodes from the distance calculations. The “head” and “tail” actions will add N characters to either the front or back, respectively, of truncated barcode sequence to give all barcodes identical length. N characters will be treated as mismatches in the distance calculation.

EstimateError.py set

Estimates error statistics within annotation sets.

usage: EstimateError.py set [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                            [--outdir OUT_DIR] [--outname OUT_NAME]
                            [--log LOG_FILE]
                            [--delim DELIMITER DELIMITER DELIMITER]
                            [--nproc NPROC] [-f SET_FIELD] [-n MIN_COUNT]
                            [--mode {freq,qual}] [-q MIN_QUAL]
                            [--freq MIN_FREQ] [--maxdiv MAX_DIVERSITY]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

-f <set_field>

The name of the annotation field to group sequences by

-n <min_count>

The minimum number of sequences needed to consider a set

--mode {freq,qual}

Specifies which method to use to determine the consensus sequence. The “freq” method will determine the consensus by nucleotide frequency at each position and assign the most common value. The “qual” method will weight values by their quality scores to determine the consensus nucleotide at each position.

-q <min_qual>

Consensus quality score cut-off under which an ambiguous character is assigned.

--freq <min_freq>

Fraction of character occurrences under which an ambiguous character is assigned.

--maxdiv <max_diversity>

Specify to calculate the nucleotide diversity of each read group (average pairwise error rate) and exclude groups which exceed the given diversity threshold.

FilterSeq.py

Filters sequences in FASTA/FASTQ files

usage: FilterSeq.py [--version] [-h]  ...
--version

show program’s version number and exit

-h, --help

show this help message and exit

output files:
<command>-pass

reads passing filtering operation and modified accordingly, where <command> is the name of the filtering operation that was run.

<command>-fail

raw reads failing filtering criteria, where <command> is the name of the filtering operation.

output annotation fields:

None

FilterSeq.py length

Filters reads by length.

usage: FilterSeq.py length [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                           [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                           [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                           [--fasta] [--nproc NPROC] [-n MIN_LENGTH] [--inner]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

-n <min_length>

Minimum sequence length to retain.

--inner

If specified exclude consecutive missing characters at either end of the sequence.

FilterSeq.py maskqual

Masks low quality positions.

usage: FilterSeq.py maskqual [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                             [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                             [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                             [--fasta] [--nproc NPROC] [-q MIN_QUAL]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

-q <min_qual>

Quality score threshold.

FilterSeq.py missing

Filters reads by N or gap character count.

usage: FilterSeq.py missing [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                            [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                            [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                            [--fasta] [--nproc NPROC] [-n MAX_MISSING]
                            [--inner]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

-n <max_missing>

Threshold for fraction of gap or N nucleotides.

--inner

If specified exclude consecutive missing characters at either end of the sequence.

FilterSeq.py quality

Filters reads by quality score.

usage: FilterSeq.py quality [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                            [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                            [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                            [--fasta] [--nproc NPROC] [-q MIN_QUAL] [--inner]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

-q <min_qual>

Quality score threshold.

--inner

If specified exclude consecutive missing characters at either end of the sequence.

FilterSeq.py repeats

Filters reads by consecutive nucleotide repeats.

usage: FilterSeq.py repeats [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                            [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                            [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                            [--fasta] [--nproc NPROC] [-n MAX_REPEAT]
                            [--missing] [--inner]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

-n <max_repeat>

Threshold for fraction of repeating nucleotides.

--missing

If specified count consecutive gap and N characters ‘ in addition to {A,C,G,T}.

--inner

If specified exclude consecutive missing characters at either end of the sequence.

FilterSeq.py trimqual

Trims sequences by quality score decay.

usage: FilterSeq.py trimqual [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                             [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                             [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                             [--fasta] [--nproc NPROC] [-q MIN_QUAL]
                             [--win WINDOW] [--reverse]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

-q <min_qual>

Quality score threshold.

--win <window>

Nucleotide window size for moving average calculation.

--reverse

Specify to trim the head of the sequence rather than the tail.

MaskPrimers.py

Removes primers and annotates sequences with primer and barcode identifiers

usage: MaskPrimers.py [--version] [-h]  ...
--version

show program’s version number and exit

-h, --help

show this help message and exit

output files:
mask-pass

processed reads with successful primer matches.

mask-fail

raw reads failing primer identification.

output annotation fields:
SEQORIENT

the orientation of the output sequence. Either F (input) or RC (reverse complement of input).

PRIMER

name of the best primer match.

BARCODE

the sequence preceding the primer match. Only output when the –barcode flag is specified.

MaskPrimers.py align

Find primer matches using pairwise local alignment.

usage: MaskPrimers.py align [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                            [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                            [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                            [--fasta] [--delim DELIMITER DELIMITER DELIMITER]
                            [--nproc NPROC] -p PRIMER_FILE
                            [--maxerror MAX_ERROR] [--maxlen MAX_LEN]
                            [--gap GAP_PENALTY GAP_PENALTY] [--revpr]
                            [--skiprc] [--mode {cut,mask,trim,tag}]
                            [--barcode] [--bf BARCODE_FIELD]
                            [--pf PRIMER_FIELD]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

-p <primer_file>

A FASTA file containing primer sequences.

--maxerror <max_error>

Maximum allowable error rate.

--maxlen <max_len>

Length of the sequence window to scan for primers.

--gap <gap_penalty>

A list of two positive values defining the gap open and gap extension penalties for aligning the primers. Note: the error rate is calculated as the percentage of mismatches from the primer sequence with gap penalties reducing the match count accordingly; this may lead to error rates that differ from strict mismatch percentage when gaps are present in the alignment.

--revpr

Specify to match the tail-end of the sequence against the reverse complement of the primers. This also reverses the behavior of the –maxlen argument, such that the search window begins at the tail-end of the sequence.

--skiprc

Specify to prevent checking of sample reverse complement sequences.

--mode {cut,mask,trim,tag}

Specifies the action to take with the primer sequence. The “cut” mode will remove both the primer region and the preceding sequence. The “mask” mode will replace the primer region with Ns and remove the preceding sequence. The “trim” mode will remove the region preceding the primer, but leave the primer region intact. The “tag” mode will leave the input sequence unmodified.

--barcode

Specify to annotate reads sequences with barcode sequences (unique molecular identifiers) found preceding the primer.

--bf <barcode_field>

Name of the barcode annotation field.

--pf <primer_field>

Name of the annotation field containing the primer name.

MaskPrimers.py extract

Remove and annotate a fixed sequence region.

usage: MaskPrimers.py extract [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                              [-o OUT_FILES [OUT_FILES ...]]
                              [--outdir OUT_DIR] [--outname OUT_NAME]
                              [--log LOG_FILE] [--failed] [--fasta]
                              [--delim DELIMITER DELIMITER DELIMITER]
                              [--nproc NPROC] [--start START] --len LENGTH
                              [--revpr] [--mode {cut,mask,trim,tag}]
                              [--barcode] [--bf BARCODE_FIELD]
                              [--pf PRIMER_FIELD]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

--start <start>

The starting position of the sequence region to extract.

--len <length>

The length of the sequence to extract.

--revpr

Specify to extract from the tail end of the sequence with the start position relative to the end of the sequence.

--mode {cut,mask,trim,tag}

Specifies the action to take with the sequence region. The “cut” mode will remove the region. The “mask” mode will replace the specified region with Ns. The “trim” mode will remove the sequence preceding the specified region, but leave the region intact. The “tag” mode will leave the input sequence unmodified.

--barcode

Specify to remove the sequence preceding the extracted region and annotate the read with that sequence.

--bf <barcode_field>

Name of the barcode annotation field.

--pf <primer_field>

Name of the annotation field containing the extracted sequence region.

MaskPrimers.py score

Find primer matches by scoring primers at a fixed position.

usage: MaskPrimers.py score [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                            [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                            [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                            [--fasta] [--delim DELIMITER DELIMITER DELIMITER]
                            [--nproc NPROC] -p PRIMER_FILE [--start START]
                            [--maxerror MAX_ERROR] [--revpr]
                            [--mode {cut,mask,trim,tag}] [--barcode]
                            [--bf BARCODE_FIELD] [--pf PRIMER_FIELD]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

-p <primer_file>

A FASTA file containing primer sequences.

--start <start>

The starting position of the primer.

--maxerror <max_error>

Maximum allowable error rate.

--revpr

Specify to match the tail-end of the sequence against the reverse complement of the primers. This also reverses the behavior of the –start argument, such that start position is relative to the tail-end of the sequence.

--mode {cut,mask,trim,tag}

Specifies the action to take with the primer sequence. The “cut” mode will remove both the primer region and the preceding sequence. The “mask” mode will replace the primer region with Ns and remove the preceding sequence. The “trim” mode will remove the region preceding the primer, but leave the primer region intact. The “tag” mode will leave the input sequence unmodified.

--barcode

Specify to annotate reads sequences with barcode sequences (unique molecular identifiers) found preceding the primer.

--bf <barcode_field>

Name of the barcode annotation field.

--pf <primer_field>

Name of the annotation field containing the primer name.

PairSeq.py

Sorts and matches sequence records with matching coordinates across files

usage: PairSeq.py [--version] [-h] -1 SEQ_FILES_1 [SEQ_FILES_1 ...] -2
                  SEQ_FILES_2 [SEQ_FILES_2 ...] [--outdir OUT_DIR]
                  [--outname OUT_NAME] [--failed] [--fasta]
                  [--delim DELIMITER DELIMITER DELIMITER]
                  [--1f FIELDS_1 [FIELDS_1 ...]]
                  [--2f FIELDS_2 [FIELDS_2 ...]] [--act {min,max,sum,set,cat}]
                  [--coord {illumina,solexa,sra,454,presto}]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-1 <seq_files_1>

An ordered list of FASTA/FASTQ files containing head/primary sequences.

-2 <seq_files_2>

An ordered list of FASTA/FASTQ files containing tail/secondary sequences.

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--1f <fields_1>

The annotation fields to copy from file 1 records into file 2 records. If a copied annotation already exists in a file 2 record, then the annotations copied from file 1 will be added to the front of the existing annotation.

--2f <fields_2>

The annotation fields to copy from file 2 records into file 1 records. If a copied annotation already exists in a file 1 record, then the annotations copied from file 2 will be added to the end of the existing annotation.

--act {min,max,sum,set,cat}

The collapse actions to take on all fields copied between files to combine duplicate fields into a single value. The actions “min”, “max”, “sum” perform the corresponding mathematical operation on numeric annotations. The action “set” collapses annotations into a comma delimited list of unique values. The action “cat” concatenates the values together into a single string. Only applies if the field already exists in the header before being copying from the other file.

--coord {illumina,solexa,sra,454,presto}

The format of the sequence identifier which defines shared coordinate information across mate pairs.

output files:
pair-pass

successfully paired reads with modified annotations.

pair-fail

raw reads that could not be assigned to a mate-pair.

output annotation fields:
<user defined>

annotation fields specified by the –1f or –2f arguments.

ParseHeaders.py

Parses pRESTO annotations in FASTA/FASTQ sequence headers

usage: ParseHeaders.py [--version] [-h]  ...
--version

show program’s version number and exit

-h, --help

show this help message and exit

output files:
reheader-pass

reads passing annotation operation and modified accordingly.

reheader-fail

raw reads failing annotation operation.

headers

tab delimited table of the selected annotations.

output annotation fields:
<user defined>

annotation fields specified by the -f argument.

ParseHeaders.py add

Adds field/value pairs to header annotations

usage: ParseHeaders.py add [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                           [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                           [--outname OUT_NAME] [--failed] [--fasta]
                           [--delim DELIMITER DELIMITER DELIMITER] -f FIELDS
                           [FIELDS ...] -u VALUES [VALUES ...]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-f <fields>

List of fields to add.

-u <values>

List of values to add for each field.

ParseHeaders.py collapse

Collapses header annotations with multiple entries

usage: ParseHeaders.py collapse [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                                [-o OUT_FILES [OUT_FILES ...]]
                                [--outdir OUT_DIR] [--outname OUT_NAME]
                                [--failed] [--fasta]
                                [--delim DELIMITER DELIMITER DELIMITER] -f
                                FIELDS [FIELDS ...] --act
                                {min,max,sum,first,last,set,cat}
                                [{min,max,sum,first,last,set,cat} ...]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-f <fields>

List of fields to collapse.

--act {min,max,sum,first,last,set,cat}

List of actions to take for each field defining how each annotation will be combined into a single value. The actions “min”, “max”, “sum” perform the corresponding mathematical operation on numeric annotations. The actions “first” and “last” choose the value from the corresponding position in the annotation. The action “set” collapses annotations into a comma delimited list of unique values. The action “cat” concatenates the values together into a single string.

ParseHeaders.py copy

Copies header annotation fields

usage: ParseHeaders.py copy [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                            [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                            [--outname OUT_NAME] [--failed] [--fasta]
                            [--delim DELIMITER DELIMITER DELIMITER] -f FIELDS
                            [FIELDS ...] -k NAMES [NAMES ...]
                            [--act {min,max,sum,first,last,set,cat} [{min,max,sum,first,last,set,cat} ...]]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-f <fields>

List of fields to copy.

-k <names>

List of names for each copied field. If the new field is already present, the copied field will be merged into the existing field.

--act {min,max,sum,first,last,set,cat}

List of collapse actions to take on each new field following the copy operation defining how each annotation will be combined into a single value. The actions “min”, “max”, “sum” perform the corresponding mathematical operation on numeric annotations. The actions “first” and “last” choose the value from the corresponding position in the annotation. The action “set” collapses annotations into a comma delimited list of unique values. The action “cat” concatenates the values together into a single string.

ParseHeaders.py delete

Deletes fields from header annotations

usage: ParseHeaders.py delete [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                              [-o OUT_FILES [OUT_FILES ...]]
                              [--outdir OUT_DIR] [--outname OUT_NAME]
                              [--failed] [--fasta]
                              [--delim DELIMITER DELIMITER DELIMITER] -f
                              FIELDS [FIELDS ...]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-f <fields>

List of fields to delete.

ParseHeaders.py expand

Expands annotation fields with multiple values

usage: ParseHeaders.py expand [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                              [-o OUT_FILES [OUT_FILES ...]]
                              [--outdir OUT_DIR] [--outname OUT_NAME]
                              [--failed] [--fasta]
                              [--delim DELIMITER DELIMITER DELIMITER] -f
                              FIELDS [FIELDS ...] [--sep SEPARATOR]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-f <fields>

List of fields to expand.

--sep <separator>

The character separating each value in the fields.

ParseHeaders.py merge

Merge multiple annotations fields into a single field

usage: ParseHeaders.py merge [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                             [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                             [--outname OUT_NAME] [--failed] [--fasta]
                             [--delim DELIMITER DELIMITER DELIMITER] -f FIELDS
                             [FIELDS ...] -k NAME
                             [--act {min,max,sum,set,cat}] [--delete]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-f <fields>

List of fields to merge.

-k <name>

Name for the merged field. If the new field is already present, the merged fields will be merged into the existing field.

--act {min,max,sum,set,cat}

List of collapse actions to take on the new field following the merge defining how to combine the annotations into a single value. The actions “min”, “max”, “sum” perform the corresponding mathematical operation on numeric annotations. The action “set” collapses annotations into a comma delimited list of unique values. The action “cat” concatenates the values together into a single string.

--delete

If specified, delete the field that were merged from the output header.

ParseHeaders.py rename

Renames header annotation fields

usage: ParseHeaders.py rename [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                              [-o OUT_FILES [OUT_FILES ...]]
                              [--outdir OUT_DIR] [--outname OUT_NAME]
                              [--failed] [--fasta]
                              [--delim DELIMITER DELIMITER DELIMITER] -f
                              FIELDS [FIELDS ...] -k NAMES [NAMES ...]
                              [--act {min,max,sum,first,last,set,cat} [{min,max,sum,first,last,set,cat} ...]]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-f <fields>

List of fields to rename.

-k <names>

List of new names for each field. If the new field is already present, the renamed field will be merged into the existing field and the old field will be deleted.

--act {min,max,sum,first,last,set,cat}

List of collapse actions to take on each new field following the rename operation defining how each annotation will be combined into a single value. The actions “min”, “max”, “sum” perform the corresponding mathematical operation on numeric annotations. The actions “first” and “last” choose the value from the corresponding position in the annotation. The action “set” collapses annotations into a comma delimited list of unique values. The action “cat” concatenates the values together into a single string.

ParseHeaders.py table

Writes sequence headers to a table

usage: ParseHeaders.py table [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                             [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                             [--outname OUT_NAME] [--failed]
                             [--delim DELIMITER DELIMITER DELIMITER] -f FIELDS
                             [FIELDS ...]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-f <fields>

List of fields to collect. The sequence identifier may be specified using the hidden field name “ID”.

ParseLog.py

Parses records in the console log of pRESTO modules

usage: ParseLog.py [--version] [-h] [-o OUT_FILES [OUT_FILES ...]]
                   [--outdir OUT_DIR] [--outname OUT_NAME]
                   [--delim DELIMITER DELIMITER DELIMITER] -l RECORD_FILES
                   [RECORD_FILES ...] -f FIELDS [FIELDS ...]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-l <record_files>

List of log files to parse.

-f <fields>

List of fields to collect. The sequence identifier may be specified using the hidden field name “ID”.

output files:
table

tab delimited table of the selected annotations.

output annotation fields:
<user defined>

annotation fields specified by the -f argument.

SplitSeq.py

Sorts, samples and splits FASTA/FASTQ sequence files

usage: SplitSeq.py [--version] [-h]  ...
--version

show program’s version number and exit

-h, --help

show this help message and exit

output files:
part<part>

reads partitioned by count, where <part> is the partition number.

<field>-<value>

reads partitioned by annotation <field> and <value>.

under-<number>

reads partitioned by numeric threshold where the annotation value is strictly less than the threshold <number>.

atleast-<number>

reads partitioned by numeric threshold where the annotation value is greater than or equal to the threshold <number>.

sorted

reads sorted by annotation value.

sorted-part<part>

reads sorted by annotation value and partitioned by count, where <part> is the partition number.

sample<i>-n<count>

randomly sampled reads where <i> is a number specifying the sampling instance and <count> is the number of sampled reads.

selected

reads passing selection criteria.

output annotation fields:

None

SplitSeq.py count

Splits sequences files by number of records.

usage: SplitSeq.py count [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                         [--outdir OUT_DIR] [--outname OUT_NAME] [--fasta] -n
                         MAX_COUNT
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--fasta

Specify to force output as FASTA rather than FASTQ.

-n <max_count>

Maximum number of sequences in each new file

SplitSeq.py group

Splits sequences files by annotation.

usage: SplitSeq.py group [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                         [--outdir OUT_DIR] [--outname OUT_NAME] [--fasta]
                         [--delim DELIMITER DELIMITER DELIMITER] -f FIELD
                         [--num THRESHOLD]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-f <field>

Annotation field to split sequence files by

--num <threshold>

Specify to define the split field as numeric and group sequences by value.

SplitSeq.py sample

Randomly samples from unpaired sequences files.

usage: SplitSeq.py sample [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                          [--outdir OUT_DIR] [--outname OUT_NAME] [--fasta]
                          [--delim DELIMITER DELIMITER DELIMITER] -n MAX_COUNT
                          [MAX_COUNT ...] [-f FIELD] [-u VALUES [VALUES ...]]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-n <max_count>

Maximum number of sequences to sample from each file, field or annotation set. The default behavior, without the -f argument, is to sample from the complete set of sequences in the input file.

-f <field>

The annotation field for sampling criteria. If the -u argument is not also specified, then sampling will be performed for each unique annotation value in the declared field separately.

-u <values>

If specified, sampling will be restricted to sequences that contain one of the declared annotation values in the specified field. Requires the -f argument.

SplitSeq.py samplepair

Randomly samples from paired-end sequences files.

usage: SplitSeq.py samplepair [--version] [-h] -1 SEQ_FILES_1
                              [SEQ_FILES_1 ...] -2 SEQ_FILES_2
                              [SEQ_FILES_2 ...] [--outdir OUT_DIR]
                              [--outname OUT_NAME] [--fasta]
                              [--delim DELIMITER DELIMITER DELIMITER] -n
                              MAX_COUNT [MAX_COUNT ...] [-f FIELD]
                              [-u VALUES [VALUES ...]]
                              [--coord {illumina,solexa,sra,454,presto}]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-1 <seq_files_1>

An ordered list of FASTA/FASTQ files containing head/primary sequences.

-2 <seq_files_2>

An ordered list of FASTA/FASTQ files containing tail/secondary sequences.

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-n <max_count>

Maximum number of paired sequences to sample from each set of files, fields or annotations. The default behavior, without the -f argument, is to sample from the complete set of paired sequences in the input files.

-f <field>

The annotation field for sampling criteria. If the -u argument is not also specified, then sampling will be performed for each unique annotation value in the declared field separately.

-u <values>

If specified, sampling will be restricted to sequences that contain one of the declared annotation values in the specified field. Requires the -f argument.

--coord {illumina,solexa,sra,454,presto}

The format of the sequence identifier which defines shared coordinate information across paired read files.

SplitSeq.py select

Selects sequences from sequence files by annotation.

usage: SplitSeq.py select [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                          [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                          [--outname OUT_NAME] [--fasta]
                          [--delim DELIMITER DELIMITER DELIMITER] -f FIELD
                          [-u VALUE_LIST [VALUE_LIST ...] | -t VALUE_FILE]
                          [--not]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-f <field>

The annotation field for selection criteria.

-u <value_list>

A list of values to select for in the specified field. Mutually exclusive with -t.

-t <value_file>

A tab delimited file specifying values to select for in the specified field. The file must be formatted with the given field name in the header row. Values will be taken from that column. Mutually exclusive with -u.

--not

If specified, will perform negative matching. Meaning, sequences will be selected if they fail to match for all specified values.

SplitSeq.py sort

Sorts sequences files by annotation.

usage: SplitSeq.py sort [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                        [--outdir OUT_DIR] [--outname OUT_NAME] [--fasta]
                        [--delim DELIMITER DELIMITER DELIMITER] -f FIELD
                        [-n MAX_COUNT] [--num]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-f <field>

The annotation field to sort sequences by.

-n <max_count>

Maximum number of sequences in each new file.

--num

Specify to define the sort field as numeric rather than textual.

UnifyHeaders

Unifies annotation fields based on grouping scheme

usage: UnifyHeaders [--version] [-h]  ...
--version

show program’s version number and exit

-h, --help

show this help message and exit

output files:
unify-pass

Reads passing annotation filtering or consensus.

unify-fail

Reading failing filtering.

output annotation fields:
<user defined>

annotation fields specified by the -f and -k arguments.

UnifyHeaders consensus

Reassign fields to consensus values.

usage: UnifyHeaders consensus [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                              [-o OUT_FILES [OUT_FILES ...]]
                              [--outdir OUT_DIR] [--outname OUT_NAME]
                              [--log LOG_FILE] [--failed] [--fasta]
                              [--delim DELIMITER DELIMITER DELIMITER]
                              [--nproc NPROC] [-f SET_FIELD] [-k UNIFY_FIELD]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

-f <set_field>

The annotation field containing annotations, such as the UMI barcode, for sequence grouping.

-k <unify_field>

The name of the annotation field to find a consensus for per each sequence group.

UnifyHeaders delete

Delete sequences with differing field values.

usage: UnifyHeaders delete [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                           [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                           [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                           [--fasta] [--delim DELIMITER DELIMITER DELIMITER]
                           [--nproc NPROC] [-f SET_FIELD] [-k UNIFY_FIELD]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilized).

-f <set_field>

The annotation field containing annotations, such as the UMI barcode, for sequence grouping.

-k <unify_field>

The name of the annotation field to find a consensus for per each sequence group.

API

presto.Annotation

Annotation functions

presto.Annotation.addHeader(header, fields, values, delimiter=('|', '=', ','))

Adds fields and values to a sequence header

Parameters
  • header – an annotation dictionary returned by parseAnnotation.

  • fields – the list of fields to add or append to.

  • values – the list of annotation values to add for each field.

  • delimiter – a tuple of delimiters for (fields, values, value lists).

Returns

modified header dictionary.

Return type

dict

presto.Annotation.annotationConsensus(seq_iter, field, delimiter=('|', '=', ','))

Calculate a consensus annotation for a set of sequences

Parameters
  • seq_iter – an iterator or list of SeqRecord objects

  • field – the annotation field to take a consensus of

  • delimiter – a tuple of delimiters for (annotations, field/values, value lists)

Returns

Dictionary with keys

set containing a list of unique annotation values, count containing annotation counts, cons containing the consensus annotation, freq containing the majority annotation frequency

Return type

dict

presto.Annotation.collapseAnnotation(ann_dict, action, fields=None, delimiter=('|', '=', ','))

Collapses multiple annotations into new single annotations for each field

Parameters
  • ann_dict – dictionary of field/value pairs

  • action – collapse action to take; one of {min, max, sum, first, last, set, cat}

  • fields – subset of ann_dict to _collapse; if None, collapse all but the ID field

  • delimiter – Tuple of delimiters for (fields, values, value lists)

Returns

Modified field dictionary

Return type

OrderedDict

presto.Annotation.collapseHeader(header, fields, actions, delimiter=('|', '=', ','))

Collapses a sequence header

Parameters
  • header – an annotation dictionary returned by parseAnnotation.

  • fields – the list of fields to collapse.

  • actions – the list of collapse action take; one of (max, min, sum, first, last, set, cat) for each field.

  • delimiter – a tuple of delimiters for (fields, values, value lists).

Returns

modified header dictionary.

Return type

dict

presto.Annotation.convert454Header(desc)

Parses 454 headers into the pRESTO format

Parameters

desc (str) – a sequence description string.

Returns

a dictionary of header field and value pairs.

Return type

dict

Examples

New style 454 header:

@<accession> <length=##>
@GXGJ56Z01AE06X length=222

Old style 454 header:

@<rank_x_y> <length=##> <uaccno=accession>
@000034_0199_0169 length=437 uaccno=GNDG01201ARRCR
presto.Annotation.convertGenbankHeader(desc, delimiter=('|', '=', ','))

Converts GenBank and RefSeq headers into the pRESTO format

Parameters
  • desc (str) – a sequence description string.

  • delimiter (tuple) – a tuple of delimiters for (fields, values, value lists).

Returns

a dictionary of header field and value pairs.

Return type

dict

Examples

New style GenBank header:

<accession>.<version> <description>
>CM000663.2 Homo sapiens chromosome 1, GRCh38 reference primary assembly

Old style GenBank header:

gi|<GI record number>|<dbsrc>|<accession>.<version>|<description>
>gi|568336023|gb|CM000663.2| Homo sapiens chromosome 1, GRCh38 reference primary assembly
presto.Annotation.convertGenericHeader(desc, delimiter=('|', '=', ','))

Converts any header to the pRESTO format

Parameters
  • desc (str) – a sequence description string.

  • delimiter (tuple) – a tuple of delimiters for (fields, values, value lists).

Returns

a dictionary of header field and value pairs.

Return type

dict

presto.Annotation.convertIMGTHeader(desc, simple=False)

Converts germline headers from IMGT/GENE-DB into the pRESTO format

Parameters
  • desc (str) – a sequence description string.

  • simple (bool) – if True then the header will be converted to only the allele name.

Returns

a dictionary of header field and value pairs.

Return type

dict

Examples

IMGT header:

>X60503|IGHV1-18*02|Homo sapiens|F|V-REGION|142..417|276 nt|1| | | | |276+24=300|partial in 3'| |

Header contains 15 fields separated by | (http://imgt.org/genedb):

  1. IMGT/LIGM-DB accession number(s).

  2. Gene and allele name.

  3. Species.

  4. Functionality.

  5. Exon(s), region name(s), or extracted label(s).

  6. Start and end positions in the IMGT/LIGM-DB accession number(s).

  7. Number of nucleotides in the IMGT/LIGM-DB accession number(s).

  8. Codon start, or ‘NR’ (not relevant) for non coding labels and out-of-frame pseudogenes.

  9. Number of nucleotides added in 5' compared to the corresponding label extracted from IMGT/LIGM-DB.

  10. Number of nucleotides added or removed in 3' compared to the corresponding label extracted from IMGT/LIGM-DB.

  11. Number of added, deleted, and/or substituted nucleotides to correct sequencing errors, or ‘not corrected’ if non corrected sequencing errors.

  12. Number of amino acids (AA). This field indicates that the sequence is in amino acids.

  13. Number of characters in the sequence. Nucleotides (or AA) plus IMGT gaps.

  14. Partial (if it is).

  15. Reverse complementary (if it is).

presto.Annotation.convertIlluminaHeader(desc)

Converts Illumina headers into the pRESTO format

Parameters

desc (str) – a sequence description string.

Returns

a dictionary of header field and value pairs.

Return type

dict

Examples

New style Illumina header:

@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read number>:<is filtered>:<control number>:<index sequence>
@MISEQ:132:000000000-A2F3U:1:1101:14340:1555 2:N:0:ATCACG

Old style Illumina header:

@<instrument>:<flowcell lane>:<tile>:<x-pos>:<y-pos>#<index sequence>/<read number>
@HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
@MS6_33112:1:1101:18371:1066/1
presto.Annotation.convertMIGECHeader(desc)

Parses headers from the MIGEC tool into the pRESTO format

Parameters

desc (str) – a sequence description string.

Returns

a dictionary of header field and value pairs.

Return type

dict

Examples

MIGEC header:

@MIG UMI:<UMI sequence>:<consensus read count>
@MIG UMI:TCGGCCAACAAA:8
presto.Annotation.convertSRAHeader(desc)

Parses NCBI SRA or EMBL-EBI ENA headers into the pRESTO format

Parameters

desc (str) – a sequence description string.

Returns

a dictionary of header field and value pairs.

Return type

dict

Examples

Header from fastq-dump --split-files:

@<accession>.<spot> <original sequence description> <length=#>
@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
@SRR1383326.1 1 length=250

Header from fastq-dump --split-files -I:

@<accession>.<spot>.<read number> <original sequence description> <length=#>
@SRR1383326.1.1 1 length=250

Header from ENA:

@<accession>.<spot> <original sequence description>
@ERR220397.1 HKSQ1MM01DXT2W/3
@ERR346596.1 BS-DSFCONTROL04:4:000000000-A3F0Y:1:1101:12758:1640/1
@ERR346596.1 BS-DSFCONTROL04:4:000000000-A3F0Y:1:1101:12758:1640/2
presto.Annotation.copyHeader(header, fields, names, actions=None, delimiter=('|', '=', ','))

Copies fields in a sequence header

Parameters
  • header – an annotation dictionary returned by parseAnnotation.

  • fields – a list of the field names to copy.

  • names – a list of the new field names.

  • actions – the list of collapse action take after the copy; one of (max, min, sum, first, last, set, cat) for each field.

  • delimiter – a tuple of delimiters for (fields, values, value lists).

Returns

modified header dictionary.

Return type

dict

presto.Annotation.deleteHeader(header, fields, delimiter=('|', '=', ','))

Deletes fields from a sequence header

Parameters
  • header – an annotation dictionary returned by parseAnnotation.

  • fields – the list of fields to delete.

  • delimiter – a tuple of delimiters for (fields, values, value lists).

Returns

modified header dictionary

Return type

dict

presto.Annotation.expandHeader(header, fields, separator=',', delimiter=('|', '=', ','))

Splits and annotation value into separate fields in a sequence header

Parameters
  • header – an annotation dictionary returned by parseAnnotation.

  • fields – the field to split.

  • separator – the delimiter to split the values by.

  • delimiter – a tuple of delimiters for (fields, values, value lists).

Returns

modified header dictionary.

Return type

dict

presto.Annotation.flattenAnnotation(ann_dict, delimiter=('|', '=', ','))

Converts annotations from a dictionary to a FASTA/FASTQ sequence description

Parameters
  • ann_dict – Dictionary of field/value pairs

  • delimiter – Tuple of delimiters for (fields, values, value lists)

Returns

Formatted sequence description string

Return type

str

presto.Annotation.getAnnotationValues(seq_iter, field, unique=False, delimiter=('|', '=', ','))

Gets the set of unique annotation values in a sequence set

Parameters
  • seq_iter – Iterator or list of SeqRecord objects

  • field – Annotation field to retrieve values for

  • unique – If True return a list of only the unique values; if False return a list of all values

  • delimiter – Tuple of delimiters for (fields, values, value lists)

Returns

List of values for the field

Return type

list

presto.Annotation.getCoordKey(header, coord_type='presto', delimiter=('|', '=', ','))

Return the coordinate identifier for a sequence description

Parameters
  • header (str) – Sequence header string

  • coord_type (str) – Sequence header format; one of ‘illumina’, ‘solexa’, ‘sra’, ‘ena’, ‘454’, or ‘presto’; if unrecognized type or None, then return the input header.

  • delimiter (tuple) – Tuple of delimiters for (fields, values, value lists)

Returns

Coordinate identifier as a string.

Return type

str

presto.Annotation.mergeAnnotation(ann_dict_1, ann_dict_2, prepend=False, delimiter=('|', '=', ','))

Merges non-ID field annotations from one field dictionary into another

Parameters
  • ann_dict_1 – Dictionary of field/value pairs to append to

  • ann_dict_2 – Dictionary of field/value pairs to merge with ann_dict_2

  • prepend – If True then add ann_dict_2 values to the front of any ann_dict_1 values that are already present, rather than the default behavior of appending ann_dict_2 values.

  • delimiter – Tuple of delimiters for (fields, values, value lists)

Returns

Modified ann_dict_1 dictonary of field/value pairs

Return type

OrderedDict

presto.Annotation.mergeHeader(header, fields, name, action=None, delete=False, delimiter=('|', '=', ','))

Merges fields in a sequence header

Parameters
  • header – an annotation dictionary returned by parseAnnotation.

  • fields – a list of the field names to merge.

  • name – the name of the new field.

  • delete – if True delete the merged fields.

  • actions – the list of collapse action take after the merge one of (max, min, sum, first, last, set, cat).

  • delimiter – a tuple of delimiters for (fields, values, value lists)

Returns

modified header dictionary.

Return type

dict

presto.Annotation.parseAnnotation(record, fields=None, delimiter=('|', '=', ','))

Extracts annotations from a FASTA/FASTQ sequence description

Parameters
  • record – Description string to extract annotations from

  • fields – List of fields to subset the return dictionary to; if None return all fields

  • delimiter – a tuple of delimiters for (fields, values, value lists)

Returns

An OrderedDict of field/value pairs

Return type

OrderedDict

presto.Annotation.parseLog(record)

Parses an pRESTO log record

Parameters

record (str) – a string of lines representing a log record including newline characters.

Returns

parsed log contain field and values pairs as a dictionary.

Return type

collections.OrderedDict

presto.Annotation.renameAnnotation(ann_dict, old_field, new_field, delimiter=('|', '=', ','))

Renames an annotation and merges annotations if the new name already exists

Parameters
  • ann_dict – Dictionary of field/value pairs

  • old_field – Old field name

  • new_field – New field name

  • delimiter – Tuple of delimiters for (fields, values, value lists)

Returns

Modified fields dictonary

Return type

OrderedDict

presto.Annotation.renameHeader(header, fields, names, actions=None, delimiter=('|', '=', ','))

Renames fields in a sequence header

Parameters
  • header – an annotation dictionary returned by parseAnnotation.

  • fields – a list of the current field names.

  • names – a list of the new field names.

  • actions – the list of collapse action take after the rename; one of (max, min, sum, first, last, set, cat) for each field.

  • delimiter – a tuple of delimiters for (fields, values, value lists).

Returns

modified header dictionary.

Return type

dict

presto.Applications

External application wrappers

presto.Applications.makeBlastnDb(ref_file, db_exec='makeblastdb')

Makes a blastn database file

Parameters
  • ref_file – the path to the reference database file

  • db_exec – the path to the makeblastdb executable

Returns

(name and location of the database, handle of the tempfile.TemporaryDirectory)

Return type

tuple

presto.Applications.makeUBlastDb(ref_file, db_exec='usearch')

Makes a ublast database file

Parameters
  • ref_file – path to the reference database file.

  • db_exec – path to the usearch executable.

Returns

(location of the database, handle of the tempfile.NamedTemporaryFile)

Return type

tuple

presto.Applications.runBlastn(seq, database, evalue=1e-05, max_hits=100, aligner_exec='blastn')

Aligns a sequence against a reference database using BLASTN

Parameters
  • seq – a list of SeqRecord objects to align.

  • database – the path and name of the blastn database.

  • evalue – the E-value cut-off.

  • maxhits – the maximum number of hits returned.

  • aligner_exec – the path to the blastn executable.

Returns

Alignment results.

Return type

pandas.DataFrame

presto.Applications.runCDHit(seq_list, ident=0.9, length_ratio=0.0, seq_start=0, seq_end=None, max_memory=3000, threads=1, cluster_exec='cd-hit-est', min_word_match=12)

Cluster a set of sequences using CD-HIT

Parameters
  • seq_list (list) – a list of SeqRecord objects to align.

  • ident (float) – the sequence identity cutoff to be passed to cd-hit-est.

  • length_ratio (float) – cd-hit-est parameter defining the minimum short/long length ratio allowed within a cluster.

  • seq_start (int) – the start position to trim sequences at before clustering.

  • seq_end (int) – the end position to trim sequences at before clustering.

  • max_memory (int) – cd-hit-est max memory limit (Mb)

  • threads (int) – number of threads for cd-hit-est.

  • cluster_exec (str) – the path to the cd-hit-est executable.

  • min_word_match (int) – (not used for this function, see runUClust)

Returns

{cluster id: list of sequence ids}.

Return type

dict

presto.Applications.runMuscle(seq_list, aligner_exec='muscle')

Multiple aligns a set of sequences using MUSCLE

Parameters
  • seq_list – a list of SeqRecord objects to align

  • aligner_exec – the MUSCLE executable

Returns

Multiple alignment results.

Return type

Bio.Align.MultipleSeqAlignment

presto.Applications.runUBlast(seq, database, evalue=1e-05, max_hits=100, aligner_exec='usearch')

Aligns a sequence against a reference database using the usearch_local algorithm of USEARCH

Parameters
  • seq – a list of SeqRecord objects to align.

  • database – the path to the ublast database or a fasta file.

  • evalue – the E-value cut-off.

  • maxhits – the maximum number of hits returned.

  • aligner_exec – the path to the usearch executable.

Returns

Alignment results.

Return type

pandas.DataFrame

presto.Applications.runUClust(seq_list, ident=0.9, length_ratio=0.0, seq_start=0, seq_end=None, max_memory=3000, threads=1, cluster_exec='usearch', min_word_match=12)

Cluster a set of sequences using the UCLUST algorithm from USEARCH

Parameters
  • seq_list (list) – a list of SeqRecord objects to align.

  • ident (float) – the sequence identity cutoff to be passed to usearch.

  • length_ratio (float) – usearch parameter defining the minimum short/long length ratio allowed within a cluster.

  • seq_start (int) – the start position to trim sequences at before clustering.

  • seq_end (int) – the end position to trim sequences at before clustering.

  • max_memory (int) – currently ignored.

  • threads (int) – number of threads for usearch.

  • cluster_exec (str) – the path to the usearch executable.

  • min_word_match (int) – minimum number of words that must match to be clustered.

Returns

{cluster id: list of sequence ids}.

Return type

dict

presto.Commandline

Commandline interface functions

class presto.Commandline.CommonHelpFormatter(prog, indent_increment=2, max_help_position=24, width=None)

Bases: argparse.RawDescriptionHelpFormatter, argparse.ArgumentDefaultsHelpFormatter

Custom argparse.HelpFormatter

presto.Commandline.checkArgs(parser)

Checks that arguments have been provided and prints help if they have not.

Parameters

parser – An argparse.ArgumentParser defining the commandline arguments.

Returns

True if arguments are present. Prints help and exits if not.

Return type

boolean

presto.Commandline.getCommonArgParser(seq_in=True, seq_out=True, seq_paired=False, db_in=False, db_out=False, out_file=True, failed=True, log=True, annotation=True, multiproc=False, add_help=True)

Defines an ArgumentParser object with common pRESTO arguments

Parameters
  • seq_in (bool) – if True include sequence input arguments.

  • seq_out (bool) – if True include sequence output arguments.

  • seq_paired (bool) – if True defined paired-end sequence input and output arguments.

  • db_in (bool) – if True include tab-delimited database input arguments.

  • db_out (bool) – if True include tab-delimited database output arguments.

  • out_file (bool) – if True add explicit output file name arguments.

  • failed (bool) – If True include arguments for output of failed results.

  • log (bool) – If True include log arguments.

  • annotation (bool) – If True include annotation arguments.

  • multiproc (bool) – If True include multiprocessing arguments.

  • add_help (bool) – If True add help and version arguments.

Returns

ArgumentParser object.

Return type

argparse.ArgumentParser

presto.Commandline.parseCommonArgs(args, in_arg=None, in_types=None)

Checks common arguments from getCommonArgParser and transforms output options to a dictionary

Parameters
  • args – Argument Namespace defined by ArgumentParser.parse_args

  • in_arg – String defining a non-standard input file argument to verify; by default [‘db_files’, ‘seq_files’, ‘seq_files_1’, ‘seq_files_2’, ‘primer_file’] are supported in that order

  • in_types – List of types (file extensions as strings) to allow for files in file_arg if None do not check type

Returns

Dictionary copy of args with output arguments embedded in the dictionary out_args

Return type

dict

presto.IO

File I/O and logging functions

presto.IO.countSeqFile(seq_file)

Counts the records in FASTA/FASTQ files

Parameters

seq_file – FASTA or FASTQ file containing sample sequences

Returns

Count of records in the sequence file

Return type

int

presto.IO.countSeqSets(seq_file, field='BARCODE', delimiter=('|', '=', ','))

Identifies sets of sequences with the same ID field

Parameters
  • seq_file – FASTA or FASTQ file containing sample sequences

  • field – Annotation field containing set IDs

  • delimiter – Tuple of delimiters for (fields, values, value lists)

Returns

Count of unit set IDs in the sequence file

Return type

int

presto.IO.getFileType(filename)

Determines the type of a file by file extension

Parameters

filename – Filename

Returns

String defining the sequence type for SeqIO operations

Return type

str

presto.IO.getOutputHandle(in_file, out_label=None, out_dir=None, out_name=None, out_type=None)

Opens an output file handle

Parameters
  • in_file – Input filename

  • out_label – Text to be inserted before the file extension; if None do not add a label

  • out_type – the file extension of the output file; if None use input file extension

  • out_dir – the output directory; if None use directory of input file

  • out_name – the short filename to use for the output file; if None use input file short name

Returns

File handle

Return type

file

presto.IO.printCount(current, step, start_time=None, task=None, end=False)

Prints a progress bar to standard out

Parameters
  • current (int) – count of completed tasks.

  • step (int) – an int defining the progress increment to print at.

  • start_time (time.time) – task start time returned by time.time(); if None do not add run time to progress

  • task (str) – name of task to display.

  • end (bool) – if True print final log (add newline).

presto.IO.printDebug(message, debug=True, newline=False)

Prints a debug message to standard error

Parameters
  • message (str) – message.

  • debug (bool) – if True print the message.

  • newline (bool) – prefix with a newline if True.

presto.IO.printError(message, exit=True, newline=False)

Prints an error to standard error and exits

Parameters
  • message (str) – error message.

  • exit (bool) – if True exit after the message.

  • newline (bool) – prefix with a newline if True.

presto.IO.printLog(record, handle=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>, inset=None)

Formats a dictionary into a log string

Parameters
  • record – a dict or OrderedDict of field names mapping to values.

  • handle – the file handle to write the log to; if None do not write to file.

  • inset – minimum field name inset; if None automatically space field names.

Returns

Formatted multi-line string in the log format.

Return type

str

presto.IO.printMessage(message, start_time=None, end=False, width=25)

Prints a progress message to standard out

Parameters
  • message – Current task message

  • start_time – task start time returned by time.time(); if None do not add run time to progress

  • end – If True print final message (add newline)

  • width – Maximum number of characters for messages

presto.IO.printProgress(current, total, step, start_time=None, task=None, end=False)

Prints a progress bar to standard out

Parameters
  • current (int) – count of completed tasks.

  • total (int) – total task count.

  • step (float) – float defining the fractional progress increment to print at.

  • start_time (float) – task start time returned by time.time(); if None do not add run time to progress

  • task (str) – name of task to display.

  • end (bool) – if True print final log (add newline).

presto.IO.printWarning(message, newline=False)

Prints a warning to standard error

Parameters
  • message (str) – warning message.

  • newline (bool) – prefix with a newline if True.

presto.IO.readPrimerFile(primer_file, replace_special=True)

Processes primer sequences from file

Parameters
  • primer_file (str) – name of the FASTA file containing primer sequences.

  • replace_special (bool) – if True replace whitespace and pRESTO annotation delimiters with an underscore.

Returns

Dictionary mapping primer ID to sequence.

Return type

dict

presto.IO.readReferenceFile(ref_file)

Create a dictionary of cleaned and ungapped reference sequences.

Parameters

ref_file – reference sequences in fasta format.

Returns

cleaned and ungapped reference sequences;

with the key as the sequence ID and value as a Bio.SeqRecord for each reference sequence.

Return type

dict

presto.IO.readSeqFile(seq_file, index=False, key_func=None)

Reads FASTA/FASTQ files

Parameters
  • seq_file – FASTA or FASTQ file containing sample sequences

  • index – If True return a dictionary from SeqIO.index(); if False return an iterator from SeqIO.parse()

  • key_func – the key_function argument to pass to SeqIO.index if index=True

Returns

an interator of SeqRecords if index=False. A dict if True.

Return type

iter

presto.Multiprocessing

Multiprocessing functions

class presto.Multiprocessing.SeqData(id, data)

Bases: object

Class defining sequence data objects for worker processes

id

unique identifier

data

single object or a list of data objects.

valid

if True data is suitable for processing.

__bool__()

Boolean evaluation

Returns

True if the valid attribute is True

Return type

bool

__len__()

Length evaluation

Returns

number of objects in the data attribute.

Return type

int

class presto.Multiprocessing.SeqResult(id, data)

Bases: object

Class defining sequence result objects for collector processes

id

unique identifier

data

single unprocessed object or a list of unprocessed data objects.

results

single processed object or a list of processed data objects.

valid

if True processing was successful.

log

dictionary containing the processing log.

__bool__()

Boolean evaluation

Returns

True if the valid attribute is True.

Return type

bool

__len__()

Length evaluation

Returns

number of objects in the results attribute.

Return type

int

property data_count

Data length

Returns

number of objects in the data attribute.

Return type

int

presto.Multiprocessing.collectPairQueue(alive, result_queue, collect_queue, seq_file_1, seq_file_2, label, out_file=None, out_args={'delimiter': ('|', '=', ','), 'failed': True, 'log_file': None, 'out_dir': None, 'out_name': None, 'out_type': None, 'separator': ','})

Pulls from results queue, assembles results and manages log and file IO

Parameters
  • alive – a multiprocessing.Value boolean controlling whether processing continues; when False function returns.

  • result_queue – a multiprocessing.Queue holding worker results.

  • collect_queue – a multiprocessing.Queue holding collector return values.

  • seq_file_1 – the first sequence file name.

  • seq_file_2 – the second sequence file name.

  • label – task label used to tag the output files.

  • out_file – output file name. Automatically generated from the input file if None.

  • out_args – common output argument dictionary from parseCommonArgs.

Returns

adds a dictionary of {log: log object, out_files: output file names} to collect_queue.

Return type

None

presto.Multiprocessing.collectSeqQueue(alive, result_queue, collect_queue, seq_file, label, index_field=None, out_file=None, out_args={'delimiter': ('|', '=', ','), 'failed': True, 'log_file': None, 'out_dir': None, 'out_name': None, 'out_type': None, 'separator': ','})

Pulls from results queue, assembles results and manages log and file IO

Parameters
  • alive – a multiprocessing.Value boolean controlling whether processing continues; when False function returns.

  • result_queue – Multiprocessing.Queue holding worker results.

  • collect_queue – Multiprocessing.Queue to store collector return values.

  • seq_file – sample sequence file name.

  • label – task label used to tag the output files.

  • out_file – output file name. Automatically generated from the input file if None.

  • out_args – Common output argument dictionary from parseCommonArgs.

  • index_field – Field defining set membership for sequence sets if None data queue contained individual records.

Returns

Adds a dictionary with key value pairs to collect_queue containing

’log’ defining a log object, ‘out_files’ defining the output file names

Return type

None

presto.Multiprocessing.feedPairQueue(alive, data_queue, seq_file_1, seq_file_2, coord_type='presto', delimiter=('|', '=', ','))

Feeds the data queue with sequence pairs for processQueue processes

Parameters
  • alive – a multiprocessing.Value boolean controlling whether processing continues; when False function returns

  • data_queue – an multiprocessing.Queue to hold data for processing

  • seq_file_1 – the name of sequence file 1

  • seq_file_2 – the name of sequence file 2

  • coord_type – the sequence header format

  • delimiter – a tuple of delimiters for (fields, values, value lists)

Returns

None

presto.Multiprocessing.feedSeqQueue(alive, data_queue, seq_file, index_func=None, index_args={})

Feeds the data queue with SeqRecord objects

Parameters
  • alive – multiprocessing.Value boolean controlling whether processing continues; when False function returns

  • data_queue – multiprocessing.Queue to hold data for processing

  • seq_file – Sequence file to read input from

  • index_func – Function to use to define sequence sets if None do not index sets and feed individual records

  • index_args – Dictionary of arguments to pass to index_func

Returns

None

presto.Multiprocessing.manageProcesses(feed_func, work_func, collect_func, feed_args={}, work_args={}, collect_args={}, nproc=None, queue_size=None)

Manages feeder, worker and collector processes

Parameters
  • feed_func (function) – Data Queue feeder function.

  • work_func (function) – Worker function.

  • collect_func (function) – Result Queue collector function.

  • feed_args (dict) – Dictionary of arguments to pass to feed_func.

  • work_args (dict) – Dictionary of arguments to pass to work_func.

  • collect_args (dict) – Dictionary of arguments to pass to collect_func.

  • nproc (int) – Number of processQueue processes; if None defaults to the number of CPUs

  • queue_size (int) – Maximum size of the argument queue; if None defaults to 2*nproc

Returns

Dictionary of collector results

Return type

dict

presto.Multiprocessing.processSeqQueue(alive, data_queue, result_queue, process_func, process_args={})

Pulls from data queue, performs calculations, and feeds results queue

Parameters
  • alive – multiprocessing.Value boolean controlling whether processing continues; when False function returns

  • data_queue – multiprocessing.Queue holding data to process

  • result_queue – multiprocessing.Queue to hold processed results

  • process_func – function to use for processing sequences

  • process_args – Dictionary of arguments to pass to process_func

Returns

None

presto.Sequence

Sequence processing functions

class presto.Sequence.AssemblyRecord(seq=None)

Bases: object

A class defining a paired-end assembly result

property overlap
class presto.Sequence.AssemblyStats(n)

Bases: object

Class containing p-value and z-score matrices for scoring assemblies

class presto.Sequence.PrimerAlignment(seq=None)

Bases: object

A class defining a primer alignment result

__bool__()

Boolean evaluation of the alignment

Returns

evaluates to the value of the valid attribute

Return type

int

__len__()

Length of alignment

Returns

length of align_seq attribute

Return type

int

presto.Sequence.alignAssembly(head_seq, tail_seq, alpha=1e-05, max_error=0.3, min_len=8, max_len=1000, scan_reverse=False, assembly_stats=None, score_dict={('-', '-'): 0, ('-', '.'): 0, ('-', 'A'): 0, ('-', 'B'): 0, ('-', 'C'): 0, ('-', 'D'): 0, ('-', 'G'): 0, ('-', 'H'): 0, ('-', 'K'): 0, ('-', 'M'): 0, ('-', 'N'): 0, ('-', 'R'): 0, ('-', 'S'): 0, ('-', 'T'): 0, ('-', 'U'): 0, ('-', 'V'): 0, ('-', 'W'): 0, ('-', 'Y'): 0, ('.', '-'): 0, ('.', '.'): 0, ('.', 'A'): 0, ('.', 'B'): 0, ('.', 'C'): 0, ('.', 'D'): 0, ('.', 'G'): 0, ('.', 'H'): 0, ('.', 'K'): 0, ('.', 'M'): 0, ('.', 'N'): 0, ('.', 'R'): 0, ('.', 'S'): 0, ('.', 'T'): 0, ('.', 'U'): 0, ('.', 'V'): 0, ('.', 'W'): 0, ('.', 'Y'): 0, ('A', '-'): 0, ('A', '.'): 0, ('A', 'A'): 1, ('A', 'B'): 0, ('A', 'C'): 0, ('A', 'D'): 1, ('A', 'G'): 0, ('A', 'H'): 1, ('A', 'K'): 0, ('A', 'M'): 1, ('A', 'N'): 1, ('A', 'R'): 1, ('A', 'S'): 0, ('A', 'T'): 0, ('A', 'U'): 0, ('A', 'V'): 1, ('A', 'W'): 1, ('A', 'Y'): 0, ('B', '-'): 0, ('B', '.'): 0, ('B', 'A'): 0, ('B', 'B'): 1, ('B', 'C'): 1, ('B', 'D'): 1, ('B', 'G'): 1, ('B', 'H'): 1, ('B', 'K'): 1, ('B', 'M'): 1, ('B', 'N'): 1, ('B', 'R'): 1, ('B', 'S'): 1, ('B', 'T'): 1, ('B', 'U'): 1, ('B', 'V'): 1, ('B', 'W'): 1, ('B', 'Y'): 1, ('C', '-'): 0, ('C', '.'): 0, ('C', 'A'): 0, ('C', 'B'): 1, ('C', 'C'): 1, ('C', 'D'): 0, ('C', 'G'): 0, ('C', 'H'): 1, ('C', 'K'): 0, ('C', 'M'): 1, ('C', 'N'): 1, ('C', 'R'): 0, ('C', 'S'): 1, ('C', 'T'): 0, ('C', 'U'): 0, ('C', 'V'): 1, ('C', 'W'): 0, ('C', 'Y'): 1, ('D', '-'): 0, ('D', '.'): 0, ('D', 'A'): 1, ('D', 'B'): 1, ('D', 'C'): 0, ('D', 'D'): 1, ('D', 'G'): 1, ('D', 'H'): 1, ('D', 'K'): 1, ('D', 'M'): 1, ('D', 'N'): 1, ('D', 'R'): 1, ('D', 'S'): 1, ('D', 'T'): 1, ('D', 'U'): 1, ('D', 'V'): 1, ('D', 'W'): 1, ('D', 'Y'): 1, ('G', '-'): 0, ('G', '.'): 0, ('G', 'A'): 0, ('G', 'B'): 1, ('G', 'C'): 0, ('G', 'D'): 1, ('G', 'G'): 1, ('G', 'H'): 0, ('G', 'K'): 1, ('G', 'M'): 0, ('G', 'N'): 1, ('G', 'R'): 1, ('G', 'S'): 1, ('G', 'T'): 0, ('G', 'U'): 0, ('G', 'V'): 1, ('G', 'W'): 0, ('G', 'Y'): 0, ('H', '-'): 0, ('H', '.'): 0, ('H', 'A'): 1, ('H', 'B'): 1, ('H', 'C'): 1, ('H', 'D'): 1, ('H', 'G'): 0, ('H', 'H'): 1, ('H', 'K'): 1, ('H', 'M'): 1, ('H', 'N'): 1, ('H', 'R'): 1, ('H', 'S'): 1, ('H', 'T'): 1, ('H', 'U'): 1, ('H', 'V'): 1, ('H', 'W'): 1, ('H', 'Y'): 1, ('K', '-'): 0, ('K', '.'): 0, ('K', 'A'): 0, ('K', 'B'): 1, ('K', 'C'): 0, ('K', 'D'): 1, ('K', 'G'): 1, ('K', 'H'): 1, ('K', 'K'): 1, ('K', 'M'): 0, ('K', 'N'): 1, ('K', 'R'): 1, ('K', 'S'): 1, ('K', 'T'): 1, ('K', 'U'): 1, ('K', 'V'): 1, ('K', 'W'): 1, ('K', 'Y'): 1, ('M', '-'): 0, ('M', '.'): 0, ('M', 'A'): 1, ('M', 'B'): 1, ('M', 'C'): 1, ('M', 'D'): 1, ('M', 'G'): 0, ('M', 'H'): 1, ('M', 'K'): 0, ('M', 'M'): 1, ('M', 'N'): 1, ('M', 'R'): 1, ('M', 'S'): 1, ('M', 'T'): 0, ('M', 'U'): 0, ('M', 'V'): 1, ('M', 'W'): 1, ('M', 'Y'): 1, ('N', '-'): 1, ('N', '.'): 1, ('N', 'A'): 1, ('N', 'B'): 1, ('N', 'C'): 1, ('N', 'D'): 1, ('N', 'G'): 1, ('N', 'H'): 1, ('N', 'K'): 1, ('N', 'M'): 1, ('N', 'N'): 1, ('N', 'R'): 1, ('N', 'S'): 1, ('N', 'T'): 1, ('N', 'U'): 1, ('N', 'V'): 1, ('N', 'W'): 1, ('N', 'Y'): 1, ('R', '-'): 0, ('R', '.'): 0, ('R', 'A'): 1, ('R', 'B'): 1, ('R', 'C'): 0, ('R', 'D'): 1, ('R', 'G'): 1, ('R', 'H'): 1, ('R', 'K'): 1, ('R', 'M'): 1, ('R', 'N'): 1, ('R', 'R'): 1, ('R', 'S'): 1, ('R', 'T'): 0, ('R', 'U'): 0, ('R', 'V'): 1, ('R', 'W'): 1, ('R', 'Y'): 0, ('S', '-'): 0, ('S', '.'): 0, ('S', 'A'): 0, ('S', 'B'): 1, ('S', 'C'): 1, ('S', 'D'): 1, ('S', 'G'): 1, ('S', 'H'): 1, ('S', 'K'): 1, ('S', 'M'): 1, ('S', 'N'): 1, ('S', 'R'): 1, ('S', 'S'): 1, ('S', 'T'): 0, ('S', 'U'): 0, ('S', 'V'): 1, ('S', 'W'): 0, ('S', 'Y'): 1, ('T', '-'): 0, ('T', '.'): 0, ('T', 'A'): 0, ('T', 'B'): 1, ('T', 'C'): 0, ('T', 'D'): 1, ('T', 'G'): 0, ('T', 'H'): 1, ('T', 'K'): 1, ('T', 'M'): 0, ('T', 'N'): 1, ('T', 'R'): 0, ('T', 'S'): 0, ('T', 'T'): 1, ('T', 'U'): 1, ('T', 'V'): 0, ('T', 'W'): 1, ('T', 'Y'): 1, ('U', '-'): 0, ('U', '.'): 0, ('U', 'A'): 0, ('U', 'B'): 1, ('U', 'C'): 0, ('U', 'D'): 1, ('U', 'G'): 0, ('U', 'H'): 1, ('U', 'K'): 1, ('U', 'M'): 0, ('U', 'N'): 1, ('U', 'R'): 0, ('U', 'S'): 0, ('U', 'T'): 1, ('U', 'U'): 1, ('U', 'V'): 0, ('U', 'W'): 1, ('U', 'Y'): 1, ('V', '-'): 0, ('V', '.'): 0, ('V', 'A'): 1, ('V', 'B'): 1, ('V', 'C'): 1, ('V', 'D'): 1, ('V', 'G'): 1, ('V', 'H'): 1, ('V', 'K'): 1, ('V', 'M'): 1, ('V', 'N'): 1, ('V', 'R'): 1, ('V', 'S'): 1, ('V', 'T'): 0, ('V', 'U'): 0, ('V', 'V'): 1, ('V', 'W'): 1, ('V', 'Y'): 1, ('W', '-'): 0, ('W', '.'): 0, ('W', 'A'): 1, ('W', 'B'): 1, ('W', 'C'): 0, ('W', 'D'): 1, ('W', 'G'): 0, ('W', 'H'): 1, ('W', 'K'): 1, ('W', 'M'): 1, ('W', 'N'): 1, ('W', 'R'): 1, ('W', 'S'): 0, ('W', 'T'): 1, ('W', 'U'): 1, ('W', 'V'): 1, ('W', 'W'): 1, ('W', 'Y'): 1, ('Y', '-'): 0, ('Y', '.'): 0, ('Y', 'A'): 0, ('Y', 'B'): 1, ('Y', 'C'): 1, ('Y', 'D'): 1, ('Y', 'G'): 0, ('Y', 'H'): 1, ('Y', 'K'): 1, ('Y', 'M'): 1, ('Y', 'N'): 1, ('Y', 'R'): 0, ('Y', 'S'): 1, ('Y', 'T'): 1, ('Y', 'U'): 1, ('Y', 'V'): 1, ('Y', 'W'): 1, ('Y', 'Y'): 1})

Stitches two sequences together by aligning the ends

Parameters
  • head_seq – the head SeqRecord.

  • head_seq – the tail SeqRecord.

  • alpha – the minimum p-value for a valid assembly.

  • max_error – the maximum error rate for a valid assembly.

  • min_len – minimum length of overlap to test.

  • max_len – maximum length of overlap to test.

  • scan_reverse – if True allow the head sequence to overhang the end of the tail sequence if False end alignment scan at end of tail sequence or start of head sequence.

  • assembly_stats – optional successes by trials numpy.array of p-values.

  • score_dict – optional dictionary of character scores in the . form {(char1, char2): score}.

Returns

assembled sequence object.

Return type

AssemblyRecord

presto.Sequence.calculateDiversity(seq_list, score_dict=getDNAScoreDict())

Determine the average pairwise error rate for a list of sequences

Parameters
  • seq_list – List of SeqRecord objects to score

  • score_dict – Optional dictionary of alignment scores as {(char1, char2): score}

Returns

Average pairwise error rate for the list of sequences

Return type

float

presto.Sequence.calculateSetError(seq_list, ref_seq, ignore_chars=['n', 'N'], score_dict=getDNAScoreDict())

Counts the occurrence of nucleotide mismatches from a reference in a set of sequences

Parameters
  • seq_list – list of SeqRecord objects with aligned sequences.

  • ref_seq – SeqRecord object containing the reference sequence to match against.

  • ignore_chars – list of characters to exclude from mismatch counts.

  • score_dict – optional dictionary of alignment scores as {(char1, char2): score}.

Returns

error rate for the set.

Return type

float

presto.Sequence.checkSeqEqual(seq1, seq2, ignore_chars={'-', '.', 'N', 'n'})

Determine if two sequences are equal, excluding missing positions

Parameters
  • seq1 – SeqRecord object

  • seq2 – SeqRecord object

  • ignore_chars – Set of characters to ignore

Returns

True if the sequences are equal

Return type

bool

presto.Sequence.compilePrimers(primers)

Translates IUPAC Ambiguous Nucleotide characters to regular expressions and compiles them

Parameters

key – Dictionary of sequences to translate

Returns

Dictionary of compiled regular expressions

Return type

dict

presto.Sequence.consensusUnify(data, field, delimiter=('|', '=', ','))

Reassigns all annotations to the consensus annotation in group

Parameters
  • data – SeqData object contain sequences to process.

  • field – field containing annotations to collapse.

  • delimiter – a tuple of delimiters for (annotations, field/values, value lists).

Returns

modified sequences.

Return type

SeqResult

presto.Sequence.deleteSeqPositions(seq, positions)

Deletes a list of positions from a SeqRecord

Parameters
  • seq – SeqRecord objects

  • positions – Set of positions (indices) to delete

Returns

Modified SeqRecord with the specified positions removed

Return type

SeqRecord

presto.Sequence.deletionUnify(data, field, delimiter=('|', '=', ','))

Removes all sequences with differing annotations in a group

Parameters
  • data – SeqData object contain sequences to process.

  • field – field containing annotations to collapse.

  • delimiter – a tuple of delimiters for (annotations, field/values, value lists).

Returns

modified sequences.

Return type

SeqResult

presto.Sequence.extractAlignment(seq_record, start, length, rev_primer=False)

Extracts a subsequence from sequence

Parameters
  • data – SeqRecord to process.

  • start – position where subsequence starts.

  • length – the length of the subsequence to extract.

  • rev_primer – if True extract relative to the tail end of the sequence.

Returns

extraction results as an alignment object

Return type

presto.Sequence.PrimerAlignment

presto.Sequence.filterLength(data, min_length=250, inner=True, missing_chars='-nN.')

Filters sequences by length

Parameters
  • data (SeqData) – a SeqData object with a single SeqRecord to process.

  • min_length (int) – the minimum length allowed.

  • inner (bool) – if True exclude outer missing characters from calculation.

  • missing_chars (str) – a string of missing character values.

Returns

SeqResult object.

Return type

SeqResult

presto.Sequence.filterMissing(data, max_missing=10, inner=True, missing_chars='-nN.')

Filters sequences by number of missing nucleotides

Parameters
  • data (SeqData) – SeqData object with a single SeqRecord to process.

  • max_missing (int) – the maximum number of allowed ambiguous characters.

  • inner (bool) – if True exclude outer missing characters from calculation.

  • missing_chars (str) – a string of missing character values.

Returns

SeqResult object.

Return type

SeqResult

presto.Sequence.filterQuality(data, min_qual=0, inner=True, missing_chars='-nN.')

Filters sequences by quality score

Parameters
  • data (SeqData) – a SeqData object with a single SeqRecord to process.

  • min_qual (int) – minimum mean quality score for retained sequences.

  • inner (bool) – if True exclude outer missing characters from calculation.

  • missing_chars (str) – a string of missing character values.

Returns

SeqResult object.

Return type

SeqResult

presto.Sequence.filterRepeats(data, max_repeat=15, include_missing=False, inner=True, missing_chars='-nN.')

Filters sequences by fraction of ambiguous nucleotides

Parameters
  • data (SeqData) – a SeqData object with a single SeqRecord to process.

  • max_repeat (int) – the maximum number of allowed repeating characters.

  • include_missing (int) – if True count ambiguous character repeats; if False do not consider ambiguous character repeats.

  • inner (int) – if True exclude outer missing characters from calculation.

  • missing_chars (str) – a string of missing character values.

Returns

SeqResult object.

Return type

SeqResult

presto.Sequence.findGapPositions(seq_list, max_gap, gap_chars={'-', '.'})

Finds positions in a set of aligned sequences with a high number of gap characters.

Parameters
  • seq_list – List of SeqRecord objects with aligned sequences

  • max_gap – Float of the maximum gap frequency to consider a position as non-gapped

  • gap_chars – Set of characters to consider as gaps

Returns

Positions (indices) with gap frequency greater than max_gap

Return type

list

presto.Sequence.frequencyConsensus(seq_list, min_freq=0.6, ignore_chars={'-', '.', 'N', 'n'})

Builds a consensus sequence from a set of sequences

Parameters
  • set_seq – List of SeqRecord objects

  • min_freq – Frequency cutoff to assign a base

  • ignore_chars – Set of characters to exclude when building a consensus sequence

Returns

Consensus SeqRecord object

Return type

SeqRecord

presto.Sequence.getAAScoreDict(mask_score=None, gap_score=None)

Generates a score dictionary

Parameters
  • mask_score – Tuple of length two defining scores for all matches against an X character for (a, b), with the score for character (a) taking precedence; if None score symmetrically according to IUPAC character identity

  • gap_score – Tuple of length two defining score for all matches against a [-, .] character for (a, b), with the score for character (a) taking precedence; if None score symmetrically according to IUPAC character identity

Returns

Score dictionary with keys (char1, char2) mapping to scores

Return type

dict

presto.Sequence.getDNAScoreDict(mask_score=None, gap_score=None)

Generates a score dictionary

Parameters
  • mask_score – Tuple of length two defining scores for all matches against an N character for (a, b), with the score for character (a) taking precedence; if None score symmetrically according to IUPAC character identity

  • gap_score – Tuple of length two defining score for all matches against a [-, .] character for (a, b), with the score for character (a) taking precedence; if None score symmetrically according to IUPAC character identity

Returns

Score dictionary with keys (char1, char2) mapping to scores

Return type

dict

presto.Sequence.indexSeqSets(seq_dict, field='BARCODE', delimiter=('|', '=', ','))

Identifies sets of sequences with the same ID field

Parameters
  • seq_dict – a dictionary index of sequences returned from SeqIO.index()

  • field – the annotation field containing set IDs

  • delimiter – a tuple of delimiters for (fields, values, value lists)

Returns

Dictionary mapping set name to a list of record names

Return type

dict

presto.Sequence.joinAssembly(head_seq, tail_seq, gap=0, insert_seq=None)

Concatenates two sequences

Parameters
  • head_seq – the head SeqRecord.

  • tail_seq – the tail SeqRecord.

  • gap – number of gap characters to insert between head and tail ignored if insert_seq is not None.

  • insert_seq – a string or Bio.Seq.Seq object, to insert between the head and tail; if None insert with N characters.

Returns

assembled sequence object.

Return type

AssemblyRecord

presto.Sequence.localAlignment(seq_record, primers, primers_regex=None, max_error=0.3, max_len=1000, rev_primer=False, skip_rc=False, gap_penalty=(1, 1), score_dict={('-', '-'): 0, ('-', '.'): 0, ('-', 'A'): 0, ('-', 'B'): 0, ('-', 'C'): 0, ('-', 'D'): 0, ('-', 'G'): 0, ('-', 'H'): 0, ('-', 'K'): 0, ('-', 'M'): 0, ('-', 'N'): 0, ('-', 'R'): 0, ('-', 'S'): 0, ('-', 'T'): 0, ('-', 'U'): 0, ('-', 'V'): 0, ('-', 'W'): 0, ('-', 'Y'): 0, ('.', '-'): 0, ('.', '.'): 0, ('.', 'A'): 0, ('.', 'B'): 0, ('.', 'C'): 0, ('.', 'D'): 0, ('.', 'G'): 0, ('.', 'H'): 0, ('.', 'K'): 0, ('.', 'M'): 0, ('.', 'N'): 0, ('.', 'R'): 0, ('.', 'S'): 0, ('.', 'T'): 0, ('.', 'U'): 0, ('.', 'V'): 0, ('.', 'W'): 0, ('.', 'Y'): 0, ('A', '-'): 0, ('A', '.'): 0, ('A', 'A'): 1, ('A', 'B'): 0, ('A', 'C'): 0, ('A', 'D'): 1, ('A', 'G'): 0, ('A', 'H'): 1, ('A', 'K'): 0, ('A', 'M'): 1, ('A', 'N'): 1, ('A', 'R'): 1, ('A', 'S'): 0, ('A', 'T'): 0, ('A', 'U'): 0, ('A', 'V'): 1, ('A', 'W'): 1, ('A', 'Y'): 0, ('B', '-'): 0, ('B', '.'): 0, ('B', 'A'): 0, ('B', 'B'): 1, ('B', 'C'): 1, ('B', 'D'): 1, ('B', 'G'): 1, ('B', 'H'): 1, ('B', 'K'): 1, ('B', 'M'): 1, ('B', 'N'): 1, ('B', 'R'): 1, ('B', 'S'): 1, ('B', 'T'): 1, ('B', 'U'): 1, ('B', 'V'): 1, ('B', 'W'): 1, ('B', 'Y'): 1, ('C', '-'): 0, ('C', '.'): 0, ('C', 'A'): 0, ('C', 'B'): 1, ('C', 'C'): 1, ('C', 'D'): 0, ('C', 'G'): 0, ('C', 'H'): 1, ('C', 'K'): 0, ('C', 'M'): 1, ('C', 'N'): 1, ('C', 'R'): 0, ('C', 'S'): 1, ('C', 'T'): 0, ('C', 'U'): 0, ('C', 'V'): 1, ('C', 'W'): 0, ('C', 'Y'): 1, ('D', '-'): 0, ('D', '.'): 0, ('D', 'A'): 1, ('D', 'B'): 1, ('D', 'C'): 0, ('D', 'D'): 1, ('D', 'G'): 1, ('D', 'H'): 1, ('D', 'K'): 1, ('D', 'M'): 1, ('D', 'N'): 1, ('D', 'R'): 1, ('D', 'S'): 1, ('D', 'T'): 1, ('D', 'U'): 1, ('D', 'V'): 1, ('D', 'W'): 1, ('D', 'Y'): 1, ('G', '-'): 0, ('G', '.'): 0, ('G', 'A'): 0, ('G', 'B'): 1, ('G', 'C'): 0, ('G', 'D'): 1, ('G', 'G'): 1, ('G', 'H'): 0, ('G', 'K'): 1, ('G', 'M'): 0, ('G', 'N'): 1, ('G', 'R'): 1, ('G', 'S'): 1, ('G', 'T'): 0, ('G', 'U'): 0, ('G', 'V'): 1, ('G', 'W'): 0, ('G', 'Y'): 0, ('H', '-'): 0, ('H', '.'): 0, ('H', 'A'): 1, ('H', 'B'): 1, ('H', 'C'): 1, ('H', 'D'): 1, ('H', 'G'): 0, ('H', 'H'): 1, ('H', 'K'): 1, ('H', 'M'): 1, ('H', 'N'): 1, ('H', 'R'): 1, ('H', 'S'): 1, ('H', 'T'): 1, ('H', 'U'): 1, ('H', 'V'): 1, ('H', 'W'): 1, ('H', 'Y'): 1, ('K', '-'): 0, ('K', '.'): 0, ('K', 'A'): 0, ('K', 'B'): 1, ('K', 'C'): 0, ('K', 'D'): 1, ('K', 'G'): 1, ('K', 'H'): 1, ('K', 'K'): 1, ('K', 'M'): 0, ('K', 'N'): 1, ('K', 'R'): 1, ('K', 'S'): 1, ('K', 'T'): 1, ('K', 'U'): 1, ('K', 'V'): 1, ('K', 'W'): 1, ('K', 'Y'): 1, ('M', '-'): 0, ('M', '.'): 0, ('M', 'A'): 1, ('M', 'B'): 1, ('M', 'C'): 1, ('M', 'D'): 1, ('M', 'G'): 0, ('M', 'H'): 1, ('M', 'K'): 0, ('M', 'M'): 1, ('M', 'N'): 1, ('M', 'R'): 1, ('M', 'S'): 1, ('M', 'T'): 0, ('M', 'U'): 0, ('M', 'V'): 1, ('M', 'W'): 1, ('M', 'Y'): 1, ('N', '-'): 0, ('N', '.'): 0, ('N', 'A'): 0, ('N', 'B'): 0, ('N', 'C'): 0, ('N', 'D'): 0, ('N', 'G'): 0, ('N', 'H'): 0, ('N', 'K'): 0, ('N', 'M'): 0, ('N', 'N'): 0, ('N', 'R'): 0, ('N', 'S'): 0, ('N', 'T'): 0, ('N', 'U'): 0, ('N', 'V'): 0, ('N', 'W'): 0, ('N', 'Y'): 0, ('R', '-'): 0, ('R', '.'): 0, ('R', 'A'): 1, ('R', 'B'): 1, ('R', 'C'): 0, ('R', 'D'): 1, ('R', 'G'): 1, ('R', 'H'): 1, ('R', 'K'): 1, ('R', 'M'): 1, ('R', 'N'): 1, ('R', 'R'): 1, ('R', 'S'): 1, ('R', 'T'): 0, ('R', 'U'): 0, ('R', 'V'): 1, ('R', 'W'): 1, ('R', 'Y'): 0, ('S', '-'): 0, ('S', '.'): 0, ('S', 'A'): 0, ('S', 'B'): 1, ('S', 'C'): 1, ('S', 'D'): 1, ('S', 'G'): 1, ('S', 'H'): 1, ('S', 'K'): 1, ('S', 'M'): 1, ('S', 'N'): 1, ('S', 'R'): 1, ('S', 'S'): 1, ('S', 'T'): 0, ('S', 'U'): 0, ('S', 'V'): 1, ('S', 'W'): 0, ('S', 'Y'): 1, ('T', '-'): 0, ('T', '.'): 0, ('T', 'A'): 0, ('T', 'B'): 1, ('T', 'C'): 0, ('T', 'D'): 1, ('T', 'G'): 0, ('T', 'H'): 1, ('T', 'K'): 1, ('T', 'M'): 0, ('T', 'N'): 1, ('T', 'R'): 0, ('T', 'S'): 0, ('T', 'T'): 1, ('T', 'U'): 1, ('T', 'V'): 0, ('T', 'W'): 1, ('T', 'Y'): 1, ('U', '-'): 0, ('U', '.'): 0, ('U', 'A'): 0, ('U', 'B'): 1, ('U', 'C'): 0, ('U', 'D'): 1, ('U', 'G'): 0, ('U', 'H'): 1, ('U', 'K'): 1, ('U', 'M'): 0, ('U', 'N'): 1, ('U', 'R'): 0, ('U', 'S'): 0, ('U', 'T'): 1, ('U', 'U'): 1, ('U', 'V'): 0, ('U', 'W'): 1, ('U', 'Y'): 1, ('V', '-'): 0, ('V', '.'): 0, ('V', 'A'): 1, ('V', 'B'): 1, ('V', 'C'): 1, ('V', 'D'): 1, ('V', 'G'): 1, ('V', 'H'): 1, ('V', 'K'): 1, ('V', 'M'): 1, ('V', 'N'): 1, ('V', 'R'): 1, ('V', 'S'): 1, ('V', 'T'): 0, ('V', 'U'): 0, ('V', 'V'): 1, ('V', 'W'): 1, ('V', 'Y'): 1, ('W', '-'): 0, ('W', '.'): 0, ('W', 'A'): 1, ('W', 'B'): 1, ('W', 'C'): 0, ('W', 'D'): 1, ('W', 'G'): 0, ('W', 'H'): 1, ('W', 'K'): 1, ('W', 'M'): 1, ('W', 'N'): 1, ('W', 'R'): 1, ('W', 'S'): 0, ('W', 'T'): 1, ('W', 'U'): 1, ('W', 'V'): 1, ('W', 'W'): 1, ('W', 'Y'): 1, ('Y', '-'): 0, ('Y', '.'): 0, ('Y', 'A'): 0, ('Y', 'B'): 1, ('Y', 'C'): 1, ('Y', 'D'): 1, ('Y', 'G'): 0, ('Y', 'H'): 1, ('Y', 'K'): 1, ('Y', 'M'): 1, ('Y', 'N'): 1, ('Y', 'R'): 0, ('Y', 'S'): 1, ('Y', 'T'): 1, ('Y', 'U'): 1, ('Y', 'V'): 1, ('Y', 'W'): 1, ('Y', 'Y'): 1})

Performs pairwise local alignment of a list of short sequences against a long sequence

Parameters
  • seq_record – a SeqRecord object to align primers against

  • primers – dictionary of {names: short IUPAC ambiguous sequence strings}

  • primers_regex – optional dictionary of {names: compiled primer regular expressions}

  • max_error – maximum acceptable error rate before aligning reverse complement

  • max_len – maximum length of sample sequence to align

  • rev_primer – if True align with the tail end of the sequence

  • skip_rc – if True do not check reverse complement sequences

  • gap_penalty – a tuple of positive (gap open, gap extend) penalties

  • score_dict – optional dictionary of alignment scores as {(char1, char2): score}

Returns

primer alignment result object

Return type

presto.Sequence.PrimerAlignment

presto.Sequence.maskQuality(data, min_qual=0)

Masks characters by in sequence by quality score

Parameters
  • data (SeqData) – a SeqData object with a single SeqRecord to process.

  • min_qual (int) – minimum quality for retained characters.

Returns

SeqResult object.

Return type

SeqResult

presto.Sequence.maskSeq(align, mode='mask', barcode=False, barcode_field='BARCODE', primer_field='PRIMER', delimiter=('|', '=', ','))

Create an output sequence with primers masked or cut

Parameters
  • align – a PrimerAlignment object.

  • mode – defines the action taken; one of ‘cut’, ‘mask’, ‘tag’ or ‘trim’.

  • barcode – if True add sequence preceding primer to description.

  • barcode_field – name of the output barcode annotation.

  • primer_field – name of the output primer annotation.

  • delimiter – a tuple of delimiters for (annotations, field/values, value lists).

Returns

masked sequence.

Return type

Bio.SeqRecord.SeqRecord

presto.Sequence.meanQuality(qual, prob=(1.0, 0.7943282347242815, 0.6309573444801932, 0.5011872336272722, 0.3981071705534972, 0.31622776601683794, 0.251188643150958, 0.19952623149688797, 0.15848931924611134, 0.12589254117941673, 0.1, 0.07943282347242814, 0.06309573444801933, 0.05011872336272722, 0.039810717055349734, 0.03162277660168379, 0.025118864315095794, 0.0199526231496888, 0.015848931924611134, 0.012589254117941675, 0.01, 0.007943282347242814, 0.00630957344480193, 0.005011872336272725, 0.003981071705534973, 0.0031622776601683794, 0.0025118864315095794, 0.001995262314968879, 0.001584893192461114, 0.0012589254117941675, 0.001, 0.0007943282347242813, 0.000630957344480193, 0.0005011872336272725, 0.00039810717055349735, 0.00031622776601683794, 0.00025118864315095795, 0.00019952623149688788, 0.00015848931924611142, 0.00012589254117941674, 0.0001, 7.943282347242822e-05, 6.309573444801929e-05, 5.011872336272725e-05, 3.9810717055349695e-05, 3.1622776601683795e-05, 2.5118864315095822e-05, 1.9952623149688786e-05, 1.584893192461114e-05, 1.2589254117941661e-05, 1e-05, 7.943282347242822e-06, 6.30957344480193e-06, 5.011872336272725e-06, 3.981071705534969e-06, 3.162277660168379e-06, 2.5118864315095823e-06, 1.9952623149688787e-06, 1.584893192461114e-06, 1.2589254117941661e-06, 1e-06, 7.943282347242822e-07, 6.30957344480193e-07, 5.011872336272725e-07, 3.981071705534969e-07, 3.162277660168379e-07, 2.5118864315095823e-07, 1.9952623149688787e-07, 1.584893192461114e-07, 1.2589254117941662e-07, 1e-07, 7.943282347242822e-08, 6.30957344480193e-08, 5.011872336272725e-08, 3.981071705534969e-08, 3.162277660168379e-08, 2.511886431509582e-08, 1.9952623149688786e-08, 1.5848931924611143e-08, 1.2589254117941661e-08, 1e-08, 7.943282347242822e-09, 6.309573444801943e-09, 5.011872336272715e-09, 3.981071705534969e-09, 3.1622776601683795e-09, 2.511886431509582e-09, 1.9952623149688828e-09, 1.584893192461111e-09, 1.2589254117941663e-09, 1e-09, 7.943282347242822e-10, 6.309573444801942e-10, 5.011872336272714e-10, 3.9810717055349694e-10, 3.1622776601683795e-10, 2.511886431509582e-10, 1.9952623149688828e-10, 1.584893192461111e-10, 1.2589254117941662e-10, 1e-10, 7.943282347242822e-11, 6.309573444801942e-11, 5.011872336272715e-11, 3.9810717055349695e-11, 3.1622776601683794e-11, 2.5118864315095823e-11, 1.9952623149688828e-11, 1.5848931924611107e-11, 1.2589254117941662e-11, 1e-11, 7.943282347242821e-12, 6.309573444801943e-12, 5.011872336272715e-12, 3.9810717055349695e-12, 3.1622776601683794e-12, 2.5118864315095823e-12, 1.9952623149688827e-12, 1.584893192461111e-12, 1.258925411794166e-12, 1e-12, 7.943282347242822e-13, 6.309573444801942e-13, 5.011872336272715e-13, 3.981071705534969e-13, 3.162277660168379e-13, 2.511886431509582e-13, 1.9952623149688827e-13))

Calculate mean quality score

Parameters
  • qual (list) – numeric Phred quality scores.

  • prob (list) – mapping of Phred score (index) to probability values

Returns

floor of the mean Phred quality score.

Return type

int

presto.Sequence.overlapConsensus(head_seq, tail_seq, ignore_chars={'-', '.', 'N', 'n'})

Creates a consensus overlap sequences from two segments

Parameters
  • head_seq – the overlap head SeqRecord.

  • tail_seq – the overlap tail SeqRecord.

  • ignore_chars – list of characters which do not contribute to consensus.

Returns

A SeqRecord object with consensus characters and quality scores.

Return type

SeqRecord

presto.Sequence.qualityConsensus(seq_list, min_qual=0, min_freq=0.6, dependent=False, ignore_chars={'-', '.', 'N', 'n'})

Builds a consensus sequence from a set of sequences

Parameters
  • seq_list – List of SeqRecord objects

  • min_qual – Quality cutoff to assign a base

  • min_freq – Frequency cutoff to assign a base

  • dependent – If False assume sequences are independent for quality calculation

  • ignore_chars – Set of characters to exclude when building a consensus sequence

Returns

Consensus SeqRecord object

Return type

SeqRecord

presto.Sequence.referenceAssembly(head_seq, tail_seq, ref_dict, ref_db, min_ident=0.5, evalue=1e-05, max_hits=100, fill=False, aligner='usearch', aligner_exec='usearch', score_dict={('-', '-'): 0, ('-', '.'): 0, ('-', 'A'): 0, ('-', 'B'): 0, ('-', 'C'): 0, ('-', 'D'): 0, ('-', 'G'): 0, ('-', 'H'): 0, ('-', 'K'): 0, ('-', 'M'): 0, ('-', 'N'): 0, ('-', 'R'): 0, ('-', 'S'): 0, ('-', 'T'): 0, ('-', 'U'): 0, ('-', 'V'): 0, ('-', 'W'): 0, ('-', 'Y'): 0, ('.', '-'): 0, ('.', '.'): 0, ('.', 'A'): 0, ('.', 'B'): 0, ('.', 'C'): 0, ('.', 'D'): 0, ('.', 'G'): 0, ('.', 'H'): 0, ('.', 'K'): 0, ('.', 'M'): 0, ('.', 'N'): 0, ('.', 'R'): 0, ('.', 'S'): 0, ('.', 'T'): 0, ('.', 'U'): 0, ('.', 'V'): 0, ('.', 'W'): 0, ('.', 'Y'): 0, ('A', '-'): 0, ('A', '.'): 0, ('A', 'A'): 1, ('A', 'B'): 0, ('A', 'C'): 0, ('A', 'D'): 1, ('A', 'G'): 0, ('A', 'H'): 1, ('A', 'K'): 0, ('A', 'M'): 1, ('A', 'N'): 1, ('A', 'R'): 1, ('A', 'S'): 0, ('A', 'T'): 0, ('A', 'U'): 0, ('A', 'V'): 1, ('A', 'W'): 1, ('A', 'Y'): 0, ('B', '-'): 0, ('B', '.'): 0, ('B', 'A'): 0, ('B', 'B'): 1, ('B', 'C'): 1, ('B', 'D'): 1, ('B', 'G'): 1, ('B', 'H'): 1, ('B', 'K'): 1, ('B', 'M'): 1, ('B', 'N'): 1, ('B', 'R'): 1, ('B', 'S'): 1, ('B', 'T'): 1, ('B', 'U'): 1, ('B', 'V'): 1, ('B', 'W'): 1, ('B', 'Y'): 1, ('C', '-'): 0, ('C', '.'): 0, ('C', 'A'): 0, ('C', 'B'): 1, ('C', 'C'): 1, ('C', 'D'): 0, ('C', 'G'): 0, ('C', 'H'): 1, ('C', 'K'): 0, ('C', 'M'): 1, ('C', 'N'): 1, ('C', 'R'): 0, ('C', 'S'): 1, ('C', 'T'): 0, ('C', 'U'): 0, ('C', 'V'): 1, ('C', 'W'): 0, ('C', 'Y'): 1, ('D', '-'): 0, ('D', '.'): 0, ('D', 'A'): 1, ('D', 'B'): 1, ('D', 'C'): 0, ('D', 'D'): 1, ('D', 'G'): 1, ('D', 'H'): 1, ('D', 'K'): 1, ('D', 'M'): 1, ('D', 'N'): 1, ('D', 'R'): 1, ('D', 'S'): 1, ('D', 'T'): 1, ('D', 'U'): 1, ('D', 'V'): 1, ('D', 'W'): 1, ('D', 'Y'): 1, ('G', '-'): 0, ('G', '.'): 0, ('G', 'A'): 0, ('G', 'B'): 1, ('G', 'C'): 0, ('G', 'D'): 1, ('G', 'G'): 1, ('G', 'H'): 0, ('G', 'K'): 1, ('G', 'M'): 0, ('G', 'N'): 1, ('G', 'R'): 1, ('G', 'S'): 1, ('G', 'T'): 0, ('G', 'U'): 0, ('G', 'V'): 1, ('G', 'W'): 0, ('G', 'Y'): 0, ('H', '-'): 0, ('H', '.'): 0, ('H', 'A'): 1, ('H', 'B'): 1, ('H', 'C'): 1, ('H', 'D'): 1, ('H', 'G'): 0, ('H', 'H'): 1, ('H', 'K'): 1, ('H', 'M'): 1, ('H', 'N'): 1, ('H', 'R'): 1, ('H', 'S'): 1, ('H', 'T'): 1, ('H', 'U'): 1, ('H', 'V'): 1, ('H', 'W'): 1, ('H', 'Y'): 1, ('K', '-'): 0, ('K', '.'): 0, ('K', 'A'): 0, ('K', 'B'): 1, ('K', 'C'): 0, ('K', 'D'): 1, ('K', 'G'): 1, ('K', 'H'): 1, ('K', 'K'): 1, ('K', 'M'): 0, ('K', 'N'): 1, ('K', 'R'): 1, ('K', 'S'): 1, ('K', 'T'): 1, ('K', 'U'): 1, ('K', 'V'): 1, ('K', 'W'): 1, ('K', 'Y'): 1, ('M', '-'): 0, ('M', '.'): 0, ('M', 'A'): 1, ('M', 'B'): 1, ('M', 'C'): 1, ('M', 'D'): 1, ('M', 'G'): 0, ('M', 'H'): 1, ('M', 'K'): 0, ('M', 'M'): 1, ('M', 'N'): 1, ('M', 'R'): 1, ('M', 'S'): 1, ('M', 'T'): 0, ('M', 'U'): 0, ('M', 'V'): 1, ('M', 'W'): 1, ('M', 'Y'): 1, ('N', '-'): 1, ('N', '.'): 1, ('N', 'A'): 1, ('N', 'B'): 1, ('N', 'C'): 1, ('N', 'D'): 1, ('N', 'G'): 1, ('N', 'H'): 1, ('N', 'K'): 1, ('N', 'M'): 1, ('N', 'N'): 1, ('N', 'R'): 1, ('N', 'S'): 1, ('N', 'T'): 1, ('N', 'U'): 1, ('N', 'V'): 1, ('N', 'W'): 1, ('N', 'Y'): 1, ('R', '-'): 0, ('R', '.'): 0, ('R', 'A'): 1, ('R', 'B'): 1, ('R', 'C'): 0, ('R', 'D'): 1, ('R', 'G'): 1, ('R', 'H'): 1, ('R', 'K'): 1, ('R', 'M'): 1, ('R', 'N'): 1, ('R', 'R'): 1, ('R', 'S'): 1, ('R', 'T'): 0, ('R', 'U'): 0, ('R', 'V'): 1, ('R', 'W'): 1, ('R', 'Y'): 0, ('S', '-'): 0, ('S', '.'): 0, ('S', 'A'): 0, ('S', 'B'): 1, ('S', 'C'): 1, ('S', 'D'): 1, ('S', 'G'): 1, ('S', 'H'): 1, ('S', 'K'): 1, ('S', 'M'): 1, ('S', 'N'): 1, ('S', 'R'): 1, ('S', 'S'): 1, ('S', 'T'): 0, ('S', 'U'): 0, ('S', 'V'): 1, ('S', 'W'): 0, ('S', 'Y'): 1, ('T', '-'): 0, ('T', '.'): 0, ('T', 'A'): 0, ('T', 'B'): 1, ('T', 'C'): 0, ('T', 'D'): 1, ('T', 'G'): 0, ('T', 'H'): 1, ('T', 'K'): 1, ('T', 'M'): 0, ('T', 'N'): 1, ('T', 'R'): 0, ('T', 'S'): 0, ('T', 'T'): 1, ('T', 'U'): 1, ('T', 'V'): 0, ('T', 'W'): 1, ('T', 'Y'): 1, ('U', '-'): 0, ('U', '.'): 0, ('U', 'A'): 0, ('U', 'B'): 1, ('U', 'C'): 0, ('U', 'D'): 1, ('U', 'G'): 0, ('U', 'H'): 1, ('U', 'K'): 1, ('U', 'M'): 0, ('U', 'N'): 1, ('U', 'R'): 0, ('U', 'S'): 0, ('U', 'T'): 1, ('U', 'U'): 1, ('U', 'V'): 0, ('U', 'W'): 1, ('U', 'Y'): 1, ('V', '-'): 0, ('V', '.'): 0, ('V', 'A'): 1, ('V', 'B'): 1, ('V', 'C'): 1, ('V', 'D'): 1, ('V', 'G'): 1, ('V', 'H'): 1, ('V', 'K'): 1, ('V', 'M'): 1, ('V', 'N'): 1, ('V', 'R'): 1, ('V', 'S'): 1, ('V', 'T'): 0, ('V', 'U'): 0, ('V', 'V'): 1, ('V', 'W'): 1, ('V', 'Y'): 1, ('W', '-'): 0, ('W', '.'): 0, ('W', 'A'): 1, ('W', 'B'): 1, ('W', 'C'): 0, ('W', 'D'): 1, ('W', 'G'): 0, ('W', 'H'): 1, ('W', 'K'): 1, ('W', 'M'): 1, ('W', 'N'): 1, ('W', 'R'): 1, ('W', 'S'): 0, ('W', 'T'): 1, ('W', 'U'): 1, ('W', 'V'): 1, ('W', 'W'): 1, ('W', 'Y'): 1, ('Y', '-'): 0, ('Y', '.'): 0, ('Y', 'A'): 0, ('Y', 'B'): 1, ('Y', 'C'): 1, ('Y', 'D'): 1, ('Y', 'G'): 0, ('Y', 'H'): 1, ('Y', 'K'): 1, ('Y', 'M'): 1, ('Y', 'N'): 1, ('Y', 'R'): 0, ('Y', 'S'): 1, ('Y', 'T'): 1, ('Y', 'U'): 1, ('Y', 'V'): 1, ('Y', 'W'): 1, ('Y', 'Y'): 1})

Stitches two sequences together by aligning against a reference database

Parameters
  • head_seq – the head SeqRecord.

  • head_seq – the tail SeqRecord.

  • ref_dict – a dictionary of reference SeqRecord objects.

  • ref_db – the path and name of the reference database.

  • min_ident – the minimum identity for a valid assembly.

  • evalue – the E-value cut-off for ublast.

  • max_hits – the maxhits output limit for ublast.

  • fill – if False non-overlapping regions will be assigned Ns; if True non-overlapping regions will be filled with the reference sequence.

  • aligner – the alignment tool; one of ‘blastn’ or ‘usearch’.

  • aligner_exec – the path to the alignment tool executable.

  • score_dict – optional dictionary of character scores in the form {(char1, char2): score}.

Returns

assembled sequence object.

Return type

AssemblyRecord

presto.Sequence.reverseComplement(seq)

Takes the reverse complement of a sequence

Parameters

seq – a SeqRecord object, Seq object or string to reverse complement

Returns

Object of the same type as the input with the reverse complement sequence

Return type

Seq

presto.Sequence.scoreAA(a, b, mask_score=None, gap_score=None)

Returns the score for a pair of IUPAC Extended Protein characters

Parameters
  • a – First character

  • b – Second character

  • mask_score – Tuple of length two defining scores for all matches against an X character for (a, b), with the score for character (a) taking precedence; if None score symmetrically according to IUPAC character identity

  • gap_score – Tuple of length two defining score for all matches against a gap (-, .) character for (a, b), with the score for character (a) taking precedence; if None score symmetrically according to IUPAC character identity

Returns

Score for the character pair

Return type

int

presto.Sequence.scoreAlignment(seq_record, primers, start=0, rev_primer=False, score_dict={('-', '-'): 0, ('-', '.'): 0, ('-', 'A'): 0, ('-', 'B'): 0, ('-', 'C'): 0, ('-', 'D'): 0, ('-', 'G'): 0, ('-', 'H'): 0, ('-', 'K'): 0, ('-', 'M'): 0, ('-', 'N'): 0, ('-', 'R'): 0, ('-', 'S'): 0, ('-', 'T'): 0, ('-', 'U'): 0, ('-', 'V'): 0, ('-', 'W'): 0, ('-', 'Y'): 0, ('.', '-'): 0, ('.', '.'): 0, ('.', 'A'): 0, ('.', 'B'): 0, ('.', 'C'): 0, ('.', 'D'): 0, ('.', 'G'): 0, ('.', 'H'): 0, ('.', 'K'): 0, ('.', 'M'): 0, ('.', 'N'): 0, ('.', 'R'): 0, ('.', 'S'): 0, ('.', 'T'): 0, ('.', 'U'): 0, ('.', 'V'): 0, ('.', 'W'): 0, ('.', 'Y'): 0, ('A', '-'): 0, ('A', '.'): 0, ('A', 'A'): 1, ('A', 'B'): 0, ('A', 'C'): 0, ('A', 'D'): 1, ('A', 'G'): 0, ('A', 'H'): 1, ('A', 'K'): 0, ('A', 'M'): 1, ('A', 'N'): 1, ('A', 'R'): 1, ('A', 'S'): 0, ('A', 'T'): 0, ('A', 'U'): 0, ('A', 'V'): 1, ('A', 'W'): 1, ('A', 'Y'): 0, ('B', '-'): 0, ('B', '.'): 0, ('B', 'A'): 0, ('B', 'B'): 1, ('B', 'C'): 1, ('B', 'D'): 1, ('B', 'G'): 1, ('B', 'H'): 1, ('B', 'K'): 1, ('B', 'M'): 1, ('B', 'N'): 1, ('B', 'R'): 1, ('B', 'S'): 1, ('B', 'T'): 1, ('B', 'U'): 1, ('B', 'V'): 1, ('B', 'W'): 1, ('B', 'Y'): 1, ('C', '-'): 0, ('C', '.'): 0, ('C', 'A'): 0, ('C', 'B'): 1, ('C', 'C'): 1, ('C', 'D'): 0, ('C', 'G'): 0, ('C', 'H'): 1, ('C', 'K'): 0, ('C', 'M'): 1, ('C', 'N'): 1, ('C', 'R'): 0, ('C', 'S'): 1, ('C', 'T'): 0, ('C', 'U'): 0, ('C', 'V'): 1, ('C', 'W'): 0, ('C', 'Y'): 1, ('D', '-'): 0, ('D', '.'): 0, ('D', 'A'): 1, ('D', 'B'): 1, ('D', 'C'): 0, ('D', 'D'): 1, ('D', 'G'): 1, ('D', 'H'): 1, ('D', 'K'): 1, ('D', 'M'): 1, ('D', 'N'): 1, ('D', 'R'): 1, ('D', 'S'): 1, ('D', 'T'): 1, ('D', 'U'): 1, ('D', 'V'): 1, ('D', 'W'): 1, ('D', 'Y'): 1, ('G', '-'): 0, ('G', '.'): 0, ('G', 'A'): 0, ('G', 'B'): 1, ('G', 'C'): 0, ('G', 'D'): 1, ('G', 'G'): 1, ('G', 'H'): 0, ('G', 'K'): 1, ('G', 'M'): 0, ('G', 'N'): 1, ('G', 'R'): 1, ('G', 'S'): 1, ('G', 'T'): 0, ('G', 'U'): 0, ('G', 'V'): 1, ('G', 'W'): 0, ('G', 'Y'): 0, ('H', '-'): 0, ('H', '.'): 0, ('H', 'A'): 1, ('H', 'B'): 1, ('H', 'C'): 1, ('H', 'D'): 1, ('H', 'G'): 0, ('H', 'H'): 1, ('H', 'K'): 1, ('H', 'M'): 1, ('H', 'N'): 1, ('H', 'R'): 1, ('H', 'S'): 1, ('H', 'T'): 1, ('H', 'U'): 1, ('H', 'V'): 1, ('H', 'W'): 1, ('H', 'Y'): 1, ('K', '-'): 0, ('K', '.'): 0, ('K', 'A'): 0, ('K', 'B'): 1, ('K', 'C'): 0, ('K', 'D'): 1, ('K', 'G'): 1, ('K', 'H'): 1, ('K', 'K'): 1, ('K', 'M'): 0, ('K', 'N'): 1, ('K', 'R'): 1, ('K', 'S'): 1, ('K', 'T'): 1, ('K', 'U'): 1, ('K', 'V'): 1, ('K', 'W'): 1, ('K', 'Y'): 1, ('M', '-'): 0, ('M', '.'): 0, ('M', 'A'): 1, ('M', 'B'): 1, ('M', 'C'): 1, ('M', 'D'): 1, ('M', 'G'): 0, ('M', 'H'): 1, ('M', 'K'): 0, ('M', 'M'): 1, ('M', 'N'): 1, ('M', 'R'): 1, ('M', 'S'): 1, ('M', 'T'): 0, ('M', 'U'): 0, ('M', 'V'): 1, ('M', 'W'): 1, ('M', 'Y'): 1, ('N', '-'): 0, ('N', '.'): 0, ('N', 'A'): 0, ('N', 'B'): 0, ('N', 'C'): 0, ('N', 'D'): 0, ('N', 'G'): 0, ('N', 'H'): 0, ('N', 'K'): 0, ('N', 'M'): 0, ('N', 'N'): 0, ('N', 'R'): 0, ('N', 'S'): 0, ('N', 'T'): 0, ('N', 'U'): 0, ('N', 'V'): 0, ('N', 'W'): 0, ('N', 'Y'): 0, ('R', '-'): 0, ('R', '.'): 0, ('R', 'A'): 1, ('R', 'B'): 1, ('R', 'C'): 0, ('R', 'D'): 1, ('R', 'G'): 1, ('R', 'H'): 1, ('R', 'K'): 1, ('R', 'M'): 1, ('R', 'N'): 1, ('R', 'R'): 1, ('R', 'S'): 1, ('R', 'T'): 0, ('R', 'U'): 0, ('R', 'V'): 1, ('R', 'W'): 1, ('R', 'Y'): 0, ('S', '-'): 0, ('S', '.'): 0, ('S', 'A'): 0, ('S', 'B'): 1, ('S', 'C'): 1, ('S', 'D'): 1, ('S', 'G'): 1, ('S', 'H'): 1, ('S', 'K'): 1, ('S', 'M'): 1, ('S', 'N'): 1, ('S', 'R'): 1, ('S', 'S'): 1, ('S', 'T'): 0, ('S', 'U'): 0, ('S', 'V'): 1, ('S', 'W'): 0, ('S', 'Y'): 1, ('T', '-'): 0, ('T', '.'): 0, ('T', 'A'): 0, ('T', 'B'): 1, ('T', 'C'): 0, ('T', 'D'): 1, ('T', 'G'): 0, ('T', 'H'): 1, ('T', 'K'): 1, ('T', 'M'): 0, ('T', 'N'): 1, ('T', 'R'): 0, ('T', 'S'): 0, ('T', 'T'): 1, ('T', 'U'): 1, ('T', 'V'): 0, ('T', 'W'): 1, ('T', 'Y'): 1, ('U', '-'): 0, ('U', '.'): 0, ('U', 'A'): 0, ('U', 'B'): 1, ('U', 'C'): 0, ('U', 'D'): 1, ('U', 'G'): 0, ('U', 'H'): 1, ('U', 'K'): 1, ('U', 'M'): 0, ('U', 'N'): 1, ('U', 'R'): 0, ('U', 'S'): 0, ('U', 'T'): 1, ('U', 'U'): 1, ('U', 'V'): 0, ('U', 'W'): 1, ('U', 'Y'): 1, ('V', '-'): 0, ('V', '.'): 0, ('V', 'A'): 1, ('V', 'B'): 1, ('V', 'C'): 1, ('V', 'D'): 1, ('V', 'G'): 1, ('V', 'H'): 1, ('V', 'K'): 1, ('V', 'M'): 1, ('V', 'N'): 1, ('V', 'R'): 1, ('V', 'S'): 1, ('V', 'T'): 0, ('V', 'U'): 0, ('V', 'V'): 1, ('V', 'W'): 1, ('V', 'Y'): 1, ('W', '-'): 0, ('W', '.'): 0, ('W', 'A'): 1, ('W', 'B'): 1, ('W', 'C'): 0, ('W', 'D'): 1, ('W', 'G'): 0, ('W', 'H'): 1, ('W', 'K'): 1, ('W', 'M'): 1, ('W', 'N'): 1, ('W', 'R'): 1, ('W', 'S'): 0, ('W', 'T'): 1, ('W', 'U'): 1, ('W', 'V'): 1, ('W', 'W'): 1, ('W', 'Y'): 1, ('Y', '-'): 0, ('Y', '.'): 0, ('Y', 'A'): 0, ('Y', 'B'): 1, ('Y', 'C'): 1, ('Y', 'D'): 1, ('Y', 'G'): 0, ('Y', 'H'): 1, ('Y', 'K'): 1, ('Y', 'M'): 1, ('Y', 'N'): 1, ('Y', 'R'): 0, ('Y', 'S'): 1, ('Y', 'T'): 1, ('Y', 'U'): 1, ('Y', 'V'): 1, ('Y', 'W'): 1, ('Y', 'Y'): 1})

Performs a simple fixed position alignment of primers

Parameters
  • seq_record – a SeqRecord object to align primers against

  • primers – dictionary of {names: short IUPAC ambiguous sequence strings}

  • start – position where primer alignment starts

  • rev_primer – if True align with the tail end of the sequence

  • score_dict – optional dictionary of {(char1, char2): score} alignment scores

Returns

primer alignment result object

Return type

presto.Sequence.PrimerAlignment

presto.Sequence.scoreDNA(a, b, mask_score=None, gap_score=None)

Returns the score for a pair of IUPAC ambiguous nucleotide characters

Parameters
  • a – First character

  • b – Second character

  • n_score – Tuple of length two defining scores for all matches against an N character for (a, b), with the score for character (a) taking precedence; if None score symmetrically according to IUPAC character identity

  • gap_score – Tuple of length two defining score for all matches against a gap (-, .) character for (a, b), with the score for character (a) taking precedence; if None score symmetrically according to IUPAC character identity

Returns

Score for the character pair

Return type

int

presto.Sequence.scoreSeqPair(seq1, seq2, ignore_chars={}, score_dict={('-', '-'): 1, ('-', '.'): 1, ('-', 'A'): 0, ('-', 'B'): 0, ('-', 'C'): 0, ('-', 'D'): 0, ('-', 'G'): 0, ('-', 'H'): 0, ('-', 'K'): 0, ('-', 'M'): 0, ('-', 'N'): 0, ('-', 'R'): 0, ('-', 'S'): 0, ('-', 'T'): 0, ('-', 'U'): 0, ('-', 'V'): 0, ('-', 'W'): 0, ('-', 'Y'): 0, ('.', '-'): 1, ('.', '.'): 1, ('.', 'A'): 0, ('.', 'B'): 0, ('.', 'C'): 0, ('.', 'D'): 0, ('.', 'G'): 0, ('.', 'H'): 0, ('.', 'K'): 0, ('.', 'M'): 0, ('.', 'N'): 0, ('.', 'R'): 0, ('.', 'S'): 0, ('.', 'T'): 0, ('.', 'U'): 0, ('.', 'V'): 0, ('.', 'W'): 0, ('.', 'Y'): 0, ('A', '-'): 0, ('A', '.'): 0, ('A', 'A'): 1, ('A', 'B'): 0, ('A', 'C'): 0, ('A', 'D'): 1, ('A', 'G'): 0, ('A', 'H'): 1, ('A', 'K'): 0, ('A', 'M'): 1, ('A', 'N'): 1, ('A', 'R'): 1, ('A', 'S'): 0, ('A', 'T'): 0, ('A', 'U'): 0, ('A', 'V'): 1, ('A', 'W'): 1, ('A', 'Y'): 0, ('B', '-'): 0, ('B', '.'): 0, ('B', 'A'): 0, ('B', 'B'): 1, ('B', 'C'): 1, ('B', 'D'): 1, ('B', 'G'): 1, ('B', 'H'): 1, ('B', 'K'): 1, ('B', 'M'): 1, ('B', 'N'): 1, ('B', 'R'): 1, ('B', 'S'): 1, ('B', 'T'): 1, ('B', 'U'): 1, ('B', 'V'): 1, ('B', 'W'): 1, ('B', 'Y'): 1, ('C', '-'): 0, ('C', '.'): 0, ('C', 'A'): 0, ('C', 'B'): 1, ('C', 'C'): 1, ('C', 'D'): 0, ('C', 'G'): 0, ('C', 'H'): 1, ('C', 'K'): 0, ('C', 'M'): 1, ('C', 'N'): 1, ('C', 'R'): 0, ('C', 'S'): 1, ('C', 'T'): 0, ('C', 'U'): 0, ('C', 'V'): 1, ('C', 'W'): 0, ('C', 'Y'): 1, ('D', '-'): 0, ('D', '.'): 0, ('D', 'A'): 1, ('D', 'B'): 1, ('D', 'C'): 0, ('D', 'D'): 1, ('D', 'G'): 1, ('D', 'H'): 1, ('D', 'K'): 1, ('D', 'M'): 1, ('D', 'N'): 1, ('D', 'R'): 1, ('D', 'S'): 1, ('D', 'T'): 1, ('D', 'U'): 1, ('D', 'V'): 1, ('D', 'W'): 1, ('D', 'Y'): 1, ('G', '-'): 0, ('G', '.'): 0, ('G', 'A'): 0, ('G', 'B'): 1, ('G', 'C'): 0, ('G', 'D'): 1, ('G', 'G'): 1, ('G', 'H'): 0, ('G', 'K'): 1, ('G', 'M'): 0, ('G', 'N'): 1, ('G', 'R'): 1, ('G', 'S'): 1, ('G', 'T'): 0, ('G', 'U'): 0, ('G', 'V'): 1, ('G', 'W'): 0, ('G', 'Y'): 0, ('H', '-'): 0, ('H', '.'): 0, ('H', 'A'): 1, ('H', 'B'): 1, ('H', 'C'): 1, ('H', 'D'): 1, ('H', 'G'): 0, ('H', 'H'): 1, ('H', 'K'): 1, ('H', 'M'): 1, ('H', 'N'): 1, ('H', 'R'): 1, ('H', 'S'): 1, ('H', 'T'): 1, ('H', 'U'): 1, ('H', 'V'): 1, ('H', 'W'): 1, ('H', 'Y'): 1, ('K', '-'): 0, ('K', '.'): 0, ('K', 'A'): 0, ('K', 'B'): 1, ('K', 'C'): 0, ('K', 'D'): 1, ('K', 'G'): 1, ('K', 'H'): 1, ('K', 'K'): 1, ('K', 'M'): 0, ('K', 'N'): 1, ('K', 'R'): 1, ('K', 'S'): 1, ('K', 'T'): 1, ('K', 'U'): 1, ('K', 'V'): 1, ('K', 'W'): 1, ('K', 'Y'): 1, ('M', '-'): 0, ('M', '.'): 0, ('M', 'A'): 1, ('M', 'B'): 1, ('M', 'C'): 1, ('M', 'D'): 1, ('M', 'G'): 0, ('M', 'H'): 1, ('M', 'K'): 0, ('M', 'M'): 1, ('M', 'N'): 1, ('M', 'R'): 1, ('M', 'S'): 1, ('M', 'T'): 0, ('M', 'U'): 0, ('M', 'V'): 1, ('M', 'W'): 1, ('M', 'Y'): 1, ('N', '-'): 0, ('N', '.'): 0, ('N', 'A'): 1, ('N', 'B'): 1, ('N', 'C'): 1, ('N', 'D'): 1, ('N', 'G'): 1, ('N', 'H'): 1, ('N', 'K'): 1, ('N', 'M'): 1, ('N', 'N'): 1, ('N', 'R'): 1, ('N', 'S'): 1, ('N', 'T'): 1, ('N', 'U'): 1, ('N', 'V'): 1, ('N', 'W'): 1, ('N', 'Y'): 1, ('R', '-'): 0, ('R', '.'): 0, ('R', 'A'): 1, ('R', 'B'): 1, ('R', 'C'): 0, ('R', 'D'): 1, ('R', 'G'): 1, ('R', 'H'): 1, ('R', 'K'): 1, ('R', 'M'): 1, ('R', 'N'): 1, ('R', 'R'): 1, ('R', 'S'): 1, ('R', 'T'): 0, ('R', 'U'): 0, ('R', 'V'): 1, ('R', 'W'): 1, ('R', 'Y'): 0, ('S', '-'): 0, ('S', '.'): 0, ('S', 'A'): 0, ('S', 'B'): 1, ('S', 'C'): 1, ('S', 'D'): 1, ('S', 'G'): 1, ('S', 'H'): 1, ('S', 'K'): 1, ('S', 'M'): 1, ('S', 'N'): 1, ('S', 'R'): 1, ('S', 'S'): 1, ('S', 'T'): 0, ('S', 'U'): 0, ('S', 'V'): 1, ('S', 'W'): 0, ('S', 'Y'): 1, ('T', '-'): 0, ('T', '.'): 0, ('T', 'A'): 0, ('T', 'B'): 1, ('T', 'C'): 0, ('T', 'D'): 1, ('T', 'G'): 0, ('T', 'H'): 1, ('T', 'K'): 1, ('T', 'M'): 0, ('T', 'N'): 1, ('T', 'R'): 0, ('T', 'S'): 0, ('T', 'T'): 1, ('T', 'U'): 1, ('T', 'V'): 0, ('T', 'W'): 1, ('T', 'Y'): 1, ('U', '-'): 0, ('U', '.'): 0, ('U', 'A'): 0, ('U', 'B'): 1, ('U', 'C'): 0, ('U', 'D'): 1, ('U', 'G'): 0, ('U', 'H'): 1, ('U', 'K'): 1, ('U', 'M'): 0, ('U', 'N'): 1, ('U', 'R'): 0, ('U', 'S'): 0, ('U', 'T'): 1, ('U', 'U'): 1, ('U', 'V'): 0, ('U', 'W'): 1, ('U', 'Y'): 1, ('V', '-'): 0, ('V', '.'): 0, ('V', 'A'): 1, ('V', 'B'): 1, ('V', 'C'): 1, ('V', 'D'): 1, ('V', 'G'): 1, ('V', 'H'): 1, ('V', 'K'): 1, ('V', 'M'): 1, ('V', 'N'): 1, ('V', 'R'): 1, ('V', 'S'): 1, ('V', 'T'): 0, ('V', 'U'): 0, ('V', 'V'): 1, ('V', 'W'): 1, ('V', 'Y'): 1, ('W', '-'): 0, ('W', '.'): 0, ('W', 'A'): 1, ('W', 'B'): 1, ('W', 'C'): 0, ('W', 'D'): 1, ('W', 'G'): 0, ('W', 'H'): 1, ('W', 'K'): 1, ('W', 'M'): 1, ('W', 'N'): 1, ('W', 'R'): 1, ('W', 'S'): 0, ('W', 'T'): 1, ('W', 'U'): 1, ('W', 'V'): 1, ('W', 'W'): 1, ('W', 'Y'): 1, ('Y', '-'): 0, ('Y', '.'): 0, ('Y', 'A'): 0, ('Y', 'B'): 1, ('Y', 'C'): 1, ('Y', 'D'): 1, ('Y', 'G'): 0, ('Y', 'H'): 1, ('Y', 'K'): 1, ('Y', 'M'): 1, ('Y', 'N'): 1, ('Y', 'R'): 0, ('Y', 'S'): 1, ('Y', 'T'): 1, ('Y', 'U'): 1, ('Y', 'V'): 1, ('Y', 'W'): 1, ('Y', 'Y'): 1})

Determine the error rate for a pair of sequences

Parameters
  • seq1 – SeqRecord object

  • seq2 – SeqRecord object

  • ignore_chars – Set of characters to ignore when scoring and counting the weight

  • score_dict – Optional dictionary of alignment scores

Returns

Tuple of the (score, minimum weight, error rate) for the pair of sequences

Return type

Tuple

presto.Sequence.sequentialAssembly(head_seq, tail_seq, ref_dict, ref_db, alpha=1e-05, max_error=0.3, min_len=8, max_len=1000, scan_reverse=False, min_ident=0.5, evalue=1e-05, max_hits=100, fill=False, aligner='usearch', aligner_exec='usearch', assembly_stats=None, score_dict={('-', '-'): 0, ('-', '.'): 0, ('-', 'A'): 0, ('-', 'B'): 0, ('-', 'C'): 0, ('-', 'D'): 0, ('-', 'G'): 0, ('-', 'H'): 0, ('-', 'K'): 0, ('-', 'M'): 0, ('-', 'N'): 0, ('-', 'R'): 0, ('-', 'S'): 0, ('-', 'T'): 0, ('-', 'U'): 0, ('-', 'V'): 0, ('-', 'W'): 0, ('-', 'Y'): 0, ('.', '-'): 0, ('.', '.'): 0, ('.', 'A'): 0, ('.', 'B'): 0, ('.', 'C'): 0, ('.', 'D'): 0, ('.', 'G'): 0, ('.', 'H'): 0, ('.', 'K'): 0, ('.', 'M'): 0, ('.', 'N'): 0, ('.', 'R'): 0, ('.', 'S'): 0, ('.', 'T'): 0, ('.', 'U'): 0, ('.', 'V'): 0, ('.', 'W'): 0, ('.', 'Y'): 0, ('A', '-'): 0, ('A', '.'): 0, ('A', 'A'): 1, ('A', 'B'): 0, ('A', 'C'): 0, ('A', 'D'): 1, ('A', 'G'): 0, ('A', 'H'): 1, ('A', 'K'): 0, ('A', 'M'): 1, ('A', 'N'): 1, ('A', 'R'): 1, ('A', 'S'): 0, ('A', 'T'): 0, ('A', 'U'): 0, ('A', 'V'): 1, ('A', 'W'): 1, ('A', 'Y'): 0, ('B', '-'): 0, ('B', '.'): 0, ('B', 'A'): 0, ('B', 'B'): 1, ('B', 'C'): 1, ('B', 'D'): 1, ('B', 'G'): 1, ('B', 'H'): 1, ('B', 'K'): 1, ('B', 'M'): 1, ('B', 'N'): 1, ('B', 'R'): 1, ('B', 'S'): 1, ('B', 'T'): 1, ('B', 'U'): 1, ('B', 'V'): 1, ('B', 'W'): 1, ('B', 'Y'): 1, ('C', '-'): 0, ('C', '.'): 0, ('C', 'A'): 0, ('C', 'B'): 1, ('C', 'C'): 1, ('C', 'D'): 0, ('C', 'G'): 0, ('C', 'H'): 1, ('C', 'K'): 0, ('C', 'M'): 1, ('C', 'N'): 1, ('C', 'R'): 0, ('C', 'S'): 1, ('C', 'T'): 0, ('C', 'U'): 0, ('C', 'V'): 1, ('C', 'W'): 0, ('C', 'Y'): 1, ('D', '-'): 0, ('D', '.'): 0, ('D', 'A'): 1, ('D', 'B'): 1, ('D', 'C'): 0, ('D', 'D'): 1, ('D', 'G'): 1, ('D', 'H'): 1, ('D', 'K'): 1, ('D', 'M'): 1, ('D', 'N'): 1, ('D', 'R'): 1, ('D', 'S'): 1, ('D', 'T'): 1, ('D', 'U'): 1, ('D', 'V'): 1, ('D', 'W'): 1, ('D', 'Y'): 1, ('G', '-'): 0, ('G', '.'): 0, ('G', 'A'): 0, ('G', 'B'): 1, ('G', 'C'): 0, ('G', 'D'): 1, ('G', 'G'): 1, ('G', 'H'): 0, ('G', 'K'): 1, ('G', 'M'): 0, ('G', 'N'): 1, ('G', 'R'): 1, ('G', 'S'): 1, ('G', 'T'): 0, ('G', 'U'): 0, ('G', 'V'): 1, ('G', 'W'): 0, ('G', 'Y'): 0, ('H', '-'): 0, ('H', '.'): 0, ('H', 'A'): 1, ('H', 'B'): 1, ('H', 'C'): 1, ('H', 'D'): 1, ('H', 'G'): 0, ('H', 'H'): 1, ('H', 'K'): 1, ('H', 'M'): 1, ('H', 'N'): 1, ('H', 'R'): 1, ('H', 'S'): 1, ('H', 'T'): 1, ('H', 'U'): 1, ('H', 'V'): 1, ('H', 'W'): 1, ('H', 'Y'): 1, ('K', '-'): 0, ('K', '.'): 0, ('K', 'A'): 0, ('K', 'B'): 1, ('K', 'C'): 0, ('K', 'D'): 1, ('K', 'G'): 1, ('K', 'H'): 1, ('K', 'K'): 1, ('K', 'M'): 0, ('K', 'N'): 1, ('K', 'R'): 1, ('K', 'S'): 1, ('K', 'T'): 1, ('K', 'U'): 1, ('K', 'V'): 1, ('K', 'W'): 1, ('K', 'Y'): 1, ('M', '-'): 0, ('M', '.'): 0, ('M', 'A'): 1, ('M', 'B'): 1, ('M', 'C'): 1, ('M', 'D'): 1, ('M', 'G'): 0, ('M', 'H'): 1, ('M', 'K'): 0, ('M', 'M'): 1, ('M', 'N'): 1, ('M', 'R'): 1, ('M', 'S'): 1, ('M', 'T'): 0, ('M', 'U'): 0, ('M', 'V'): 1, ('M', 'W'): 1, ('M', 'Y'): 1, ('N', '-'): 1, ('N', '.'): 1, ('N', 'A'): 1, ('N', 'B'): 1, ('N', 'C'): 1, ('N', 'D'): 1, ('N', 'G'): 1, ('N', 'H'): 1, ('N', 'K'): 1, ('N', 'M'): 1, ('N', 'N'): 1, ('N', 'R'): 1, ('N', 'S'): 1, ('N', 'T'): 1, ('N', 'U'): 1, ('N', 'V'): 1, ('N', 'W'): 1, ('N', 'Y'): 1, ('R', '-'): 0, ('R', '.'): 0, ('R', 'A'): 1, ('R', 'B'): 1, ('R', 'C'): 0, ('R', 'D'): 1, ('R', 'G'): 1, ('R', 'H'): 1, ('R', 'K'): 1, ('R', 'M'): 1, ('R', 'N'): 1, ('R', 'R'): 1, ('R', 'S'): 1, ('R', 'T'): 0, ('R', 'U'): 0, ('R', 'V'): 1, ('R', 'W'): 1, ('R', 'Y'): 0, ('S', '-'): 0, ('S', '.'): 0, ('S', 'A'): 0, ('S', 'B'): 1, ('S', 'C'): 1, ('S', 'D'): 1, ('S', 'G'): 1, ('S', 'H'): 1, ('S', 'K'): 1, ('S', 'M'): 1, ('S', 'N'): 1, ('S', 'R'): 1, ('S', 'S'): 1, ('S', 'T'): 0, ('S', 'U'): 0, ('S', 'V'): 1, ('S', 'W'): 0, ('S', 'Y'): 1, ('T', '-'): 0, ('T', '.'): 0, ('T', 'A'): 0, ('T', 'B'): 1, ('T', 'C'): 0, ('T', 'D'): 1, ('T', 'G'): 0, ('T', 'H'): 1, ('T', 'K'): 1, ('T', 'M'): 0, ('T', 'N'): 1, ('T', 'R'): 0, ('T', 'S'): 0, ('T', 'T'): 1, ('T', 'U'): 1, ('T', 'V'): 0, ('T', 'W'): 1, ('T', 'Y'): 1, ('U', '-'): 0, ('U', '.'): 0, ('U', 'A'): 0, ('U', 'B'): 1, ('U', 'C'): 0, ('U', 'D'): 1, ('U', 'G'): 0, ('U', 'H'): 1, ('U', 'K'): 1, ('U', 'M'): 0, ('U', 'N'): 1, ('U', 'R'): 0, ('U', 'S'): 0, ('U', 'T'): 1, ('U', 'U'): 1, ('U', 'V'): 0, ('U', 'W'): 1, ('U', 'Y'): 1, ('V', '-'): 0, ('V', '.'): 0, ('V', 'A'): 1, ('V', 'B'): 1, ('V', 'C'): 1, ('V', 'D'): 1, ('V', 'G'): 1, ('V', 'H'): 1, ('V', 'K'): 1, ('V', 'M'): 1, ('V', 'N'): 1, ('V', 'R'): 1, ('V', 'S'): 1, ('V', 'T'): 0, ('V', 'U'): 0, ('V', 'V'): 1, ('V', 'W'): 1, ('V', 'Y'): 1, ('W', '-'): 0, ('W', '.'): 0, ('W', 'A'): 1, ('W', 'B'): 1, ('W', 'C'): 0, ('W', 'D'): 1, ('W', 'G'): 0, ('W', 'H'): 1, ('W', 'K'): 1, ('W', 'M'): 1, ('W', 'N'): 1, ('W', 'R'): 1, ('W', 'S'): 0, ('W', 'T'): 1, ('W', 'U'): 1, ('W', 'V'): 1, ('W', 'W'): 1, ('W', 'Y'): 1, ('Y', '-'): 0, ('Y', '.'): 0, ('Y', 'A'): 0, ('Y', 'B'): 1, ('Y', 'C'): 1, ('Y', 'D'): 1, ('Y', 'G'): 0, ('Y', 'H'): 1, ('Y', 'K'): 1, ('Y', 'M'): 1, ('Y', 'N'): 1, ('Y', 'R'): 0, ('Y', 'S'): 1, ('Y', 'T'): 1, ('Y', 'U'): 1, ('Y', 'V'): 1, ('Y', 'W'): 1, ('Y', 'Y'): 1})

Stitches sequences together by first attempting de novo assembly then falling back to reference guided assembly

Parameters
  • head_seq – the head SeqRecord

  • head_seq – the tail SeqRecord

  • ref_dict – a dictionary of reference SeqRecord objects

  • ref_db – the path and name of the reference database

  • alpha – the minimum p-value for a valid de novo assembly

  • max_error – the maximum error rate for a valid de novo assembly

  • min_len – minimum length of overlap to test for de novo assembly

  • max_len – maximum length of overlap to test for de novo assembly

  • scan_reverse – if True allow the head sequence to overhang the end of the tail sequence in de novo assembly if False end alignment scan at end of tail sequence or start of head sequence

  • min_ident – the minimum identity for a valid reference guided assembly

  • evalue – the E-value cut-off for reference guided assembly

  • max_hits – the maxhits output limit for reference guided assembly

  • fill – if False non-overlapping regions will be assigned Ns in reference guided assembly; if True non-overlapping regions will be filled with the reference sequence.

  • aligner – the alignment tool; one of ‘blastn’ or ‘usearch’

  • aligner_exec – the path to the alignment tool executable

  • assembly_stats – optional successes by trials numpy.array of p-values

  • score_dict – optional dictionary of character scores in the form {(char1, char2): score}.

Returns

assembled sequence object.

Return type

AssemblyRecord

presto.Sequence.subsetSeqIndex(seq_dict, field, values, delimiter=('|', '=', ','))

Subsets a sequence set by annotation value

Parameters
  • seq_dict – Dictionary index of sequences returned from SeqIO.index()

  • field – Annotation field to select keys by

  • values – List of annotation values that define the retained keys

  • delimiter – Tuple of delimiters for (annotations, field/values, value lists)

Returns

List of keys

Return type

list

presto.Sequence.subsetSeqSet(seq_iter, field, values, delimiter=('|', '=', ','))

Subsets a sequence set by annotation value

Parameters
  • seq_iter – Iterator or list of SeqRecord objects

  • field – Annotation field to select by

  • values – List of annotation values that define the retained sequences

  • delimiter – Tuple of delimiters for (annotations, field/values, value lists)

Returns

Modified list of SeqRecord objects

Return type

list

presto.Sequence.translateAmbigDNA(key)

Translates IUPAC ambiguous nucleotide characters to or from character sets

Parameters

key – String or re.search object containing the character set to translate

Returns

Character translation

Return type

str

presto.Sequence.trimQuality(data, min_qual=0, window=10, reverse=False)

Cuts sequences using a moving mean quality score

Parameters
  • data (SeqData) – a SeqData object with a single SeqRecord to process.

  • min_qual (int) – minimum mean quality to define a cut point.

  • window (int) – nucleotide window size.

  • reverse (bool) – if True cut the head of the sequence; if False cut the tail of the sequence

Returns

SeqResult object.

Return type

SeqResult

presto.Sequence.weightSeq(seq, ignore_chars={})

Returns the length of a sequencing excluding ignored characters

Parameters
  • seq – SeqRecord or Seq object

  • ignore_chars – Set of characters to ignore when counting sequence length

Returns

Sum of the character scores for the sequence

Return type

int

Roche 454 BCR mRNA with Multiplexed Samples

Overview of Experimental Data

This example uses publicly available data from:

Lineage structure of the human antibody repertoire in response to influenza vaccination.
Jiang N, He J, and Weinstein JA, et al.
Sci Transl Med. 2013. 5(171):171ra19. doi:10.1126/scitranslmed.3004794.

Which may be downloaded from the NCBI Sequence Read Archive under accession ID: SRX190717. For this example, we will use the first 50,000 sequences of sample 43 (accession: SRR765688), which may downloaded downloaded using fastq-dump from the SRA Toolkit:

fastq-dump -X 50000 SRR765688

Primer and sample barcode (referred to as MID in Jiang, He and Weinstein et al, 2013) sequences are available in the published manuscript. This example assumes that the sample barcodes, forward primers (V-region), and reverse primers (C-region) have been extracted from the manuscript and placed into three corresponding FASTA files.

Read Configuration

_images/Jiang2013_ReadConfiguration.svg

Schematic of the Roche 454 read configuration. The start of each sequences is labeled with the sample barcode. The following sequence may either be in the forward (top) orientation, proceeding 5’ to 3’ in the direction of the V(D)J reading frame, or the reverse complement orientation (bottom), proceeding in the opposite direction.

Example Data

We have hosted a small subset of the data (Accession: SRR765688) on the pRESTO website in FASTQ format, with accompanying primer and sample barcode (MID) files. The sample data set and workflow script may be downloaded from here:

Jiang, He and Weinstein et al, 2013 Example Files

Overview of the Workflow

The example that follows performs all processing steps to arrive at high-quality unique sequences using this example data set. The workflow is derived into three high level tasks:

A graphical representation of the workflow along with the corresponding sequence of pRESTO commands is shown below.

Flowchart

_images/Jiang2013_Flowchart.svg

Flowchart of processing steps. Each pRESTO tool is shown as a colored box. The workflow is divided into three primary tasks: (orange) quality control, (green) sample barcode and primer identification, (blue) deduplication and filtering of the repertoire. Grey boxes indicate the initial and final data files. The intermediate files output by each tool are not shown for the sake of brevity.

Commands

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
#!/usr/bin/env bash
FilterSeq.py length -s SRR765688.fastq -n 300 --outname S43 --log FSL.log
FilterSeq.py quality -s S43_length-pass.fastq -q 20 --outname S43 --log FSQ.log
MaskPrimers.py score -s S43_quality-pass.fastq -p SRR765688_MIDs.fasta \
    --start 0 --maxerror 0.1 --mode cut --pf MID \
    --outname S43-MID --log MPM.log
MaskPrimers.py align -s S43-MID_primers-pass.fastq -p SRX190717_VPrimers.fasta \
    --maxlen 50 --maxerror 0.3 --mode mask --pf VPRIMER \
    --outname S43-FWD --log MPV.log
MaskPrimers.py align -s S43-FWD_primers-pass.fastq -p SRX190717_CPrimers.fasta \
    --maxlen 50 --maxerror 0.3 --mode cut --pf CPRIMER --revpr --skiprc \
    --outname S43-REV --log MPC.log
CollapseSeq.py -s S43-REV_primers-pass.fastq -n 20 --inner --uf MID CPRIMER \
    --cf VPRIMER --act set --outname S43
SplitSeq.py group -s S43_collapse-unique.fastq -f DUPCOUNT --num 2 --outname S43
ParseHeaders.py table -s S43_atleast-2.fastq -f ID DUPCOUNT MID CPRIMER VPRIMER
ParseLog.py -l FSL.log -f ID LENGTH
ParseLog.py -l FSQ.log -f ID QUALITY
ParseLog.py -l MPM.log MPV.log MPC.log -f ID PRSTART PRIMER ERROR

Download Commands

Quality control

The initial stage of the workflow involves two executions of the FilterSeq.py tool. First, the length subcommand is used to filter reads which are too short to yield full V(D)J sequences using a liberal minimum length requirement of 300 bp (-n 300):

2
FilterSeq.py length -s SRR765688.fastq -n 300 --outname S43 --log FSL.log

Next, the quality subcommand removes sequences having a mean Phred quality score below 20 (-q 20).

3
FilterSeq.py quality -s S43_length-pass.fastq -q 20 --outname S43 --log FSQ.log

The ParseLog.py tool is then used to extract the results from the two FilterSeq.py log files:

17
ParseLog.py -l FSL.log -f ID LENGTH

To create two tab-delimited files containing the following results for each read:

Field

Description

ID

Sequence name

LENGTH

Sequence length

QUALITY

Quality score

Sample barcode and primer identification

Annotation of sample barcodes

Following the initial filtering steps, additional filtering is performed with three iterations of the MaskPrimers.py tool based upon the presence of recognized sample barcode (MID), forward primer, and reverse primer sequences. As the orientation and position of the sample barcode is known, the first pass through MaskPrimers uses the faster score subcommand which requires a fixed start position (--start 0) and a low allowable error rate (--maxerror 0.1). The name of the MID will be annotated in the sequence header as the MID field (--pf MID):

4
5
6
MaskPrimers.py score -s S43_quality-pass.fastq -p SRR765688_MIDs.fasta \
    --start 0 --maxerror 0.1 --mode cut --pf MID \
    --outname S43-MID --log MPM.log

Primer masking and annotation

The next MaskPrimers.py task uses the align subcommand to identify both the start position of the V-segment primer and correct the orientation of the sequence such that all reads are now oriented in the direction of the V(D)J reading frame), as determined by the orientation of the V-segment primer match. The name of the V-segment primer will be annotated in the sequence header as the VPRIMER field (--pf VPRIMER):

7
8
9
MaskPrimers.py align -s S43-MID_primers-pass.fastq -p SRX190717_VPrimers.fasta \
    --maxlen 50 --maxerror 0.3 --mode mask --pf VPRIMER \
    --outname S43-FWD --log MPV.log

The final MaskPrimers task locates the C-region primer, which is used for isotype assignment of each read. As all sequences are assumed to have been properly oriented by the second MaskPrimers task, the additional arguments MaskPrimers align --revpr and MaskPrimers align --skiprc are added to the third execution. The MaskPrimers align --revpr argument informs the tool that primers sequences should be reverse complemented prior to alignment, and that a match should be searched for (and cut from) the tail end of the sequence. The MaskPrimers align --skiprc argument tells the tool to align against only the forward sequence; meaning, it will not check primer matches against the reverse complement sequence and it will not reorient sequences. The name of the C-region primer will be annotated in the sequence header as the CPRIMER field (--pf CPRIMER):

10
11
12
MaskPrimers.py align -s S43-FWD_primers-pass.fastq -p SRX190717_CPrimers.fasta \
    --maxlen 50 --maxerror 0.3 --mode cut --pf CPRIMER --revpr --skiprc \
    --outname S43-REV --log MPC.log

At this stage, a table of primers and alignment error rates may be generated by executing ParseLog on the log file of each MaskPrimers.py tasks:

19
ParseLog.py -l MPM.log MPV.log MPC.log -f ID PRSTART PRIMER ERROR

Which will contain the following information for each log file:

Field

Description

ID

Sequence name

PRIMER

Primer or sample barcode name

ERROR

Primer match error rate

Deduplication and filtering

Removal of duplicate sequences

The final stage of the workflow involves two filtering steps to yield unique sequences for each sample barcode. First, the set of unique sequences is identified using the CollapseSeq.py tool, allowing for up to 20 interior N-valued positions (-n 20 and --inner), and requiring that all reads considered duplicated share the same isotype and sample barcode tag (--uf MID CPRIMER). Additionally, the V-region primer annotations of the set of duplicate reads are propagated into the annotation of each retained unique sequence (--cf VPRIMER and --act set):

13
14
CollapseSeq.py -s S43-REV_primers-pass.fastq -n 20 --inner --uf MID CPRIMER \
    --cf VPRIMER --act set --outname S43

Filtering to repeated sequences

CollapseSeq stores the count of duplicate reads for each sequence in the DUPCOUNT annotation. Following duplicate removal, the data is subset to only those unique sequence with at least two representative reads by using the group subcommand of SplitSeq.py on the count field (-f DUPCOUNT) and specifying a numeric threshold (--num 2):

15
SplitSeq.py group -s S43_collapse-unique.fastq -f DUPCOUNT --num 2 --outname S43

Creating an annotation table

Finally, the annotations, including the sample barcode (MID), duplicate read count (DUPCOUNT), isotype (CPRIMER) and V-region primer (VPRIMER), of the final repertoire are then extracted from the SplitSeq.py output into a tab-delimited file using the table subcommand of ParseHeaders.py:

16
ParseHeaders.py table -s S43_atleast-2.fastq -f ID DUPCOUNT MID CPRIMER VPRIMER

Note

Optionally, you may split each sample into separate files using the MID annotation and an alternate invocation of SplitSeq.py. The group subcommand may be used to split files on a categorical field, rather than a numerical field, by skipping the --num argument:

SplitSeq.py group -s M1_collapse-unique.fastq -f MID

Will split the unique sequence file into a set of separate files according the the valud in the MID field (-f MID), such that each file will contain sequences from only one sample.

Output files

The final set of sequences, which serve as input to a V(D)J reference aligner (Eg, IMGT/HighV-QUEST or IgBLAST), and tables that can be plotted for quality control are:

File

Description

S43_collapse-unique.fastq

Total unique sequences

S43_atleast-2.fastq

Unique sequences represented by at least 2 reads

S43_atleast-2_headers.tab

Annotation table of the atleast-2 file

FSL_table.tab

Table of the FilterSeq-length log

FSQ_table.tab

Table of the FilterSeq-quality log

MPM_table.tab

Table of the MID MaskPrimers log

MPV_table.tab

Table of the V-segment MaskPrimers log

MPC_table.tab

Table of the C-region MaskPrimers log

A number of other intermediate and log files are generated during the workflow, which allows easy tracking/reversion of processing steps. These files are not listed in the table above.

Performance

Example performance statistics for a comparable, but larger, 454 workflow are presented below. Performance was measured on a 64-core system with 2.3GHz AMD Opteron(TM) 6276 processors and 512GB of RAM, with memory usage measured at peak utilization. The data set contained 1,346,039 raw reads, and required matching of 11 sample barcodes, 11 V-segment primers, and 5 C-region primers.

Line

Tool

Reads

Cores

MB

Minutes

01

FilterSeq.py length

1,346,039

10

669

12.2

02

FilterSeq.py quality

1,184,905

10

619

11.5

03

MaskPrimers.py score

1,184,219

10

621

18.5

04

MaskPrimers.py align

1,177,245

10

618

263.1

05

MaskPrimers.py align

959,920

10

563

122.9

07

ParseHeaders.py expand

548,658

1

329

3.6

08

ParseHeaders.py rename

548,658

1

329

4.1

09

CollapseSeq.py

548,658

1

1,509

5.6

Illumina MiSeq 2x250 BCR mRNA

Overview of Experimental Data

This example uses publicly available data from:

Quantitative assessment of the robustness of next-generation sequencing of antibody variable gene repertoires from immunized mice.
Greiff, V. et al.
BMC Immunol. 2014. 15(1):40. doi:10.1186/s12865-014-0040-5.

Which may be downloaded from the EBI European Nucleotide Archive under accession ID: ERP003950. For this example, we will use the first 25,000 sequences of sample Replicate-1-M1 (accession: ERR346600), which may downloaded using fastq-dump from the SRA Toolkit:

fastq-dump --split-files -X 25000 ERR346600

Primers sequences are available in additional file 1 of the publication.

Read Configuration

_images/Greiff2014_ReadConfiguration.svg

Schematic of Illumina MiSeq 2x250 stranded paired-end reads without UMI barcodes. Each 250 base-pair read was sequenced from one end of the target cDNA, so that the two reads together cover the entire variable region of the Ig heavy chain. The V(D)J reading frame proceeds from the start of read 2 to the start of read 1. Read 1 is in the opposite orientation (reverse complement), and contains the C-region primer sequence. Both reads begin with a random sequences of 4 nucleotides.

Example Data

We have hosted a small subset of the data (Accession: ERR346600) on the pRESTO website in FASTQ format, with accompanying primer files and an example workflow script. The sample data set and workflow script may be downloaded from here:

Greiff et al, 2014 Example Files

Overview of the Workflow

In the following sections, we demonstrate each step of the workflow to move from raw sequence reads to a fully annotated repertoire of complete V(D)J sequences. The workflow is divided into four high-level tasks:

A graphical representation of the workflow along with the corresponding sequence of pRESTO commands is shown below.

Flowchart

_images/Greiff2014_Flowchart.svg

Flowchart of processing steps. Each pRESTO tool is shown as a colored box. The workflow is divided into three primary tasks: (orange) paired-end assembly, (green) quality control and primer annotation, and deduplication and filtering (blue). The intermediate files output by each tool are not shown for the sake of brevity.

Commands

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
#!/usr/bin/env bash
AssemblePairs.py align -1 ERR346600_2.fastq -2 ERR346600_1.fastq \
    --coord sra --rc tail --outname M1 --log AP.log
FilterSeq.py quality -s M1_assemble-pass.fastq -q 20 --outname M1 --log FS.log
MaskPrimers.py score -s M1_quality-pass.fastq -p Greiff2014_VPrimers.fasta \
    --start 4 --mode mask --pf VPRIMER --outname M1-FWD --log MPV.log
MaskPrimers.py score -s M1-FWD_primers-pass.fastq -p Greiff2014_CPrimers.fasta \
    --start 4 --mode cut --revpr --pf CPRIMER --outname M1-REV --log MPC.log
CollapseSeq.py -s M1-REV_primers-pass.fastq -n 20 --inner --uf CPRIMER \
    --cf VPRIMER --act set --outname M1
SplitSeq.py group -s M1_collapse-unique.fastq -f DUPCOUNT --num 2 --outname M1
ParseHeaders.py table -s M1_atleast-2.fastq -f ID DUPCOUNT CPRIMER VPRIMER
ParseLog.py -l AP.log -f ID LENGTH OVERLAP ERROR PVALUE
ParseLog.py -l FS.log -f ID QUALITY
ParseLog.py -l MPV.log MPC.log -f ID PRIMER ERROR

Download Commands

Paired-end assembly

Each set of paired-ends mate-pairs is first assembled into a full length Ig sequence using the align subcommand of the AssemblePairs.py tool:

2
3
AssemblePairs.py align -1 ERR346600_2.fastq -2 ERR346600_1.fastq \
    --coord sra --rc tail --outname M1 --log AP.log

During assembly we have defined read 2 (V-region) as the head of the sequence (-1) and read 1 as the tail of the sequence (-2). The --coord argument defines the format of the sequence header so that AssemblePairs.py can properly identify mate-pairs; in this case, we use --coord sra as our headers are in the SRA/ENA format.

Note

For both the AssemblePairs.py and PairSeq.py commands using the correct --coord argument is critical for matching mate-pairs. If this was raw data from Illumina, rather than data downloaded from SRA/ENA, then the appropriate argument would be --coord illumina.

The ParseLog.py tool is then used to build a tab-delimited file of results from the AssemblePairs.py log file:

13
ParseLog.py -l AP.log -f ID LENGTH OVERLAP ERROR PVALUE

Which will containing the following columns:

Field

Description

ID

Sequence name

LENGTH

Length of the assembled sequence

OVERLAP

Length of the overlap between mate-pairs

ERROR

Mismatch rate of the overlapping region

PVALUE

P-value for the assembly

See also

Depending on the amplicon length in your data, not all mate-pairs may overlap. For the sake of simplicity, we have excluded a demonstration of assembly in such cases. pRESTO provides a couple approaches to deal with such reads. The reference subcommand of AssemblePairs.py can use the ungapped V-segment reference sequences to properly space non-overlapping reads. Or, if all else fails, the join subcommand can be used to simply stick mate-pairs together end-to-end with some intervening gap.

Quality control and primer annotation

Removal of low quality reads

Quality control begins with the identification and removal of low-quality reads using the quality subcommand of the FilterSeq.py tool. In this example, reads with mean Phred quality scores less than 20 (-q 20) are removed:

4
FilterSeq.py quality -s M1_assemble-pass.fastq -q 20 --outname M1 --log FS.log

The ParseLog.py tool is then used to build tab-delimited file from the FilterSeq.py log:

14
ParseLog.py -l FS.log -f ID QUALITY

Capturing the following annotations:

Field

Description

ID

Sequence name

QUALITY

Quality score

Read annotation and masking of primer regions

When dealing with Ig sequences, it is important to cut or mask the primers, as B cell receptors are subject to somatic hypermutation (the accumulation of point mutations in the DNA) and degenerate primer matches can look like mutations in downstream applications. The score subcommand of MaskPrimers.py is used to identify and remove the V-segment and C-region PCR primers for both reads:

5
6
7
8
MaskPrimers.py score -s M1_quality-pass.fastq -p Greiff2014_VPrimers.fasta \
    --start 4 --mode mask --pf VPRIMER --outname M1-FWD --log MPV.log
MaskPrimers.py score -s M1-FWD_primers-pass.fastq -p Greiff2014_CPrimers.fasta \
    --start 4 --mode cut --revpr --pf CPRIMER --outname M1-REV --log MPC.log

In this data set the authors have added a random sequence of 4 bp to the start of each read before the primer sequence to increase sequence diversity and the reliability of cluster calling on the Illumina platform. As such, both primers begin at position 4 (--start 4), but the C-region primer begins 4 bases from the end of the assembled read. The addition of the --revpr argument to the second MaskPrimers.py step instructs the tool to reverse complement the primer sequences and check the tail of the read. The two primer regions have also been treated differently. The V-segment primer has been masked (replaced by Ns) using the --mode mask argument to preserve the V(D)J length, while the C-region primer has been removed from the sequence using the --mode cut argument.

During each iteration of the MaskPrimers.py tool an annotation field was added with the name of the primer, with the name of the field specified by the --pf argument, such that after both iterations each sequence contains an annotation of the form:

VPRIMER=V-segment primer|CPRIMER=C-region primer

To summarize these steps, the ParseLog.py tool is used to build tab-delimited files from the two MaskPrimers.py logs:

15
ParseLog.py -l MPV.log MPC.log -f ID PRIMER ERROR

Capturing the following annotations:

Field

Description

ID

Sequence name

PRIMER

Primer name

ERROR

Primer match error rate

Note

This library was prepared in a stranded manner. Meaning, the read orientation is constant for all reads; read 1 is always the C-region end of the amplicon and read 2 is always the V-segment end. If your data is unstranded (50% of the reads are forward, 50% are reversed), then you must modify the first MaskPrimers.py step to account for this by using the align subcommand instead:

MaskPrimers.py align -s M1*quality-pass.fastq -p Greiff2014_VPrimers.fasta \
    --maxlen 30 --mode mask --pf VPRIMER --log MP1.log

This will perform a slower process of locally aligning the primers, checking the reverse compliment of each read for matches, and correcting the the output sequences to the forward orientation (V to J).

Deduplication and filtering

Removal of duplicate sequences

The last stage of the workflow involves two filtering steps to yield the final repertoire. First, the set of unique sequences is identified using the CollapseSeq.py tool, allowing for up to 20 interior N-valued positions (-n 20 and --inner), and requiring that all reads considered duplicates share the same C-region primer annotation (--uf CPRIMER). Additionally, the V-segment primer annotations of the set of duplicate reads are propagated into the annotation of each retained unique sequence (--cf VPRIMER and --act set):

 9
10
CollapseSeq.py -s M1-REV_primers-pass.fastq -n 20 --inner --uf CPRIMER \
    --cf VPRIMER --act set --outname M1

Filtering to repeated sequences

CollapseSeq stores the count of duplicate reads for each sequence in the DUPCOUNT annotation. Following duplicate removal, the data is subset to only those unique sequence with at least two representative reads by using the group subcommand of SplitSeq.py on the count field (-f DUPCOUNT) and specifying a numeric threshold (--num 2):

11
SplitSeq.py group -s M1_collapse-unique.fastq -f DUPCOUNT --num 2 --outname M1

Creating an annotation table

Finally, the annotations, including duplicate read count (DUPCOUNT), isotype (CPRIMER) and V-segment primer (VPRIMER), of the final repertoire are then extracted from the SplitSeq.py output into a tab-delimited file using the table subcommand of ParseHeaders.py:

12
ParseHeaders.py table -s M1_atleast-2.fastq -f ID DUPCOUNT CPRIMER VPRIMER

Output files

The final set of sequences, which serve as input to a V(D)J reference aligner (Eg, IMGT/HighV-QUEST or IgBLAST), and tables that can be plotted for quality control are:

File

Description

M1_collapse-unique.fastq

Total unique sequences

M1_atleast-2.fastq

Unique sequences represented by at least 2 reads

M1_atleast-2_headers.tab

Annotation table of the atleast-2 file

AP_table.tab

Table of the AssemblePairs log

FS_table.tab

Table of the FilterSeq log

MPV_table.tab

Table of the V-segment MaskPrimers log

MPC_table.tab

Table of the C-region MaskPrimers log

A number of other intermediate and log files are generated during the workflow, which allows easy tracking/reversion of processing steps. These files are not listed in the table above.

UMI Barcoded Illumina MiSeq 2x250 BCR mRNA

Overview of Experimental Data

This example uses publicly available data from:

B cells populating the multiple sclerosis brain mature in the draining cervical lymph nodes.
Stern JNH, Yaari G, and Vander Heiden JA, et al.
Sci Transl Med. 2014. 6(248):248ra107. doi:10.1126/scitranslmed.3008879.

Which may be downloaded from the NCBI Sequence Read Archive under BioProject accession ID: PRJNA248475. For this example, we will use the first 25,000 sequences of sample M12 (accession: SRR1383456), which may downloaded downloaded using fastq-dump from the SRA Toolkit:

fastq-dump --split-files -X 25000 SRR1383456

Primers sequences are available online at the supplemental website for the publication.

Read Configuration

_images/Stern2014_ReadConfiguration.svg

Schematic of the Illumina MiSeq 2x250 paired-end reads with UMI barcodes. Each 250 base-pair read was sequenced from one end of the target cDNA, so that the two reads together cover the entire variable region of the Ig heavy chain. The V(D)J reading frame proceeds from the start of read 2 to the start of read 1. Read 1 is in the opposite orientation (reverse complement), and contains a 15 nucleotide UMI barcode preceding the C-region primer sequence.

Example Data

We have hosted a small subset of the data (Accession: SRR1383456) on the pRESTO website in FASTQ format with accompanying primer files. The sample data set and workflow script may be downloaded from here:

Stern, Yaari and Vander Heiden et al, 2014 Example Files

Overview of the Workflow

In the following sections, we demonstrate each step of the workflow to move from raw sequence reads to a fully annotated repertoire of complete V(D)J sequences. The workflow is divided into four high-level tasks:

A graphical representation of the workflow along with the corresponding sequence of pRESTO commands is shown below.

Flowchart

_images/Stern2014_Flowchart.svg

Flowchart of processing steps. Each pRESTO tool is shown as a colored box. The workflow is divided into four primary tasks: (red) quality control, UMI annotation and primer masking; (orange) generation of UMI consensus sequences; (green) paired-end assembly of UMI consensus sequences; and (blue) deduplication and filtering to obtain the high-fidelity repertoire. Grey boxes indicate the initial and final data files. The intermediate files output by each tool are not shown for the sake of brevity.

Commands

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
#!/usr/bin/env bash
FilterSeq.py quality -s SRR1383456_1.fastq -q 20 --outname MS12_R1 --log FS1.log
FilterSeq.py quality -s SRR1383456_2.fastq -q 20 --outname MS12_R2 --log FS2.log
MaskPrimers.py score -s MS12_R1_quality-pass.fastq -p Stern2014_CPrimers.fasta \
    --start 15 --mode cut --barcode --outname MS12_R1 --log MP1.log
MaskPrimers.py score -s MS12_R2_quality-pass.fastq -p Stern2014_VPrimers.fasta \
    --start 0 --mode mask --outname MS12_R2 --log MP2.log
PairSeq.py -1 MS12_R1_primers-pass.fastq -2 MS12_R2_primers-pass.fastq \
    --1f BARCODE --coord sra
BuildConsensus.py -s MS12_R1_primers-pass_pair-pass.fastq --bf BARCODE --pf PRIMER \
    --prcons 0.6 --maxerror 0.1 --maxgap 0.5 --outname MS12_R1 --log BC1.log
BuildConsensus.py -s MS12_R2_primers-pass_pair-pass.fastq --bf BARCODE --pf PRIMER \
    --maxerror 0.1 --maxgap 0.5 --outname MS12_R2 --log BC2.log
PairSeq.py -1 MS12_R1_consensus-pass.fastq -2 MS12_R2_consensus-pass.fastq \
    --coord presto
AssemblePairs.py align -1 MS12_R2_consensus-pass_pair-pass.fastq \
    -2 MS12_R1_consensus-pass_pair-pass.fastq --coord presto --rc tail \
    --1f CONSCOUNT --2f CONSCOUNT PRCONS --outname MS12 --log AP.log
ParseHeaders.py collapse -s MS12_assemble-pass.fastq -f CONSCOUNT --act min
CollapseSeq.py -s MS12*reheader.fastq -n 20 --inner --uf PRCONS \
    --cf CONSCOUNT --act sum --outname MS12
SplitSeq.py group -s MS12_collapse-unique.fastq -f CONSCOUNT --num 2 --outname MS12
ParseHeaders.py table -s MS12_atleast-2.fastq -f ID PRCONS CONSCOUNT DUPCOUNT
ParseLog.py -l FS1.log FS2.log -f ID QUALITY
ParseLog.py -l MP1.log MP2.log -f ID PRIMER BARCODE ERROR
ParseLog.py -l BC1.log BC2.log -f BARCODE SEQCOUNT CONSCOUNT PRIMER PRCONS PRCOUNT \
    PRFREQ ERROR
ParseLog.py -l AP.log -f ID LENGTH OVERLAP ERROR PVALUE FIELDS1 FIELDS2

Download Commands

Quality control, UMI annotation and primer masking

Removal of low quality reads

Quality control begins with the identification and removal of low-quality reads using the quality subcommand of the FilterSeq.py tool. In this example, reads with mean Phred quality scores less than 20 (-q 20) are removed:

2
3
FilterSeq.py quality -s SRR1383456_1.fastq -q 20 --outname MS12_R1 --log FS1.log
FilterSeq.py quality -s SRR1383456_2.fastq -q 20 --outname MS12_R2 --log FS2.log

The ParseLog.py tool is then used to extract results from the FilterSeq.py logs into tab-delimited files:

24
ParseLog.py -l FS1.log FS2.log -f ID QUALITY

Extracting the following information from the log:

Field

Description

ID

Sequence name

QUALITY

Quality score

UMI annotation and masking of primer regions

Next, the score subcommand of MaskPrimers.py is used to identify and remove the PCR primers for both reads. When dealing with Ig sequences, it is important to cut or mask the primers, as B cell receptors are subject to somatic hypermutation (the accumulation of point mutations in the DNA) and degenerate primer matches can look like mutations in downstream applications. The MaskPrimers tool is also used to annotate each read 1 sequence with the 15 nucleotide UMI that precedes the C-region primer (MaskPrimers score --barcode):

4
5
6
7
MaskPrimers.py score -s MS12_R1_quality-pass.fastq -p Stern2014_CPrimers.fasta \
    --start 15 --mode cut --barcode --outname MS12_R1 --log MP1.log
MaskPrimers.py score -s MS12_R2_quality-pass.fastq -p Stern2014_VPrimers.fasta \
    --start 0 --mode mask --outname MS12_R2 --log MP2.log

To summarize these steps, the ParseLog.py tool is used to build a tab-delimited file from the MaskPrimers.py log:

25
ParseLog.py -l MP1.log MP2.log -f ID PRIMER BARCODE ERROR

Containing the following information:

Field

Description

ID

Sequence name

PRIMER

Primer name

BARCODE

UMI sequence

ERROR

Primer match error rate

Note

For this data set the UMI is immediately upstream of the C-region primer. Another common approach for UMI barcoding involves placing the UMI immediately upstream of a 5’RACE template switch site. Modifying the workflow is simple for this case. You just need to replace the V-segment primers with a fasta file containing the TS sequences and move the --barcode argument to the appropriate read:

MaskPrimers.py score -s R1_quality-pass.fastq -p CPrimers.fasta \
    --start 0 --mode cut --outname R1 --log MP1.log
MaskPrimers.py score -s R2_quality-pass.fastq -p TSSites.fasta \
    --start 17 --barcode --mode cut --maxerror 0.5 \
    --outname R2 --log MP2.log

In the above we have moved the UMI annotation to read 2, increased the allowable error rate for matching the TS site (--maxerror 0.5), cut the TS site (--mode cut), and increased the size of the UMI from 15 to 17 nucleotides (--start 17).

Generation of UMI consensus sequences

Copying the UMI annotation across paired-end files

In this task, a single consensus sequence is constructed for each set of reads annotated with the same UMI barcode. As the UMI barcode is part of read 1, the BARCODE annotation identified by MaskPrimers.py must first be copied to the read 2 mate-pair of each read 1 sequence. Propogation of annotations between mate pairs is performed using PairSeq.py which also removes unpaired reads and ensures that paired reads are sorted in the same order across files:

8
9
PairSeq.py -1 MS12_R1_primers-pass.fastq -2 MS12_R2_primers-pass.fastq \
    --1f BARCODE --coord sra

Note

For both the PairSeq.py and AssemblePairs.py commands using the correct --coord argument is critical for matching mate-pairs. If this was raw data from Illumina, rather than data downloaded from SRA/ENA, then the appropriate argument would be --coord illumina.

Note

If you have followed the 5’RACE modification above, then you must also modify the first PairSeq.py step to copy the UMI from read 2 to read 1, instead of vice versa (--2f BARCODE):

PairSeq.py -1 R1_primers-pass.fastq -2 R2_primers-pass.fastq \
    --2f BARCODE --coord sra

Multiple alignment of UMI read groups

Before generating a consensus for a set of reads sharing a UMI barcode, the sequences must be properly aligned. Sequences may not be aligned if more than one PCR primer is identified in a UMI read group - leading to variations in the the start positions of the reads. Ideally, each set of reads originating from a single mRNA molecule should be amplified with the same primer. However, different primers in the multiplex pool may be incorporated into the same UMI read group during amplification if the primers are sufficiently similar. This type of primer misalignment can be corrected using the AlignSets.py tool. In the example data used here, this step was not necessary due to the aligned primer design for the 45 V-segment primers, though this does require that the V-segment primers be masked, rather than cut, during the MaskPrimers.py step (--mode mask).

See also

If your data requires alignment, then you can create multiple aligned UMI read groups as follows:

AlignSets.py muscle -s R1_primers-pass_pair-pass.fastq --bf BARCODE \
    --exec ~/bin/muscle --outname R1 --log AS1.log
AlignSets.py muscle -s R2_primers-pass_pair-pass.fastq --bf BARCODE \
    --exec ~/bin/muscle --outname R2 --log AS2.log

Where the --bf BARCODE defines the field containing the UMI and --exec ~/bin/muscle is the location of the MUSCLE executable.

For additional details see the section on fixing UMI alignments.

Generating UMI consensus reads

After alignment, a single consensus sequence is generated for each UMI barcode using BuildConsensus.py:

10
11
12
13
BuildConsensus.py -s MS12_R1_primers-pass_pair-pass.fastq --bf BARCODE --pf PRIMER \
    --prcons 0.6 --maxerror 0.1 --maxgap 0.5 --outname MS12_R1 --log BC1.log
BuildConsensus.py -s MS12_R2_primers-pass_pair-pass.fastq --bf BARCODE --pf PRIMER \
    --maxerror 0.1 --maxgap 0.5 --outname MS12_R2 --log BC2.log

To correct for UMI chemistry and sequencing errors, UMI read groups having high error statistics (mismatch rate from consensus) are removed by specifiying the --maxerror 0.1 threshold. As the accuracy of the primer assignment in read 1 is critical for correct isotype identification, additional filtering of read 1 is carried out during this step. Specifying the --prcons 0.6 threshold: (a) removes individual sequences that do not share a common primer annotation with the majority of the set, (b) removes entire read groups which have ambiguous primer assignments, and (c) constructs a consensus primer assignment for each UMI.

Note

The --maxgap 0.5 argument tells BuildConsensus.py to use a majority rule to delete any gap positions which occur in more than 50% of the reads. The --maxgap argument is not really necessary for this example data set as we did not perform a multiple alignment of the UMI read groups. However, if you have performed an alignment, then use of --maxgap during consensus generation is highly recommended.

The ParseLog.py tool is then used to build a tab-delimited file contain the consensus results:

26
ParseLog.py -l BC1.log BC2.log -f BARCODE SEQCOUNT CONSCOUNT PRIMER PRCONS PRCOUNT \

With the following annotations:

Field

Description

BARCODE

UMI sequence

SEQCOUNT

Number of total reads in the UMI group

CONSCOUNT

Number of reads used for the UMI consensus

PRIMER

Set of primer names in the UMI group

PRCONS

Consensus primer name

PRCOUNT

Count of primers in the UMI group

PRFREQ

Frequency of primers in the UMI group

ERROR

Average mismatch rate from consensus

Paired-end assembly of UMI consensus sequences

Syncronizing paired-end files

Following UMI consensus generation, the read 1 and read 2 files may again be out of sync due to differences in UMI read group filtering by BuildConsensus.py. To synchronize the reads another instance of PairSeq.py must be run, but without any annotation manipulation:

14
15
PairSeq.py -1 MS12_R1_consensus-pass.fastq -2 MS12_R2_consensus-pass.fastq \
    --coord presto

Assembling UMI consensus mate-pairs

Once the files have been synchronized, each paired-end UMI consensus sequence is assembled into a full length Ig sequence using the align subcommand of AssemblePairs.py:

16
17
18
AssemblePairs.py align -1 MS12_R2_consensus-pass_pair-pass.fastq \
    -2 MS12_R1_consensus-pass_pair-pass.fastq --coord presto --rc tail \
    --1f CONSCOUNT --2f CONSCOUNT PRCONS --outname MS12 --log AP.log

During assembly, the consensus isotype annotation (PRCONS) from read 1 and the number of reads used to define the consensus sequence (CONSCOUNT) for both reads are propagated into the annotations of the full length Ig sequence (--1f CONSCOUNT --2f CONSCOUNT PRCONS.

ParseLog.py is then uses to extract the results from the AssemblePairs.py log into a tab-delimited file:

27
    PRFREQ ERROR

Containing the following information:

Field

Description

ID

Sequence name (UMI)

LENGTH

Length of the assembled sequence

OVERLAP

Length of the overlap between mate-pairs

ERROR

Mismatch rate of the overlapping region

PVALUE

P-value for the assembly

FIELDS1

Annotations copied from read 2 into the assembled sequence

FIELDS2

Annotations copied from read 1 into the assembled sequence

See also

Depending on the amplicon length in your data, not all mate-pairs may overlap. For the sake of simplicity, we have excluded a demonstration of assembly in such cases. pRESTO provides a couple approaches to deal with such reads. The reference subcommand of AssemblePairs.py can use the ungapped V-segment reference sequences to properly space non-overlapping reads. Or, if all else fails, the join subcommand can be used to simply stick mate-pairs together end-to-end with some intervening gap.

Deduplication and filtering

Combining UMI read group size annotations

In the final stage of the workflow, the high-fidelity Ig repertoire is obtained by a series of filtering steps. First, the annotation specifying the number of raw reads used to build each sequence (-f CONSCOUNT) is updated to be the minimum (--act min) of the forward and reverse reads using the collapse subcommand of ParseHeaders.py:

19
ParseHeaders.py collapse -s MS12_assemble-pass.fastq -f CONSCOUNT --act min

Removal of duplicate sequences

Second, duplicate nucleotide sequences are removed using the CollapseSeq.py tool with the requirement that duplicate sequences share the same isotype primer (--uf PRCONS). The duplicate removal step also removes sequences with a high number of interior N-valued nucleotides (-n 20 and --inner) and combines the read counts for each UMI read group (--cf CONSCOUNT and --act sum).

20
21
CollapseSeq.py -s MS12*reheader.fastq -n 20 --inner --uf PRCONS \
    --cf CONSCOUNT --act sum --outname MS12

Filtering to sequences with at least two representative reads

Finally, unique sequences are filtered to those with at least 2 contributing sequences using the group subcommand of SplitSeq.py, by splitting the file on the CONSCOUNT annotation with a numeric threshold (-f CONSCOUNT and --num 2):

22
SplitSeq.py group -s MS12_collapse-unique.fastq -f CONSCOUNT --num 2 --outname MS12

Creating an annotation table

For further analysis, the annotations of the final repertoire are then converted to into a table using the table subcommand of ParseHeaders.py:

23
ParseHeaders.py table -s MS12_atleast-2.fastq -f ID PRCONS CONSCOUNT DUPCOUNT

Output files

The final set of sequences, which serve as input to a V(D)J reference aligner (Eg, IMGT/HighV-QUEST or IgBLAST), and tables that can be plotted for quality control are:

File

Description

M12_collapse-unique.fastq

Total unique sequences

M12_atleast-2.fastq

Unique sequences represented by at least 2 reads

M12_atleast-2_headers.tab

Annotation table of the atleast-2 file

FS1_table.tab

Table of the read 1 FilterSeq log

FS2_table.tab

Table of the read 2 FilterSeq log

MP1_table.tab

Table of the C-region MaskPrimers log

MP2_table.tab

Table of the V-segment MaskPrimers log

BC1_table.tab

Table of the read 1 BuildConsensus log

BC2_table.tab

Table of the read 2 BuildConsensus log

AP_table.tab

Table of the AssemblePairs log

A number of other intermediate and log files are generated during the workflow, which allows easy tracking/reversion of processing steps. These files are not listed in the table above.

Performance

Example performance statistics for a comparable, but larger, MiSeq workflow are presented below. Performance was measured on a 64-core system with 2.3GHz AMD Opteron(TM) 6276 processors and 512GB of RAM, with memory usage measured at peak utilization. The data set contained 1,723,558 x 2 raw reads, and required matching of 1 C-region primer, 45 V-segment primers, and averaged 24.3 reads per UMI.

Line

Tool

Reads

Cores

MB

Minutes

01

R1: FilterSeq.py quality

1,723,558

10

1,219

13.0

02

R2: FilterSeq.py quality

1,723,558

10

1,219

12.8

03

R1: MaskPrimers.py score

1,722,116

10

1,221

18.9

04

R2: MaskPrimers.py score

1,684,050

10

1,221

46.7

06

PairSeq.py

1,665,584

1

4,734

25.5

07

R1: BuildConsensus.py

1,565,017

10

1,228

50.1

08

R2: BuildConsensus.py

1,565,017

10

1,229

58.6

11

AssemblePairs.py align

66,285

10

358

3.0

13

ParseHeaders.py collapse

56,104

1

88

0.4

14

CollapseSeq.py

55,480

1

822

0.7

15

SplitSeq.py group

51,047

1

88

0.2

UMI Barcoded Illumina MiSeq 325+275 paired-end 5’RACE BCR mRNA

Overview of Experimental Data

This example uses publicly available data from:

Dysregulation of B Cell Repertoire Formation in Myasthenia Gravis Patients Revealed through Deep Sequencing.
Vander Heiden JA, et al.
J Immunol. 2017 198(4):1460-1473. doi:10.4049/jimmunol.1601415.

which may be downloaded from the NCBI Sequence Read Archive under BioProject accession ID: PRJNA248475. For this example, we will use the first 25,000 sequences of sample HD09_N_AB8KB (accession: SRR4026043), which may downloaded using fastq-dump from the SRA Toolkit:

fastq-dump --split-files -X 25000 SRR4026043

Primer sequences are available online from the protocols/AbSeq directory in the Immcantation repository.

Read Configuration

_images/VanderHeiden2017_ReadConfiguration.svg

Schematic of the Illumina MiSeq 325+275 paired-end reads with UMI barcodes. Each read was sequenced from one end of the target cDNA so that the two reads together cover the entire variable region of the Ig heavy chain. Sequencing was performed using an uneven number of cycles for read 1 and read 2 using a 2x300 kit. The V(D)J reading frame proceeds from the start of read 2 to the start of read 1. Read 1 is in the opposite orientation (reverse complement), contains a partial C-region, and is 325 nucleotides in length. Read 2 contains the 5’RACE template switch site with a 17 nucleotide UMI barcode preceding it, and is 275 nucleotides in length.

Example Data

We have hosted a small subset of the data (Accession: SRR1383456) on the pRESTO website in FASTQ format with accompanying primer files. The sample data set and workflow script may be downloaded from here:

Vander Heiden et al, 2017 Example Files

Overview of the Workflow

In the following sections, we demonstrate each step of the workflow to move from raw sequence reads to a fully annotated repertoire of complete V(D)J sequences. The workflow is divided into four high-level tasks:

A graphical representation of the workflow along with the corresponding sequence of pRESTO commands is shown below.

Flowchart

_images/VanderHeiden2017_Flowchart.svg

Flowchart of processing steps. Each pRESTO tool is shown as a colored box. The workflow is divided into four primary tasks: (red) quality control, UMI annotation and primer masking; (orange) generation of UMI consensus sequences; (green) paired-end assembly of UMI consensus sequences; and (blue) deduplication and filtering to obtain the high-fidelity repertoire. Grey boxes indicate the initial and final data files. The intermediate files output by each tool are not shown for the sake of brevity.

Commands

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
#!/usr/bin/env bash
FilterSeq.py quality -s SRR4026043_1.fastq -q 20 --outname HD09N-R1 --log FS1.log
FilterSeq.py quality -s SRR4026043_2.fastq -q 20 --outname HD09N-R2 --log FS2.log
MaskPrimers.py score -s HD09N-R1_quality-pass.fastq -p AbSeq_R1_Human_IG_Primers.fasta \
    --start 0 --mode cut --outname HD09N-R1 --log MP1.log
MaskPrimers.py score -s HD09N-R2_quality-pass.fastq -p AbSeq_R2_TS.fasta \
    --start 17 --barcode --mode cut --maxerror 0.5 --outname HD09N-R2 --log MP2.log
PairSeq.py -1 HD09N-R1_primers-pass.fastq -2 HD09N-R2_primers-pass.fastq \
    --2f BARCODE --coord sra
BuildConsensus.py -s HD09N-R1_primers-pass_pair-pass.fastq --bf BARCODE --pf PRIMER \
    --prcons 0.6 --maxerror 0.1 --maxgap 0.5 --outname HD09N-R1 --log BC1.log
BuildConsensus.py -s HD09N-R2_primers-pass_pair-pass.fastq --bf BARCODE \
    --maxerror 0.1 --maxgap 0.5 --outname HD09N-R2 --log BC2.log
PairSeq.py -1 HD09N-R1_consensus-pass.fastq -2 HD09N-R2_consensus-pass.fastq \
    --coord presto
AssemblePairs.py sequential -1 HD09N-R2_consensus-pass_pair-pass.fastq \
    -2 HD09N-R1_consensus-pass_pair-pass.fastq -r IMGT_Human_IG_V.fasta \
    --coord presto --rc tail --scanrev --1f CONSCOUNT --2f CONSCOUNT PRCONS \
    --aligner blastn --outname HD09N-C --log AP.log
MaskPrimers.py align -s HD09N-C_assemble-pass.fastq \
    -p AbSeq_Human_IG_InternalCRegion.fasta --maxlen 100 --maxerror 0.3 \
    --mode tag --revpr --skiprc --pf CREGION --outname HD09N-C --log MP3.log 
ParseHeaders.py collapse -s HD09N-C_primers-pass.fastq -f CONSCOUNT --act min
CollapseSeq.py -s HD09N-C_primers-pass_reheader.fastq -n 20 --inner \
    --uf CREGION --cf CONSCOUNT --act sum --outname HD09N-C
SplitSeq.py group -s HD09N-C_collapse-unique.fastq \
    -f CONSCOUNT --num 2 --outname HD09N-C
ParseHeaders.py table -s HD09N-C_atleast-2.fastq -f ID CREGION CONSCOUNT DUPCOUNT
ParseLog.py -l FS1.log FS2.log -f ID QUALITY
ParseLog.py -l MP1.log MP2.log MP3.log -f ID PRIMER BARCODE ERROR
ParseLog.py -l BC1.log BC2.log -f BARCODE SEQCOUNT CONSCOUNT PRCONS PRFREQ ERROR
ParseLog.py -l AP.log -f ID REFID LENGTH OVERLAP GAP ERROR IDENTITY

Download Commands

Quality control, UMI annotation and primer masking

Removal of low quality reads

Quality control begins with the identification and removal of low-quality reads using the quality subcommand of the FilterSeq.py tool. In this example, reads with mean Phred quality scores less than 20 (-q 20) are removed:

2
3
FilterSeq.py quality -s SRR4026043_1.fastq -q 20 --outname HD09N-R1 --log FS1.log
FilterSeq.py quality -s SRR4026043_2.fastq -q 20 --outname HD09N-R2 --log FS2.log

The ParseLog.py tool is then used to extract results from the FilterSeq.py logs into tab-delimited files:

29
ParseLog.py -l FS1.log FS2.log -f ID QUALITY

Extracting the following information from the log:

Field

Description

ID

Sequence name

QUALITY

Quality score

UMI annotation and removal of primer regions

Next, the score subcommand of MaskPrimers.py is used to identify and remove the PCR primers for both reads. When dealing with Ig sequences, it is important to cut or mask the primers, as B cell receptors are subject to somatic hypermutation (the accumulation of point mutations in the DNA) and degenerate primer matches can look like mutations in downstream applications. The MaskPrimers tool is also used to annotate each read 2 sequence with the 17 nucleotide UMI that precedes the template switch site. (MaskPrimers score --barcode):

4
5
6
7
MaskPrimers.py score -s HD09N-R1_quality-pass.fastq -p AbSeq_R1_Human_IG_Primers.fasta \
    --start 0 --mode cut --outname HD09N-R1 --log MP1.log
MaskPrimers.py score -s HD09N-R2_quality-pass.fastq -p AbSeq_R2_TS.fasta \
    --start 17 --barcode --mode cut --maxerror 0.5 --outname HD09N-R2 --log MP2.log

To summarize these steps, the ParseLog.py tool is used to build a tab-delimited file from the MaskPrimers.py log:

30
ParseLog.py -l MP1.log MP2.log MP3.log -f ID PRIMER BARCODE ERROR

Containing the following information:

Field

Description

ID

Sequence name

PRIMER

Primer name

BARCODE

UMI sequence

ERROR

Primer match error rate

Note

For this data, we are using the 5’RACE template switch sequences as proxy primers. We’ve set the match error rate extremely high (--maxerror 0.5) because an accurate match isn’t important. Mostly, we are just concerned with extracting the UMI barcode that precedes the template switch site.

Generation of UMI consensus sequences

Copying the UMI annotation across paired-end files

In this task, a single consensus sequence is constructed for each set of reads annotated with the same UMI barcode. As the UMI barcode is part of read 2, the BARCODE annotation identified by MaskPrimers.py must first be copied to the read 1 mate-pair of each read 2 sequence. Propogation of annotations between mate pairs is performed using PairSeq.py which also removes unpaired reads and ensures that paired reads are sorted in the same order across files:

8
9
PairSeq.py -1 HD09N-R1_primers-pass.fastq -2 HD09N-R2_primers-pass.fastq \
    --2f BARCODE --coord sra

Note

For both the PairSeq.py and AssemblePairs.py commands using the correct --coord argument is critical for matching mate-pairs. If this was raw data from Illumina, rather than data downloaded from SRA, then the appropriate argument would be --coord illumina.

Multiple alignment of UMI read groups

Before generating a consensus for a set of reads sharing a UMI barcode, the sequences must be properly aligned. Sequences may not be aligned if more than one PCR primer is identified in a UMI read group - leading to variations in the the start positions of the reads. Ideally, each set of reads originating from a single mRNA molecule should be amplified with the same primer. However, different primers in the multiplex pool may be incorporated into the same UMI read group during amplification if the primers are sufficiently similar. This type of primer misalignment can be corrected using the AlignSets.py tool.

This step is not required for this data, but if necessary for your data see the section on fixing UMI problems.

Generating UMI consensus reads

After alignment, a single consensus sequence is generated for each UMI barcode using BuildConsensus.py:

10
11
12
13
BuildConsensus.py -s HD09N-R1_primers-pass_pair-pass.fastq --bf BARCODE --pf PRIMER \
    --prcons 0.6 --maxerror 0.1 --maxgap 0.5 --outname HD09N-R1 --log BC1.log
BuildConsensus.py -s HD09N-R2_primers-pass_pair-pass.fastq --bf BARCODE \
    --maxerror 0.1 --maxgap 0.5 --outname HD09N-R2 --log BC2.log

To correct for UMI chemistry and sequencing errors, UMI read groups having high error statistics (mismatch rate from consensus) are removed by specifiying the --maxerror 0.1 threshold. Additional filtering of read 1 is carried out during this step by specifying the --prcons 0.6 threshold which: (a) removes individual sequences that do not share a common primer annotation with the majority of the set, (b) removes entire read groups which have ambiguous primer assignments, and (c) constructs a consensus primer assignment for each UMI.

Note

The --maxgap 0.5 argument tells BuildConsensus.py to use a majority rule to delete any gap positions which occur in more than 50% of the reads. The --maxgap argument is not really necessary for this example data set as we did not perform a multiple alignment of the UMI read groups. However, if you have performed an alignment, then use of --maxgap during consensus generation is highly recommended.

The ParseLog.py tool is then used to build a tab-delimited file contain the consensus results:

31
ParseLog.py -l BC1.log BC2.log -f BARCODE SEQCOUNT CONSCOUNT PRCONS PRFREQ ERROR

With the following annotations:

Field

Description

BARCODE

UMI sequence

SEQCOUNT

Number of total reads in the UMI group

CONSCOUNT

Number of reads used for the UMI consensus

PRCONS

Consensus primer name

PRFREQ

Frequency of primers in the UMI group

ERROR

Average mismatch rate from consensus

Paired-end assembly of UMI consensus sequences

Syncronizing paired-end files

Following UMI consensus generation, the read 1 and read 2 files may again be out of sync due to differences in UMI read group filtering by BuildConsensus.py. To synchronize the reads another instance of PairSeq.py must be run, but without any annotation manipulation:

14
15
PairSeq.py -1 HD09N-R1_consensus-pass.fastq -2 HD09N-R2_consensus-pass.fastq \
    --coord presto

Assembling UMI consensus mate-pairs

Once the files have been synchronized, each paired-end UMI consensus sequence is assembled into a full length Ig sequence using the sequential subcommand of AssemblePairs.py. 5’RACE creates long amplicons which are not guaranteed to overlap when sequenced using a 2x300 kit. The sequential subcommand will first attempt de novo paired-end assembly. If that approach fails, it will attempt assembly guided by the V segment reference sequences specified by the -r argument. Mate-pairs that fail to overlap can thus be assembled, with the full length sequence containing an appropriate number of intervening between the two reads.

16
17
18
19
AssemblePairs.py sequential -1 HD09N-R2_consensus-pass_pair-pass.fastq \
    -2 HD09N-R1_consensus-pass_pair-pass.fastq -r IMGT_Human_IG_V.fasta \
    --coord presto --rc tail --scanrev --1f CONSCOUNT --2f CONSCOUNT PRCONS \
    --aligner blastn --outname HD09N-C --log AP.log

During assembly, the consensus primer annotation (PRCONS) from read 1 and the number of reads used to define the consensus sequence (CONSCOUNT) for both reads are propagated into the annotations of the full length Ig sequence (--1f CONSCOUNT --2f CONSCOUNT PRCONS.

ParseLog.py is then uses to extract the results from the AssemblePairs.py log into a tab-delimited file:

32
ParseLog.py -l AP.log -f ID REFID LENGTH OVERLAP GAP ERROR IDENTITY

Containing the following information:

ID REFID LENGTH OVERLAP GAP ERROR IDENTITY

Field

Description

ID

Sequence name (UMI)

REFID

The reference sequence name for reference guided assembly

LENGTH

Length of the assembled sequence

OVERLAP

Length of the overlap between mate-pairs

GAP

Length of the gap between mate-pairs

ERROR

Mismatch rate of the overlapping region from de novo assembly

IDENTITY

Identity score for the aligned region in reference guided assembly

Deduplication and filtering

Annotating sequences with the constant region

In the final stage of the workflow, the high-fidelity Ig repertoire is obtained by a series of filtering steps. First, an additional annotation defining the constant region is added to each sequence by aligning against the constant region sequence internal to the primer. This step is not strictly necessary, as the constant region primer is usually sufficient for isotype calls. However, primer calls may be inaccurate with some primer sets, so this additional confirmatory step is beneficial.

Internal constant regions are assigned using the align subcommand of MaskPrimers.py with a set of manually constructed sequences as the “primers” (-p AbSeq_Human_IG_InternalCRegion.fasta):

20
21
22
MaskPrimers.py align -s HD09N-C_assemble-pass.fastq \
    -p AbSeq_Human_IG_InternalCRegion.fasta --maxlen 100 --maxerror 0.3 \
    --mode tag --revpr --skiprc --pf CREGION --outname HD09N-C --log MP3.log 

Here we use a high error rate for the match (--maxerror 0.3), restrict the match to 100 nucleotides at the end of the sequence (--maxlen 100 --revpr --skiprc), and use mode the argument --mode tag to specify that the input sequence should not be modified. The aligned C-region name will be added to each sequence headers in the CREGION field (--pf CREGION):

Combining UMI read group size annotations

The annotation specifying the number of raw reads used to build each sequence (-f CONSCOUNT) is updated to be the minimum (--act min) of the forward and reverse reads using the collapse subcommand of ParseHeaders.py:

23
ParseHeaders.py collapse -s HD09N-C_primers-pass.fastq -f CONSCOUNT --act min

Removal of duplicate sequences

Second, duplicate nucleotide sequences are removed using the CollapseSeq.py tool with the requirement that duplicate sequences share the same constant region annotation (--uf CREGION). The duplicate removal step also removes sequences with a high number of interior N-valued nucleotides (-n 20 and --inner) and combines the read counts for each UMI read group (--cf CONSCOUNT and --act sum).

24
25
CollapseSeq.py -s HD09N-C_primers-pass_reheader.fastq -n 20 --inner \
    --uf CREGION --cf CONSCOUNT --act sum --outname HD09N-C

Filtering to sequences with at least two representative reads

Finally, unique sequences are filtered to those with at least 2 contributing sequences using the group subcommand of SplitSeq.py, by splitting the file on the CONSCOUNT annotation with a numeric threshold (-f CONSCOUNT and --num 2):

26
27
SplitSeq.py group -s HD09N-C_collapse-unique.fastq \
    -f CONSCOUNT --num 2 --outname HD09N-C

Creating an annotation table

For further analysis, the annotations of the final repertoire are then converted to into a table using the table subcommand of ParseHeaders.py:

28
ParseHeaders.py table -s HD09N-C_atleast-2.fastq -f ID CREGION CONSCOUNT DUPCOUNT

Output files

The final set of sequences, which serve as input to a V(D)J reference aligner (Eg, IMGT/HighV-QUEST or IgBLAST), and tables that can be plotted for quality control are:

File

Description

HD09N-C_collapse-unique.fastq

Total unique sequences

HD09N-C_atleast-2.fastq

Unique sequences represented by at least 2 reads

HD09N-C_atleast-2_headers.tab

Annotation table of the atleast-2 file

FS1_table.tab

Table of the read 1 FilterSeq log

FS2_table.tab

Table of the read 2 FilterSeq log

MP1_table.tab

Table of the C-region primer MaskPrimers log

MP2_table.tab

Table of the TS site MaskPrimers log

MP3_table.tab

Table of the internal C-region MaskPrimers log

BC1_table.tab

Table of the read 1 BuildConsensus log

BC2_table.tab

Table of the read 2 BuildConsensus log

AP_table.tab

Table of the AssemblePairs log

A number of other intermediate and log files are generated during the workflow, which allows easy tracking/reversion of processing steps. These files are not listed in the table above.

Importing Data

Importing data from SRA, ENA, or GenBank

If you have downloaded a data set from GenBank, SRA, or ENA, the format of the sequences headers are different from the raw Roche 454 and Illumina header format. As such, they may or may not be compatible with pRESTO (depending on how the headers have been modified by the sequence archive). ConvertHeaders.py allows you to change incompatible header formats into the pRESTO format. For example, to convert from SRA or ENA headers the sra subcommand would be used:

ConvertHeaders.py sra -s reads.fastq

ConvertHeaders.py provides the following conversion subcommands:

Subcommand

Formats Converted

generic

Headers with an unknown annotation system

454

Roche 454

genbank

NCBI GenBank and RefSeq

illumina

Illumina HiSeq or MiSeq

imgt

IMGT/GENE-DB

migec

Molecular Identifier Guided Error Correction

sra

NCBI SRA or EMBL-EBI ENA

Manipulating Annotations

The ParseHeaders.py tool provides a collection of methods for performing simple manipulations of sequence headers that are formatted in the pRESTO annotation scheme.

For converting sequence headers into the pRESTO format, see the Importing Data documentation.

Adding a sample annotation

Addition of annotation values is accomplished using the add subcommand of ParseHeaders.py:

ParseHeaders.py add -s reads.fastq -f SAMPLE -u A1

which will add the annotation SAMPLE=A1 to each sequence of the input file.

Expanding and renaming annotations

By default, pRESTO will not delete annotations. If a sequence header already contains an annotation that a tool is trying to add, it will not overwrite that annotation. Instead, it will append the annotation value to the values already present in a comma delimited form. For example, after two interations of MaskPrimers.py with the default primer field name PRIMER, you will have an annotation in the following form (reflecting a match against primer VH3 in the first iteration and primer IGHM in the second):

PRIMER=VH3,IGHM

Separating these annotations into two annotations is accomplished via the expand subcommand of ParseHeaders.py:

ParseHeaders.py expand -s reads.fastq -f PRIMER

Resulting in the annotations:

PRIMER1=VH3|PRIMER2=IGHM

which may then be renamed via the rename subcommand: expand subcommand of ParseHeaders.py:

ParseHeaders.py rename -s reads_reheader.fastq -f PRIMER1 PRIMER2 \
    -k VPRIMER CPRIMER

Copying, merging and collapsing annotations

Nested annotations can be generated using the copy or merge subcommands of ParseHeaders.py. The examples that follow will use the starting annotation:

UMI=ATGC|CELL=GGCC|COUNT=10,2

The UMI and CELL annotations can be combined into a single INDEX annotation using the following command:

ParseHeaders.py merge -s reads.fasta -f UMI CELL -k INDEX --delete
# result> COUNT=10,2|INDEX=ATGC,GGCC

Without the --delete argument, the original UMI and CELL annotations would be kept in the header.

The nested annotation values can then be combined using the collapse subcommand to create various effects:

ParseHeaders.py collapse -s reads_reheader.fasta -f INDEX --act cat
# result> INDEX=ATGCGGCC

ParseHeaders.py collapse -s reads_reheader.fasta -f INDEX --act first
# result> INDEX=ATGC

ParseHeaders.py collapse -s reads_reheader.fasta -f COUNT --act sum
# result> COUNT=12

ParseHeaders.py collapse -s reads_reheader.fasta -f COUNT --act min
# result> COUNT=2

where the --act argument specifies the type of collapse action to perform.

The copy subcommand is normally used to create duplicate annotations with different names, but will have a similar effect to the merge subcommand when the target is an existing field:

ParseHeaders.py copy -s reads.fasta -f UMI -k CELL
# result> UMI=ATGC|CELL=GGCC,ATGC|COUNT=10,2

Both the copy and merge subcommands have an --act argument which allows you to perform an action from the collapse subcommand in the same step as the copy or merge:

ParseHeaders.py merge -s reads.fasta -f UMI CELL -k INDEX --delete --act cat
# result> COUNT=10,2|INDEX=ATGCGGCC

ParseHeaders.py copy -s reads.fasta -f UMI -k CELL --act cat
# result> UMI=ATGC|CELL=GGCCATGC|COUNT=10,2

Deleting annotations

Unwanted annotations can be deleted using the delete subcommand of ParseHeaders.py:

ParseHeaders.py delete -s reads.fastq -f PRIMER

which will remove the PRIMER field from each sequence header.

Filtering, Subsetting and Converting

Cleaning or removing poor quality sequences

Data sets can be cleaned using one or more invocations of FilterSeq.py, which provides multiple sequence quality control operations. Four subcommands remove sequences from the data that fail to meet some threshold: including length, (length), number of N or gap characters (missing), homopolymeric tract length (repeats), or mean Phred quality score (quality). Two subcommands modify sequences without removing them: trimqual truncates the sequences when the mean Phred quality scores decays under a threshold, and maskqual replaces positions with low Phred quality scores with N characters.

FilterSeq.py provides the following quality control subcommands:

Subcommand

Operation

length

Removes short sequences

missing

Removes sequences with too many Ns or gaps

repeats

Removes sequences with long homopolymeric tracts

quality

Removes sequences with low mean quality scores

trimqual

Truncates sequences where quality scores decay

maskqual

Masks low quality positions

Subsetting sequence files by annotation

The group subcommand of SplitSeq.py allows you to split one file into multiple files based on the values in a sequence annotation. For example, splitting one file with multiple SAMPLE annotations into separate files (one for each sample) would be accomplished by:

SplitSeq.py group -s reads.fastq -f SAMPLE

which will create a set of files labelled SAMPLE-M1 and SAMPLE-M2 if samples are named M1 and M2.

If you wanted to split based on a numeric value, rather than a set of categorical values, then you would add the --num argument. SplitSeq.py would then create two files: one containing sequences with values less than the threshold specified by the --num argument and one file containing sequences with values greater than or equal to the threshold:

SplitSeq.py group -s reads.fastq -f DUPCOUNT --num 2

which will create two files with the labels atleast-2 and under-2.

Random sampling from sequence files

The sample subcommand of SplitSeq.py may be used to generate a random sample from a sequence file or set of pair-end files. The example below will select a random sample of 1,000 sequences (-n 1000) which all contain the annotation SAMPLE=M1 (-f SAMPLE and -u M1):

SplitSeq.py sample -s reads.fastq -f SAMPLE -u M1 -n 1000

Performing an analogous sampling of Illumina paired-end reads would be accomplished using the samplepair subcommand:

SplitSeq.py samplepair -s reads.fastq -f SAMPLE -u M1 -n 1000 --coord illumina

Note

Both the -f and -n arguments will accept a list of values (eg, -n 1000 100 10), allowing you to sample multiple times from multiple files in one command.

Converting to FASTA for IMGT/HighV-QUEST or IgBLAST

IMGT/HighV-QUEST and IgBLAST both require sequences in FASTA format. You can use the Immcantation script fastq2fasta.py to convert .fastq to .fasta. The script is available from the repository and is pre-installed in the Docker container.

fastq2fasta.py reads.fastq

Alternatively, you can request SplitSeq.py to output FASTA files by using the flag --fasta. In the example workflows, a common last step in the data processing pipelines is filtering sequences with at least two representative reads. By adding --fasta to the command, the output file will be a .fasta file.

SplitSeq.py group -s reads.fastq -f DUPCOUNT --num 2 --fasta

Reducing file size for submission to IMGT/HighV-QUEST

IMGT/HighV-QUEST currently limits the size of uploaded files to 500,000 sequences. To accomodate this limit, you can use the count subcommand of SplitSeq.py to divide your files into small pieces.

SplitSeq.py count -s reads.fastq -n 500000 --fasta

The -n 500000 argument sets the maximum number of sequences in each file and the --fasta tells the tool to output a FASTA, rather than FASTQ, formatted file.

Note

You can usually avoid the necessity of reducing file sizes by removing duplicate sequences first using the CollapseSeq.py tool.

Isotype and Primer Annotations

Assigning isotype annotations from the constant region sequence

MaskPrimers.py is usually used to remove primer regions and annotate sequences with primer identifiers. However, it can be used for any other case where you need to align a set of short sequences against the reads. One example of an alternate use is where you either do not know the C-region primer sequences or do not trust the primer region to provide an accurate isotype assignment.

If you build a FASTA file containing the reverse-complement of short sequences from the front of CH-1, then you can annotate the reads with these sequence in the same way you would C-region specific primers:

MaskPrimers.py align -s reads.fastq -p IGHC.fasta --maxlen 100 --maxerror 0.3 \
    --mode cut --revpr --pf C_CALL

where --revpr tells MaskPrimers.py to reverse-complement the “primer” sequences and look for them at the end of the reads, --maxlen 100 restricts the search to the last 100 bp, --maxerror 0.3 allows for up to 30% mismatches, and -p IGHC.fasta specifies the file containing the CH-1 sequences. The name of the C-region will be added to the sequence headers as the C_CALL annotation, where the field name is specified by the --pf argument. An example CH-1 sequence file would look like:

>IGHD
CTGATATGATGGGGAACACATCCGGAGCCTTGGTGGGTGC
>IGHM
AGGAGACGAGGGGGAAAAGGGTTGGGGCGGATGCACTCCC
>IGHG
AGGGYGCCAGGGGGAAGACSGATGGGCCCTTGGTGGAAGC
>IGHA
MGAGGCTCAGCGGGAAGACCTTGGGGCTGGTCGGGGATGC
>IGHE
AGCGGGTCAAGGGGAAGACGGATGGGCTCTGTGTGGAGGC

Download IGHC.fasta

See also

Constant region reference sequences may be downloaded from IMGT and the sequence headers can be reformated using the imgt subcommand of ConvertHeaders.py. Note, you may need to clean-up the reference sequences a bit before running ConvertHeaders.py if you receive an error about duplicate sequence names (e.g., remove duplicate allele with different artificial splicing). To cut and reverse-complement the constant region sequences, use something like seqmagick.

Fixing Assembly Problems

Assembling paired-end reads that do not overlap

The typical way to assemble paired-end reads is via de novo assembly using the align subcommand of AssemblePairs.py. However, some sequences with long CDR3 regions may fail to assemble due to insufficient or completely absent overlaps between the mate-pairs. The reference or sequential subcommands can be used to assemble mate-pairs that do not overlap using the ungapped V-segment references sequences as a guide.

To handle such sequences in two separate steps, a normal align command would be performed first. The --failed argument is added so that the reads failing de novo alignment are output to separate files:

AssemblePairs.py align -1 reads-1.fastq -2 reads-2.fastq --rc tail \
    --coord illumina --failed -outname align

Then the files labeled assemble-fail, along with the ungapped V-segment reference sequences (-r vref.fasta), would be input into the reference subcommand of AssemblePairs.py:

AssemblePairs.py reference -1 align-1_assemble-fail.fastq -2 align-2_assemble-fail.fastq \
     --rc tail -r vref.fasta --coord illumina --outname ref

This will result in two separate assemble-pass files - one from each step. You may process them separately or concatenate them together into a single file:

cat align_assemble-pass.fastq ref_assemble-pass.fastq > merged_assemble-pass.fastq

However, if you intend to processes them together, you may simplify this by perform both steps using the sequential subcommand, which will attempt de novo assembly followed by reference guided assembly if de novo assembly fails:

AssemblePairs.py sequential -1 reads-1.fastq -2 reads-2.fastq --rc tail \
    --coord illumina -r vref.fasta

Note

The sequences output by the reference or sequential subcommands may contain an appropriate length spacer of Ns between any mate-pairs that do not overlap. The --fill argument can be specified to force AssemblePairs.py to insert the germline sequence into the missing positions, but this should be used with caution as the inserted sequence may not be biologically correct.

Fixing UMI Problems

Correcting misaligned V-segment primers and indels in UMI groups

Before generating a consensus for a set of reads sharing a UMI barcode, the sequences must be properly aligned. Sequences may not be aligned if more than one PCR primer is identified in a UMI read group - leading to variations in the the start positions of the reads. Ideally, each set of reads originating from a single mRNA molecule should be amplified with the same primer. However, different primers in the multiplex pool may be incorporated into the same UMI read group during amplification if the primers are sufficiently similar.

_images/Primer_Alignment.svg

Correction of misaligned sequences. (A) Discrepancies in the location of primer binding (colored bases, with primer name indicated to the left) may cause misalignment of sequences sharing a UMI. (B) Following multiple alignment of the reads the non-primer regions are correctly aligned and suitable for UMI consensus generation.

This type of primer misalignment can be corrected using one of two approaches using the AlignSets.py tool.

Correcting via multiple alignment

The first approach, which is conceptually simpler but computationally more expensive, is to perform a full multiple alignment of reach UMI read group using the muscle subcommand of AlignSets.py. The --bf BARCODE argument tells AlignSets.py to multiple align reads sharing the same BARCODE annotation. The --exec ~/bin/muscle is a pointer to where the MUSCLE executable is located:

AlignSets.py muscle -s reads.fastq --bf BARCODE --exec ~/bin/muscle

The above approach will also insert gaps into the sequences where an insertion/deletion has occured in the reads. As such, you will need to provide as reasonable gap character threshold to BuildConsensus.py, such as --maxgap 0.5, defining how you want to handle positions with gap characters when generating a UMI consensus sequence.

Note

Using the muscle subcommand, along with the --maxgap argument to BuildConsensus.py will also address issue with insertions/deletions in UMI read groups. Although in UMI read groups with a sufficient number of reads consensus generation will resolve insertions/deletions without the need for multiple alignment, as any misaligned reads will simply be washed out by the majority. Whether to perform a multiple alignment prior to consensus generation is a matter of taste. A multiple alignment may improve consensus quality in small UMI read groups (eg, less than 4 sequences), but the extent to which small UMI read groups should be trusted is debatable.

Correcting via an offset table

The second approach will correct only the primer regions and will not address insertions/deletions within the sequence, but is much quicker to perform. The first step involves creation of a primer offset table using the table subcommand of AlignSets.py:

AlignSets.py table -p primers.fasta --exec ~/bin/muscle

which performs a multiple alignment on sequences in primers.fasta (sequences shown in the primer alignment figure above) to generate a file containing a primer offset table:

primers_offsets-forward.tab
VP1    2
VP2    0
VP3    1

Then the offset table can be input into the offset subcommand of AlignSets.py to align the reads:

AlignSets.py offset -s reads.fastq -d primers_offsets-forward.tab \
    --bf BARCODE --pr VPRIMER --mode pad

In the above command we have specified the field containing the primer annotation using --pf VPRIMER and set the behavior of the tool to add gap characters to align the reads with the --mode pad argument. These options will generate the correction shown in (B) of the primer alignment figure above. Alternatively, we could have deleted unalign positions using the argument --mode cut.

Note

You may need to alter how the offset table is generated if you have used the --mode cut argument to MaskPrimers.py rather than --mode mask, as this will cause the ends of the primer regions, rather than the front, to be the cause of the ragged edges within the UMI read groups. For primers that have been cut you would add the --reverse argument to the table operation of AlignSets.py, which will create an offset table that is based on the tail end of the primers.

Dealing with insufficient UMI diversity

Due to errors in the UMI region and/or insufficient UMI length, UMI read groups are not always homogeneous with respect to the mRNA of origin. This can cause difficulties in generating a valid UMI consensus sequence. In most cases, the --prcons and --maxerror (or --maxdiv) arguments to BuildConsensus.py are sufficient to filter out invalid reads and/or entire invalid UMI groups. However, if there is significant nucleotide diversity within UMI groups due to insufficient UMI length or low UMI diversity, the set command of the ClusterSets.py tool can help correct for this. ClusterSets.py will cluster sequence by similarity and add an additional annotation dividing sequences within a UMI read group into sub-clusters:

ClusterSets.py set -s reads.fastq -f BARCODE -k CLUSTER --exec ~/bin/usearch

The above command will add an annotation to each sequence named CLUSTER (-k CLUSTER) containing a cluster identifier for each sequence within the UMI barcode group. The -f BARCODE argument specifies the UMI annotation and --exec ~/bin/usearch is a pointer to where the USEARCH executable is located. After assigning cluster annotations via ClusterSets.py, the BARCODE and CLUSTER fields can be merged using the copy operation of ParseHeaders.py:

ParseHeaders.py copy -s reads_cluster-pass.fastq -f BARCODE -k CLUSTER --act cat

which will copy the UMI annotation (-f BARCODE) into the cluster annotation (-k CLUSTER) and concatenate them together (--act cat). Thus converting the annotations from:

>SEQ1|BARCODE=ATGTCG|CLUSTER=1
>SEQ2|BARCODE=ATGTCG|CLUSTER=2

to:

>SEQ1|BARCODE=ATGTCG|CLUSTER=1ATGTCG
>SEQ2|BARCODE=ATGTCG|CLUSTER=2ATGTCG

You may then specify --bf CLUSTER to BuildConsensus.py to tell it to generate UMI consensus sequences by UMI sub-cluster, rather than by UMI barcode annotation.

Combining split UMIs

Typically, a UMI barcode is attached to only one end of a paired-end mate-pair and can be copied to other read by a simple invocation of PairSeq.py. But in some cases, the UMI may be split such that there are two UMIs, each located on a different mate-pair. To deal with these sorts of UMIs, you would first employ PairSeq.py similarly to how you would in the single UMI case:

PairSeq.py -1 reads-1.fastq -2 reads-2.fastq --1f BARCODE --2f BARCODE \
    --coord illumina

The main difference from the single UMI case is that the BARCODE annotation is being simultaneously copied from read 1 to read 2 (--1f BARCODE) andf rom read 2 to read 1 (--2f BARCODE). This creates a set of annotations that look like:

>READ1|BARCODE=ATGTCGTT,GGCTAGTC
>READ2|BARCODE=ATGTCGTT,GGCTAGTC

Alternatively, these annotations can be combined upon copy using the --act cat argument:

PairSeq.py -1 reads-1.fastq -2 reads-2.fastq --1f BARCODE --2f BARCODE \
    --coord illumina --act cat

which concatenates the two values in the BARCODE field, yielding UMI annotations suitable for input to BuildConsensus.py:

>READ1|BARCODE=ATGTCGTTGGCTAGTC
>READ2|BARCODE=ATGTCGTTGGCTAGTC

Compensating for errors in the UMI region

Depending on the protocol used for library preparation, PCR error and sequencing error can significantly affect UMI and sequence assignments. To account for this error, the following approach can be used.

Clustering UMI sequences

First, errors in the UMI region can be accounted for by reassigning UMI groups to new clusters of UMI sequence that may differ by one or more nucleotides. To identify the ideal threshold at which to cluster similar UMI sequences, EstimateError.py can be run on the UMI field (BARCODE):

EstimateError.py barcode -s reads.fastq -f BARCODE

The -f BARCODE defines the header annotation containing the UMI sequence. This outputs the following tables:

File

Error profile

reads_distance-barcode.tab

Distribution of pairwise hamming distances

reads_threshold-barcode.tab

Recommended threshold

The value in the THRESHOLD column associated with the ALL row in reads_threshold-barcode.tab specifies a recommended threshold for clustering the UMI sequences.

Note

Subsampling at a depth to approximately 5,000 sequences is recommended to expedite this calculation. See the random task for an example of how to use SplitSeq.py to subsample sequence files.

The table specifies a threshold of 0.9 which will be used to cluster the UMI sequences via ClusterSets.py. The identity threshold is set via the argument --ident 0.9. Clustering will be performed on the sequences in the UMI annotation field (-f BARCODE) and UMI clusters will assigned to the annotation field INDEX_UMI via the argument -k INDEX_UMI:

ClusterSets.py barcode -s reads.fastq -f BARCODE -k INDEX_UMI --ident 0.9

Clustering V(D)J sequences

Next, sequences within these larger UMI clusters are clustered to avoid sequence collisions. Again, EstimateError.py is used to infer a clustering threshold, but instead of clustering UMI sequences the set subcommand is used to cluster the reads (V(D)J sequences) within the newly assigned UMI clusters (-f INDEX_UMI):

EstimateError.py set -s reads_cluster-pass.fastq -f INDEX_UMI

This outputs the following tables:

File

Error profile

reads_cluster-pass_distance-set.tab

Distribution of pairwise hamming distances

reads_cluster-pass_threshold-set.tab

Recommended threshold

The value in the THRESHOLD column associated with the ALL row in reads_cluster-pass_threshold-set.tab specifies a recommended threshold for resolving collisions.

Note

Subsampling at a depth to approximately 5,000 sequences is recommended to expedite this calculation. See the random task for an example of how to use SplitSeq.py to subsample sequence files.

Using a recommended threshold of 0.8, V(D)J sequences are clustering in a similar way using the set subcommand of ClusterSets.py:

ClusterSets.py set -s reads_cluster-pass.fastq -f INDEX_UMI -k INDEX_SEQ --ident 0.8

where the argument --ident 0.8 specifies the clustering threshold, -f INDEX_UMI defines the UMI cluster group to cluster within, and -k INDEX_SEQ defines the V(D)J sequence cluster annotation to add to the output headers.

Combining the UMI and V(D)J cluster annotations

Finally, new UMI groups can be generated by combining the two annotation fields generated during the clustering steps with the merge subcommand of ParseHeaders.py. The -f INDEX_UMI INDEX_SEQ argument defines the fields to combine and the -k INDEX_MERGE argument defines the new field that will contain the corrected UMI clusters used for consensus generation:

ParseHeaders.py merge -s reads_cluster-pass_cluster-pass.fastq -f INDEX_UMI INDEX_SEQ \
    -k INDEX_MERGE

This combined UMI-V(D)J sequence cluster annotation can then be specified as the barcode field to BuildConsensus.py using the --bf INDEX_MERGE argument.

Estimating sequencing and PCR error rates with UMI data

The EstimateError.py tool provides methods for estimating the combined PCR and sequencing error rates from large UMI read groups. The assumptions being, that consensus sequences generated from sufficiently large UMI read groups should be accurate representations of the true sequences, and that the rate of mismatches from consensus should therefore be an accurate estimate of the error rate in the data. However, this is not guaranteed to be true, hence this approach can only be considered an estimate of a data set’s error profile. The following command generates an error profile from UMI read groups with 50 or more sequences (-n 50), using a majority rule consensus sequence (--mode freq), and excluding UMI read groups with high nucleotide diversity (--maxdiv 0.1):

EstimateError.py -s reads.fastq -n 50 --mode freq --maxdiv 0.1

This generates the following tab-delimited files containing error rates broken down by various criteria:

File

Error profile

reads_error-position.tab

Error rates by read position

reads_error-quality.tab

Error rates by quality score

reads_error-nucleotide.tab

Error rates by nucleotide identity

reads_error-set.tab

Error rates by UMI read group size

Contact

If you have questions, you can email the Immcantation Group.

If you’ve discovered a bug or have a feature request, you can create an issue on Bitbucket using the Issue Tracker.

Citation

To cite pRESTO in publications please use:

Vander Heiden JA*, Yaari G*, Uduman M, Stern JNH, O’Connor KC, Hafler DA, Vigneault F, Kleinstein SH. pRESTO: a toolkit for processing high-throughput sequencing raw reads of lymphocyte receptor repertoires. Bioinformatics 2014; doi: 10.1093/bioinformatics/btu138

License

This work is licensed under the GNU Affero General Public License Version 3 (AGPL-3).