Overview

Scope and Features

pRESTO performs all stages of raw sequence processing prior to alignment against reference germline sequences. The toolkit is intended to be easy to use, but some familiarity with command-line applications is expected. Rather than providing a fixed solution to a small number of common workflows, we have designed pRESTO to be as flexible as possible. This design philosophy makes pRESTO suitable for many existing protocols and adaptable to future technologies, but it requires users to construct a sequence of commands and options specific to their experimental protocol.

pRESTO is composed of a set of standalone tools to perform specific tasks, often with a series of subcommands providing different behaviors. A brief description of each tool is shown in the table below.

Tool               Subcommand  Description
-----------------  ----------  ---------------------------------------------------------------
AlignSets.py                   Performs multiple alignment of sets of sequences sharing the same annotation
                   muscle      Uses the program MUSCLE to align reads
                   offset      Uses a table of primer alignments to align the 5’ region
                   table       Creates a table of primer alignments for the offset subcommand
AssemblePairs.py               Assembles paired-end reads into a complete sequence
                   align       Assembles paired-end reads by aligning the sequence ends
                   join        Concatenates paired-end reads with intervening gaps
                   reference   Assembles paired-end reads using V-segment references
                   sequential  Attempts alignment assembly followed by reference assembly
BuildConsensus.py              Constructs UMI consensus sequences
ClusterSets.py                 Clusters read groups
                   all         Clusters all sequences regardless of annotation
                   barcode     Clusters reads by clustering barcode sequences
                   set         Clusters reads by sequence data within barcode groups
CollapseSeq.py                 Removes duplicate sequences
ConvertHeaders.py              Converts sequence headers to the pRESTO format
                   454         Converts Roche 454 sequence headers
                   genbank     Converts NCBI GenBank and RefSeq sequence headers
                   generic     Converts sequence headers with an unknown annotation system
                   illumina    Converts Illumina sequence headers
                   imgt        Converts sequence headers output by IMGT/GENE-DB
                   migec       Converts sequence headers output by MIGEC
                   sra         Converts NCBI SRA or EMBL-EBI ENA sequence headers
EstimateError.py               Estimates error rates for UMI data
                   barcode     Calculates pairwise distance metrics of barcode sequences
                   set         Estimates error statistics within annotation sets
FilterSeq.py                   Removes or modifies low quality reads
                   length      Removes sequences under a defined length
                   maskqual    Masks low Phred quality score positions with Ns
                   missing     Removes sequences with a high number of Ns
                   quality     Removes sequences with low Phred quality scores
                   repeats     Removes sequences with long repeats of a single nucleotide
                   trimqual    Trims sequences to segments with high Phred quality scores
MaskPrimers.py                 Identifies and removes primer regions, MIDs and UMI barcodes
                   align       Matches primers by local alignment and reorients sequences
                   extract     Removes and annotates a fixed sequence region
                   score       Matches primers at a fixed user-defined start position
PairSeq.py                     Sorts paired-end reads and copies annotations between them
ParseHeaders.py                Manipulates sequence annotations
                   add         Adds a field and value annotation pair to all reads
                   collapse    Compresses a set of annotation fields into a single field
                   copy        Copies values between annotation fields
                   delete      Deletes an annotation from all reads
                   expand      Expands a field with multiple values into separate annotations
                   merge       Merges multiple annotation fields into a single field
                   rename      Renames annotation fields
                   table       Outputs sequence annotations as a data table
ParseLog.py                    Converts the log output of pRESTO scripts into data tables
SplitSeq.py                    Performs conversion, sorting, and subsetting of sequence files
                   count       Splits files into smaller files
                   group       Splits files based on numerical or categorical annotation
                   sample      Randomly samples sequences from a file
                   samplepair  Randomly samples paired-end reads from two files
                   select      Filters sequences based on annotations
                   sort        Sorts sequences based on annotations
UnifyHeaders.py                Unifies annotation fields based on grouping scheme
                   consensus   Reassigns fields to consensus values
                   delete      Deletes sequences with differing field values

Input and Output

All tools take as input standard FASTA or FASTQ formatted files and output files in the same formats. This allows pRESTO to work seamlessly with other sequence processing tools that use either of these data formats; any steps within a pRESTO workflow can be exchanged for an alternate tool, if desired.

Each tool appends a specific suffix to its output files describing the step and output. For example, MaskPrimers will append _primers-pass to the output file containing successfully aligned sequences and _primers-fail to the file containing unaligned sequences.
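The naming convention can be illustrated with a short Python sketch. It assumes that the suffix is placed between the input file's base name and its extension; the helper names `suffixed_name` and `pass_fail_names` are hypothetical and are not part of pRESTO. The Commandline Usage documentation for each tool remains the authority on the exact suffixes.

```python
from pathlib import PurePosixPath


def suffixed_name(input_file, suffix):
    """Insert a step suffix between the base name and the file extension.

    Assumption: the suffix goes directly before the extension, so
    reads.fastq plus _primers-pass gives reads_primers-pass.fastq.
    """
    path = PurePosixPath(input_file)
    return str(path.with_name(f"{path.stem}{suffix}{path.suffix}"))


def pass_fail_names(input_file, step):
    """Return the (pass, fail) output names for a processing step label."""
    return (
        suffixed_name(input_file, f"_{step}-pass"),
        suffixed_name(input_file, f"_{step}-fail"),
    )


passed, failed = pass_fail_names("reads.fastq", "primers")
print(passed)  # reads_primers-pass.fastq
print(failed)  # reads_primers-fail.fastq
```

A helper like this may be convenient when scripting a multi-step workflow, because the output name of one step becomes the input name of the next.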

See also

Details regarding the suffixes used by pRESTO tools can be found in the Commandline Usage documentation for each tool.

Annotation Scheme

Most pRESTO tools add and manipulate sequence-specific annotations as part of their processing, using the scheme shown below. Annotations are delimited by a reserved character (| by default), each field name is separated from its values by a second reserved character (= by default), and multiple values within a field are separated by a third reserved character (, by default). The annotations follow the sequence identifier, which itself immediately follows the > (FASTA) or @ (FASTQ) symbol that begins a new sequence entry. The sequence identifier is given the reserved field name ID. To mitigate potential analysis errors, each pRESTO tool appends values to an annotation field when that field already exists, and will not overwrite or delete annotations unless explicitly instructed to do so through the ParseHeaders tool. All reserved characters can be redefined using command-line options.

FASTA Annotation

>SEQUENCE_ID|PRIMER=IgHV-6,IgHC-M|BARCODE=DAY7|DUPCOUNT=8
NNNNCCACGATTGGTGAAGCCCTCGCAGACCCTCTCACTCACCTGTGCCATCTCCGGGGACAGTGTTTCTACCAAAA

FASTQ Annotation

@SEQUENCE_ID|PRIMER=IgHV-6,IgHC-M|BARCODE=DAY7|DUPCOUNT=8
NNNNCCACGATTGGTGAAGCCCTCGCAGACCCTCTCACTCACCTGTGCCATCTCCGGGGACAGTGTTTCTACCAAAA
+
!!!!nmoomllmlooj\Xlnngookkikloommononnoonnomnnlomononoojlmmkiklonooooooooomoo
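
For readers who want to handle these headers in their own scripts, the scheme can be sketched in a few lines of Python. This is a minimal illustration of the default delimiters described above, not pRESTO's own implementation; the function names `parse_header`, `add_annotation` and `format_header` are hypothetical. The `add_annotation` helper mirrors the described behavior of appending to an existing field rather than overwriting it.

```python
def parse_header(header, delimiter=("|", "=", ",")):
    """Parse an annotated header line into a dict of field -> list of values."""
    field_sep, value_sep, list_sep = delimiter
    header = header.strip()
    # Drop the leading > (FASTA) or @ (FASTQ) entry marker, if present.
    if header[:1] in (">", "@"):
        header = header[1:]
    parts = header.split(field_sep)
    # The sequence identifier comes first and gets the reserved name ID.
    fields = {"ID": [parts[0]]}
    for part in parts[1:]:
        name, _, value = part.partition(value_sep)
        fields.setdefault(name, []).extend(value.split(list_sep))
    return fields


def add_annotation(fields, name, value):
    """Append a value to a field, keeping any values already present."""
    fields.setdefault(name, []).append(str(value))
    return fields


def format_header(fields, delimiter=("|", "=", ",")):
    """Rebuild the header text (without the > or @ marker) from parsed fields."""
    field_sep, value_sep, list_sep = delimiter
    parts = [fields["ID"][0]]
    for name, values in fields.items():
        if name != "ID":
            parts.append(name + value_sep + list_sep.join(values))
    return field_sep.join(parts)


fields = parse_header(">SEQUENCE_ID|PRIMER=IgHV-6,IgHC-M|BARCODE=DAY7|DUPCOUNT=8")
print(fields["PRIMER"])  # ['IgHV-6', 'IgHC-M']

add_annotation(fields, "BARCODE", "DAY14")
print(format_header(fields))
# SEQUENCE_ID|PRIMER=IgHV-6,IgHC-M|BARCODE=DAY7,DAY14|DUPCOUNT=8
```

All values are kept as strings, so a numeric field such as DUPCOUNT needs an explicit conversion before arithmetic. Passing a different `delimiter` tuple corresponds to redefining the reserved characters.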

See also

  • Details regarding the annotations added by pRESTO tools can be found in the Commandline Usage documentation for each tool.

  • The ParseHeaders.py tool provides a number of options for manipulating annotations in the pRESTO format.

  • The ConvertHeaders.py tool allows you to convert several common annotation schemes into the pRESTO annotation format.