presto.Annotation

Annotation functions

presto.Annotation.addHeader(header, fields, values, delimiter=('|', '=', ', '))

Adds fields and values to a sequence header

Parameters:
  • header – an annotation dictionary returned by parseAnnotation.
  • fields – the list of fields to add or append to.
  • values – the list of annotation values to add for each field.
  • delimiter – a tuple of delimiters for (fields, values, value lists).
Returns:

modified header dictionary.

Return type:

dict

presto.Annotation.annotationConsensus(seq_iter, field, delimiter=('|', '=', ', '))

Calculate a consensus annotation for a set of sequences

Parameters:
  • seq_iter – an iterator or list of SeqRecord objects
  • field – the annotation field to take a consensus of
  • delimiter – a tuple of delimiters for (annotations, field/values, value lists)
Returns:

Dictionary with keys: set (a list of unique annotation values), count (annotation counts), cons (the consensus annotation), and freq (the majority annotation frequency).

Return type:

dict
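
The four keys of the returned dictionary can be illustrated with a small standalone sketch. It works on a plain list of annotation values rather than SeqRecord objects, and `annotation_consensus_sketch` is a hypothetical helper, not the pRESTO implementation; the exact shape of the `set` and `count` entries in the real return value may differ.

```python
from collections import Counter


def annotation_consensus_sketch(values):
    """Summarize a list of annotation values (hypothetical helper, not pRESTO code)."""
    counts = Counter(values)
    # most_common(1) yields the value with the highest count
    cons, top_count = counts.most_common(1)[0]
    return {
        'set': sorted(counts),            # unique annotation values
        'count': dict(counts),            # occurrences of each value
        'cons': cons,                     # majority (consensus) value
        'freq': top_count / len(values),  # fraction of values equal to the consensus
    }


result = annotation_consensus_sketch(['IGHM', 'IGHM', 'IGHG'])
print(result['cons'], round(result['freq'], 2))
```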

presto.Annotation.collapseAnnotation(ann_dict, action, fields=None, delimiter=('|', '=', ', '))

Collapses multiple annotations into new single annotations for each field

Parameters:
  • ann_dict – dictionary of field/value pairs
  • action – collapse action to take; one of {min, max, sum, first, last, set, cat}
  • fields – subset of ann_dict to collapse; if None, collapse all but the ID field
  • delimiter – Tuple of delimiters for (fields, values, value lists)
Returns:

Modified field dictionary

Return type:

OrderedDict
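
As a rough illustration of the seven collapse actions, the sketch below applies each one to a list of values for a single field. It assumes that min, max and sum treat the values as numbers, that set keeps unique values as a delimited list, and that cat joins the values into one string; `collapse_values_sketch` is a hypothetical helper and the library's actual return types may differ.

```python
def collapse_values_sketch(values, action, sep=','):
    """Collapse a list of string values with one action (hypothetical helper)."""
    numeric_actions = {'min': min, 'max': max, 'sum': sum}
    if action in numeric_actions:
        # Assumption: numeric actions operate on float-converted values
        return numeric_actions[action]([float(v) for v in values])
    if action == 'first':
        return values[0]
    if action == 'last':
        return values[-1]
    if action == 'set':
        # dict.fromkeys drops duplicates while preserving first-seen order
        return sep.join(dict.fromkeys(values))
    if action == 'cat':
        return ''.join(values)
    raise ValueError('unknown action: %s' % action)


counts = ['3', '5', '5']
actions = ('min', 'max', 'sum', 'first', 'last', 'set', 'cat')
collapsed = {action: collapse_values_sketch(counts, action) for action in actions}
print(collapsed)
```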

presto.Annotation.collapseHeader(header, fields, actions, delimiter=('|', '=', ', '))

Collapses a sequence header

Parameters:
  • header – an annotation dictionary returned by parseAnnotation.
  • fields – the list of fields to collapse.
  • actions – the list of collapse actions to take; one of (max, min, sum, first, last, set, cat) for each field.
  • delimiter – a tuple of delimiters for (fields, values, value lists).
Returns:

modified header dictionary.

Return type:

dict

presto.Annotation.convert454Header(desc)

Parses 454 headers into the pRESTO format

Parameters: desc (str) – a sequence description string.
Returns: a dictionary of header field and value pairs.
Return type: dict

Examples

_New style 454 header_
  @<accession> <length=##>
  @GXGJ56Z01AE06X length=222

_Old style 454 header_
  @<rank_x_y> <length=##> <uaccno=accession>
  @000034_0199_0169 length=437 uaccno=GNDG01201ARRCR
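
Both 454 layouts are a leading identifier followed by space-separated key=value tokens, so a minimal parsing sketch needs only string splitting. The helper name `parse_454_sketch` and the upper-cased output keys are assumptions made for illustration; the field names that convert454Header actually emits are not specified here.

```python
def parse_454_sketch(desc):
    """Split a 454 description into an ID plus key=value fields (hypothetical helper)."""
    tokens = desc.lstrip('@>').split()
    fields = {'ID': tokens[0]}
    for token in tokens[1:]:
        key, sep, value = token.partition('=')
        if sep:  # keep only well-formed key=value tokens
            fields[key.upper()] = value
    return fields


new_style = parse_454_sketch('@GXGJ56Z01AE06X length=222')
old_style = parse_454_sketch('@000034_0199_0169 length=437 uaccno=GNDG01201ARRCR')
print(new_style)
print(old_style)
```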

presto.Annotation.convertGenbankHeader(desc, delimiter=('|', '=', ', '))

Converts GenBank and RefSeq headers into the pRESTO format

Parameters:
  • desc (str) – a sequence description string.
  • delimiter (tuple) – a tuple of delimiters for (fields, values, value lists).
Returns:

a dictionary of header field and value pairs.

Return type:

dict

Examples

_New style GenBank header_
  <accession>.<version> <description>
  >CM000663.2 Homo sapiens chromosome 1, GRCh38 reference primary assembly

_Old style GenBank header_
  gi|<GI record number>|<dbsrc>|<accession>.<version>|<description>
  >gi|568336023|gb|CM000663.2| Homo sapiens chromosome 1, GRCh38 reference primary assembly

presto.Annotation.convertGenericHeader(desc, delimiter=('|', '=', ', '))

Converts any header to the pRESTO format

Parameters:
  • desc (str) – a sequence description string.
  • delimiter (tuple) – a tuple of delimiters for (fields, values, value lists).
Returns:

a dictionary of header field and value pairs.

Return type:

dict

presto.Annotation.convertIMGTHeader(desc, simple=False)

Converts germline headers from IMGT/GENE-DB into the pRESTO format

Parameters:
  • desc (str) – a sequence description string.
  • simple (bool) – if True then the header will be converted to only the allele name.
Returns:

a dictionary of header field and value pairs.

Return type:

dict

Examples

_IMGT header_

Header specifications from http://imgt.org/genedb. The FASTA header contains 15 fields separated by ‘|’:

  1. IMGT/LIGM-DB accession number(s)
  2. gene and allele name
  3. species
  4. functionality
  5. exon(s), region name(s), or extracted label(s)
  6. start and end positions in the IMGT/LIGM-DB accession number(s)
  7. number of nucleotides in the IMGT/LIGM-DB accession number(s)
  8. codon start, or ‘NR’ (not relevant) for non coding labels and out-of-frame pseudogenes
  9. +n: number of nucleotides (nt) added in 5’ compared to the corresponding label extracted from IMGT/LIGM-DB
  10. +n or -n: number of nucleotides (nt) added or removed in 3’ compared to the corresponding label extracted from IMGT/LIGM-DB
  11. +n, -n, and/or nS: number of added, deleted, and/or substituted nucleotides to correct sequencing errors, or ‘not corrected’ if non corrected sequencing errors
  12. number of amino acids (AA): this field indicates that the sequence is in amino acids
  13. number of characters in the sequence: nt (or AA)+IMGT gaps=total
  14. partial (if it is)
  15. reverse complementary (if it is)

>X60503|IGHV1-18*02|Homo sapiens|F|V-REGION|142..417|276 nt|1| | | | |276+24=300|partial in 3'| |
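
The 15-field layout can be read by splitting on ‘|’ and pairing each value with its position. The sketch below does that for the example header; the key names in `IMGT_FIELDS`, the `ID` key used for the simple case, and the helper `parse_imgt_sketch` are hypothetical labels chosen for illustration, not the keys convertIMGTHeader returns.

```python
# Hypothetical labels for the 15 positional fields listed in the specification
IMGT_FIELDS = [
    'accession', 'allele', 'species', 'functionality', 'region',
    'positions', 'nucleotides', 'codon_start', 'added_5prime',
    'changed_3prime', 'corrections', 'amino_acids', 'characters',
    'partial', 'reverse_complement',
]


def parse_imgt_sketch(desc, simple=False):
    """Map an IMGT/GENE-DB header onto positional field names (hypothetical helper)."""
    values = [v.strip() for v in desc.lstrip('>').split('|')]
    # zip stops at 15 names, dropping the empty piece after the trailing '|'
    record = dict(zip(IMGT_FIELDS, values))
    if simple:
        return {'ID': record['allele']}  # assumption: simple mode keeps the allele only
    return record


header = ">X60503|IGHV1-18*02|Homo sapiens|F|V-REGION|142..417|276 nt|1| | | | |276+24=300|partial in 3'| |"
record = parse_imgt_sketch(header)
simple_record = parse_imgt_sketch(header, simple=True)
print(record['allele'], record['characters'])
```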

presto.Annotation.convertIlluminaHeader(desc)

Converts Illumina headers into the pRESTO format

Parameters: desc (str) – a sequence description string.
Returns: a dictionary of header field and value pairs.
Return type: dict

Examples

_New style Illumina header_
  @<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read number>:<is filtered>:<control number>:<index sequence>
  @MISEQ:132:000000000-A2F3U:1:1101:14340:1555 2:N:0:ATCACG

_Old style Illumina header_
  @<instrument>:<flowcell lane>:<tile>:<x-pos>:<y-pos>#<index sequence>/<read number>
  @HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
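
The two Illumina layouts differ in where the read number and index sequence sit, which a pair of regular expressions can capture. This is a minimal sketch under the assumption that the headers follow the templates exactly; the pattern names, the helper `parse_illumina_sketch` and the lower-case output keys are hypothetical and do not reflect the keys convertIlluminaHeader produces.

```python
import re

# New style: coordinates, a space, then read:filtered:control:index
NEW_STYLE = re.compile(
    r'^@?(?P<id>\S+) (?P<read>\d+):(?P<filtered>[YN]):(?P<control>\d+):(?P<index>\S*)$'
)
# Old style: coordinates, optional #index, optional /read
OLD_STYLE = re.compile(
    r'^@?(?P<id>[^#/\s]+)(?:#(?P<index>[^/\s]+))?(?:/(?P<read>\d+))?$'
)


def parse_illumina_sketch(desc):
    """Return the named parts of an Illumina header (hypothetical helper)."""
    for pattern in (NEW_STYLE, OLD_STYLE):
        match = pattern.match(desc)
        if match:
            return {k: v for k, v in match.groupdict().items() if v is not None}
    return {'id': desc.lstrip('@')}  # fall back to the raw description


new_header = parse_illumina_sketch('@MISEQ:132:000000000-A2F3U:1:1101:14340:1555 2:N:0:ATCACG')
old_header = parse_illumina_sketch('@HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1')
print(new_header)
print(old_header)
```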

presto.Annotation.convertMIGECHeader(desc)

Parses headers from the MIGEC tool into the pRESTO format

Parameters: desc (str) – a sequence description string.
Returns: a dictionary of header field and value pairs.
Return type: dict

Examples

_MIGEC header_
  @MIG UMI:<UMI sequence>:<consensus read count>
  @MIG UMI:TCGGCCAACAAA:8

presto.Annotation.convertSRAHeader(desc)

Parses NCBI SRA or EMBL-EBI ENA headers into the pRESTO format

Parameters: desc (str) – a sequence description string.
Returns: a dictionary of header field and value pairs.
Return type: dict

Examples

_Header from fastq-dump --split-files_
  @<accession>.<spot> <original sequence description> <length=#>
  @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
  @SRR1383326.1 1 length=250

_Header from fastq-dump --split-files -I_
  @<accession>.<spot>.<read number> <original sequence description> <length=#>
  @SRR1383326.1.1 1 length=250

_Header from ENA_
  @<accession>.<spot> <original sequence description>
  @ERR220397.1 HKSQ1MM01DXT2W/3
  @ERR346596.1 BS-DSFCONTROL04:4:000000000-A3F0Y:1:1101:12758:1640/1
  @ERR346596.1 BS-DSFCONTROL04:4:000000000-A3F0Y:1:1101:12758:1640/2
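
The three SRA/ENA layouts share an accession and spot number, with an optional read number, original description and length. A single regular expression with optional groups can cover all of them, as sketched here. The pattern, the helper `parse_sra_sketch` and the output key names are hypothetical; the sketch also assumes three-letter accession prefixes such as SRR or ERR.

```python
import re

SRA_PATTERN = re.compile(
    r'^@?(?P<accession>[A-Z]{3}\d+)\.(?P<spot>\d+)'  # accession and spot
    r'(?:\.(?P<read>\d+))?'                          # optional read number (-I)
    r'(?: (?P<description>.*?))?'                    # optional original description
    r'(?: length=(?P<length>\d+))?$'                 # optional length field
)


def parse_sra_sketch(desc):
    """Return the named parts of an SRA or ENA header (hypothetical helper)."""
    match = SRA_PATTERN.match(desc)
    if match is None:
        raise ValueError('not an SRA/ENA style header: %r' % desc)
    return {k: v for k, v in match.groupdict().items() if v is not None}


split_files = parse_sra_sketch('@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36')
split_files_i = parse_sra_sketch('@SRR1383326.1.1 1 length=250')
ena = parse_sra_sketch('@ERR220397.1 HKSQ1MM01DXT2W/3')
print(split_files)
print(split_files_i)
print(ena)
```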

presto.Annotation.copyHeader(header, fields, names, actions=None, delimiter=('|', '=', ', '))

Copies fields in a sequence header

Parameters:
  • header – an annotation dictionary returned by parseAnnotation.
  • fields – a list of the field names to copy.
  • names – a list of the new field names.
  • actions – the list of collapse actions to take after the copy; one of (max, min, sum, first, last, set, cat) for each field.
  • delimiter – a tuple of delimiters for (fields, values, value lists).
Returns:

modified header dictionary.

Return type:

dict

presto.Annotation.deleteHeader(header, fields, delimiter=('|', '=', ', '))

Deletes fields from a sequence header

Parameters:
  • header – an annotation dictionary returned by parseAnnotation.
  • fields – the list of fields to delete.
  • delimiter – a tuple of delimiters for (fields, values, value lists).
Returns:

modified header dictionary

Return type:

dict

presto.Annotation.expandHeader(header, fields, separator=', ', delimiter=('|', '=', ', '))

Splits an annotation value into separate fields in a sequence header

Parameters:
  • header – an annotation dictionary returned by parseAnnotation.
  • fields – the field to split.
  • separator – the delimiter to split the values by.
  • delimiter – a tuple of delimiters for (fields, values, value lists).
Returns:

modified header dictionary.

Return type:

dict

presto.Annotation.flattenAnnotation(ann_dict, delimiter=('|', '=', ', '))

Converts annotations from a dictionary to a FASTA/FASTQ sequence description

Parameters:
  • ann_dict – Dictionary of field/value pairs
  • delimiter – Tuple of delimiters for (fields, values, value lists)
Returns:

Formatted sequence description string

Return type:

str
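
A minimal sketch of the flattening step, assuming the description starts with the sequence identifier stored under an `ID` key and continues with FIELD=VALUE pairs. `flatten_annotation_sketch` is a hypothetical stand-in rather than the library code, and it uses ',' without a trailing space as the value-list delimiter.

```python
from collections import OrderedDict


def flatten_annotation_sketch(ann_dict, delimiter=('|', '=', ',')):
    """Join an annotation dictionary into a description string (hypothetical helper)."""
    parts = [str(ann_dict['ID'])]  # assumption: the ID always comes first
    for field, value in ann_dict.items():
        if field == 'ID':
            continue
        if isinstance(value, (list, tuple)):
            # Lists of values are joined with the value-list delimiter
            value = delimiter[2].join(str(v) for v in value)
        parts.append('%s%s%s' % (field, delimiter[1], value))
    return delimiter[0].join(parts)


annotations = OrderedDict([('ID', 'SEQ1'), ('PRIMER', 'IGHJ'), ('COUNT', [3, 5])])
description = flatten_annotation_sketch(annotations)
print(description)
```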

presto.Annotation.getAnnotationValues(seq_iter, field, unique=False, delimiter=('|', '=', ', '))

Gets the annotation values for a field in a sequence set, optionally reduced to the unique values

Parameters:
  • seq_iter – Iterator or list of SeqRecord objects
  • field – Annotation field to retrieve values for
  • unique – If True return a list of only the unique values; if False return a list of all values
  • delimiter – Tuple of delimiters for (fields, values, value lists)
Returns:

List of values for the field

Return type:

list

presto.Annotation.getCoordKey(header, coord_type='presto', delimiter=('|', '=', ', '))

Return the coordinate identifier for a sequence description

Parameters:
  • header – Sequence header string
  • coord_type – Sequence header format; one of [‘illumina’, ‘solexa’, ‘sra’, ‘454’, ‘presto’]; if unrecognized type or None return sequence ID.
  • delimiter – Tuple of delimiters for (fields, values, value lists)
Returns:

Coordinate identifier as a string

Return type:

str
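
A coordinate identifier is the part of a header shared by reads from the same cluster or spot. One plausible reading of the coord_type options is sketched below; the splitting rules per type are assumptions rather than documented behavior, and `coord_key_sketch` is a hypothetical helper that takes only the field delimiter instead of the full delimiter tuple.

```python
def coord_key_sketch(header, coord_type='presto', field_delimiter='|'):
    """Derive a coordinate identifier from a header string (hypothetical helper)."""
    if coord_type in ('illumina', 'solexa'):
        # Assumption: drop the read block, the /read suffix and the #index suffix
        return header.split()[0].split('/')[0].split('#')[0]
    if coord_type in ('sra', '454'):
        # Assumption: the identifier is everything before the first space
        return header.split()[0]
    if coord_type == 'presto':
        # Assumption: the identifier is the first delimited field
        return header.split(field_delimiter)[0]
    return header  # unrecognized type or None: return the header unchanged


new_illumina = coord_key_sketch('MISEQ:132:000000000-A2F3U:1:1101:14340:1555 2:N:0:ATCACG', 'illumina')
old_illumina = coord_key_sketch('HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1', 'illumina')
presto_key = coord_key_sketch('SEQ1|PRIMER=IGHJ', 'presto')
print(new_illumina, old_illumina, presto_key)
```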

presto.Annotation.mergeAnnotation(ann_dict_1, ann_dict_2, prepend=False, delimiter=('|', '=', ', '))

Merges non-ID field annotations from one field dictionary into another

Parameters:
  • ann_dict_1 – Dictionary of field/value pairs to append to
  • ann_dict_2 – Dictionary of field/value pairs to merge with ann_dict_1
  • prepend – If True then add ann_dict_2 values to the front of any ann_dict_1 values that are already present, rather than the default behavior of appending ann_dict_2 values.
  • delimiter – Tuple of delimiters for (fields, values, value lists)
Returns:

Modified ann_dict_1 dictionary of field/value pairs

Return type:

OrderedDict
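
The effect of prepend can be sketched with plain dictionaries. The helper `merge_annotation_sketch` is hypothetical: it assumes the identifier lives under an `ID` key, joins values with ',', and returns a copy instead of modifying its first argument, which may differ from how the library handles ann_dict_1.

```python
from collections import OrderedDict


def merge_annotation_sketch(ann_1, ann_2, prepend=False, sep=','):
    """Merge non-ID fields of ann_2 into a copy of ann_1 (hypothetical helper)."""
    merged = OrderedDict(ann_1)
    for field, value in ann_2.items():
        if field == 'ID':
            continue  # the ID of ann_1 is kept as-is
        if field in merged:
            # Order of the joined values depends on prepend
            pair = (value, merged[field]) if prepend else (merged[field], value)
            merged[field] = sep.join(pair)
        else:
            merged[field] = value
    return merged


first = OrderedDict([('ID', 'SEQ1'), ('PRIMER', 'IGHJ')])
second = OrderedDict([('ID', 'SEQ2'), ('PRIMER', 'IGHM'), ('COUNT', '3')])
appended = merge_annotation_sketch(first, second)
prepended = merge_annotation_sketch(first, second, prepend=True)
print(appended)
print(prepended)
```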

presto.Annotation.mergeHeader(header, fields, name, action=None, delete=False, delimiter=('|', '=', ', '))

Merges fields in a sequence header

Parameters:
  • header – an annotation dictionary returned by parseAnnotation.
  • fields – a list of the field names to merge.
  • name – the name of the new field.
  • delete – if True delete the merged fields.
  • action – the collapse action to take after the merge; one of (max, min, sum, first, last, set, cat).
  • delimiter – a tuple of delimiters for (fields, values, value lists)
Returns:

modified header dictionary.

Return type:

dict

presto.Annotation.parseAnnotation(record, fields=None, delimiter=('|', '=', ', '))

Extracts annotations from a FASTA/FASTQ sequence description

Parameters:
  • record – Description string to extract annotations from
  • fields – List of fields to subset the return dictionary to; if None return all fields
  • delimiter – a tuple of delimiters for (fields, values, value lists)
Returns:

An OrderedDict of field/value pairs

Return type:

OrderedDict
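
To make the delimiter tuple concrete, the sketch below splits a description on the field delimiter and then splits each piece on the field/value delimiter. It assumes the first piece is the sequence identifier and stores it under an `ID` key; `parse_annotation_sketch` is a hypothetical stand-in, not the pRESTO implementation, and it leaves value lists as unsplit strings.

```python
from collections import OrderedDict


def parse_annotation_sketch(description, fields=None, delimiter=('|', '=', ',')):
    """Split a description string into field/value pairs (hypothetical helper)."""
    parts = description.split(delimiter[0])
    ann = OrderedDict([('ID', parts[0])])  # assumption: the first piece is the ID
    for part in parts[1:]:
        field, _, value = part.partition(delimiter[1])
        ann[field] = value
    if fields is not None:
        # Keep only the requested fields, in the requested order
        ann = OrderedDict((f, ann[f]) for f in fields if f in ann)
    return ann


all_fields = parse_annotation_sketch('SEQ1|PRIMER=IGHJ|COUNT=3,5')
subset = parse_annotation_sketch('SEQ1|PRIMER=IGHJ|COUNT=3,5', fields=['COUNT'])
print(all_fields)
print(subset)
```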

presto.Annotation.parseLog(record)

Parses a pRESTO log record

Parameters: record (str) – a string of lines representing a log record, including newline characters.
Returns: the parsed log as a dictionary of field and value pairs.
Return type: collections.OrderedDict

presto.Annotation.renameAnnotation(ann_dict, old_field, new_field, delimiter=('|', '=', ', '))

Renames an annotation and merges annotations if the new name already exists

Parameters:
  • ann_dict – Dictionary of field/value pairs
  • old_field – Old field name
  • new_field – New field name
  • delimiter – Tuple of delimiters for (fields, values, value lists)
Returns:

Modified fields dictionary

Return type:

OrderedDict

presto.Annotation.renameHeader(header, fields, names, actions=None, delimiter=('|', '=', ', '))

Renames fields in a sequence header

Parameters:
  • header – an annotation dictionary returned by parseAnnotation.
  • fields – a list of the current field names.
  • names – a list of the new field names.
  • actions – the list of collapse actions to take after the rename; one of (max, min, sum, first, last, set, cat) for each field.
  • delimiter – a tuple of delimiters for (fields, values, value lists).
Returns:

modified header dictionary.

Return type:

dict