presto.Annotation

Annotation functions

presto.Annotation.addHeader(header, fields, values, delimiter=('|', '=', ','))

Adds fields and values to a sequence header

Parameters:
  • header – an annotation dictionary returned by parseAnnotation.

  • fields – the list of fields to add or append to.

  • values – the list of annotation values to add for each field.

  • delimiter – a tuple of delimiters for (fields, values, value lists).

Returns:

modified header dictionary.

Return type:

dict
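
The add-or-append behavior described above can be sketched in plain Python. This is an illustrative stand-in, not pRESTO's implementation: `add_fields` is a hypothetical helper, and joining repeated values with the value-list delimiter is an assumption based on the parameter descriptions.

```python
from collections import OrderedDict


def add_fields(header, fields, values, list_delim=","):
    """Hypothetical stand-in: add each field, appending when it already exists."""
    result = OrderedDict(header)  # work on a copy so the input is left untouched
    for field, value in zip(fields, values):
        value = str(value)
        if field in result:
            # Assumed behavior: existing values are extended with the list delimiter
            result[field] = list_delim.join([str(result[field]), value])
        else:
            result[field] = value
    return result


header = OrderedDict([("ID", "SEQ1"), ("PRIMER", "IGHC")])
updated = add_fields(header, ["SAMPLE", "PRIMER"], ["S1", "IGHG"])
print(updated)
```

Here SAMPLE is a new field, while PRIMER already exists and therefore gains a second value.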

presto.Annotation.annotationConsensus(seq_iter, field, delimiter=('|', '=', ','))

Calculate a consensus annotation for a set of sequences

Parameters:
  • seq_iter – an iterator or list of SeqRecord objects

  • field – the annotation field to take a consensus of

  • delimiter – a tuple of delimiters for (annotations, field/values, value lists)

Returns:

Dictionary with keys:

  • set – a list of unique annotation values

  • count – the annotation counts

  • cons – the consensus annotation

  • freq – the majority annotation frequency

Return type:

dict
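
To make the returned keys concrete, the following sketch computes a majority consensus over a plain list of annotation values. It is not the library's code: `consensus_of_values` is a hypothetical helper, it takes values directly instead of SeqRecord objects, and the exact types stored under each key are assumptions.

```python
from collections import Counter


def consensus_of_values(values):
    """Hypothetical helper: majority-vote consensus over a non-empty list."""
    counts = Counter(values)
    cons, top_count = counts.most_common(1)[0]  # most frequent value and its count
    return {
        "set": sorted(counts),            # unique annotation values
        "count": dict(counts),            # occurrences of each value
        "cons": cons,                     # consensus (majority) annotation
        "freq": top_count / len(values),  # frequency of the majority annotation
    }


result = consensus_of_values(["IGHM", "IGHM", "IGHG", "IGHM"])
print(result["cons"], result["freq"])
```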

presto.Annotation.collapseAnnotation(ann_dict, action, fields=None, delimiter=('|', '=', ','))

Collapses multiple annotations into new single annotations for each field

Parameters:
  • ann_dict – dictionary of field/value pairs

  • action – collapse action to take; one of {min, max, sum, first, last, set, cat}

  • fields – subset of ann_dict to collapse; if None, collapse all but the ID field

  • delimiter – Tuple of delimiters for (fields, values, value lists)

Returns:

Modified field dictionary

Return type:

OrderedDict
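
The collapse actions can be illustrated on a single delimited value. The sketch below is an approximation for orientation only: `collapse_value` is a hypothetical helper, and the meaning given to each action (for example, that set keeps unique values and cat concatenates them) is an assumption inferred from the action names.

```python
def collapse_value(value, action, list_delim=","):
    """Hypothetical helper: collapse one delimited annotation value."""
    items = value.split(list_delim)
    if action == "first":
        return items[0]
    if action == "last":
        return items[-1]
    if action == "set":
        # Unique values, kept in first-seen order
        return list_delim.join(dict.fromkeys(items))
    if action == "cat":
        # Concatenate the values into a single string
        return "".join(items)
    if action in ("min", "max", "sum"):
        numbers = [float(x) for x in items]
        reduced = {"min": min, "max": max, "sum": sum}[action](numbers)
        return "%g" % reduced  # compact numeric formatting
    raise ValueError("unknown action: %s" % action)


print(collapse_value("3,1,2", "sum"))
print(collapse_value("A,B,A", "set"))
```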

presto.Annotation.collapseHeader(header, fields, actions, delimiter=('|', '=', ','))

Collapses a sequence header

Parameters:
  • header – an annotation dictionary returned by parseAnnotation.

  • fields – the list of fields to collapse.

  • actions – the list of collapse actions to take; one of (max, min, sum, first, last, set, cat) for each field.

  • delimiter – a tuple of delimiters for (fields, values, value lists).

Returns:

modified header dictionary.

Return type:

dict

presto.Annotation.convert454Header(desc)

Parses 454 headers into the pRESTO format

Parameters:

desc (str) – a sequence description string.

Returns:

a dictionary of header field and value pairs.

Return type:

dict

Examples

New style 454 header:

@<accession> <length=##>
@GXGJ56Z01AE06X length=222

Old style 454 header:

@<rank_x_y> <length=##> <uaccno=accession>
@000034_0199_0169 length=437 uaccno=GNDG01201ARRCR
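
As a rough illustration of how the two 454 layouts above break down into fields, the following sketch splits the header on whitespace. `parse_454` is a hypothetical helper and the output field names (ID, LENGTH, UACCNO) are assumptions, not necessarily those produced by convert454Header.

```python
def parse_454(desc):
    """Hypothetical helper: split a 454 header into an ID and key=value fields."""
    tokens = desc.lstrip("@>").split()
    fields = {"ID": tokens[0]}  # first token is the accession or rank_x_y
    for token in tokens[1:]:
        key, _, value = token.partition("=")
        fields[key.upper()] = value
    return fields


new_style = parse_454("@GXGJ56Z01AE06X length=222")
old_style = parse_454("@000034_0199_0169 length=437 uaccno=GNDG01201ARRCR")
print(new_style)
print(old_style)
```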

presto.Annotation.convertGenbankHeader(desc, delimiter=('|', '=', ','))

Converts GenBank and RefSeq headers into the pRESTO format

Parameters:
  • desc (str) – a sequence description string.

  • delimiter (tuple) – a tuple of delimiters for (fields, values, value lists).

Returns:

a dictionary of header field and value pairs.

Return type:

dict

Examples

New style GenBank header:

<accession>.<version> <description>
>CM000663.2 Homo sapiens chromosome 1, GRCh38 reference primary assembly

Old style GenBank header:

gi|<GI record number>|<dbsrc>|<accession>.<version>|<description>
>gi|568336023|gb|CM000663.2| Homo sapiens chromosome 1, GRCh38 reference primary assembly
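
The two GenBank layouts above can be told apart with a pair of regular expressions. This is a minimal sketch under stated assumptions: `parse_genbank` and the group names are hypothetical, and the patterns only cover the two example formats shown here.

```python
import re

# Old style: gi|<GI record number>|<dbsrc>|<accession>.<version>|<description>
OLD_STYLE = re.compile(
    r"^>?gi\|(?P<gi>\d+)\|(?P<dbsrc>\w+)\|(?P<accession>[\w.]+)\|\s*(?P<desc>.*)$"
)
# New style: <accession>.<version> <description>
NEW_STYLE = re.compile(r"^>?(?P<accession>[\w.]+)\s+(?P<desc>.*)$")


def parse_genbank(desc):
    """Hypothetical helper: return the matched groups, or None if nothing matches."""
    for pattern in (OLD_STYLE, NEW_STYLE):
        match = pattern.match(desc)
        if match:
            return match.groupdict()
    return None


new = parse_genbank(">CM000663.2 Homo sapiens chromosome 1, GRCh38 reference primary assembly")
old = parse_genbank(">gi|568336023|gb|CM000663.2| Homo sapiens chromosome 1, GRCh38 reference primary assembly")
print(new["accession"], old["gi"])
```

The old-style pattern is tried first because it is the more specific of the two.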

presto.Annotation.convertGenericHeader(desc, delimiter=('|', '=', ','))

Converts any header to the pRESTO format

Parameters:
  • desc (str) – a sequence description string.

  • delimiter (tuple) – a tuple of delimiters for (fields, values, value lists).

Returns:

a dictionary of header field and value pairs.

Return type:

dict

presto.Annotation.convertIMGTHeader(desc, simple=False)

Converts germline headers from IMGT/GENE-DB into the pRESTO format

Parameters:
  • desc (str) – a sequence description string.

  • simple (bool) – if True then the header will be converted to only the allele name.

Returns:

a dictionary of header field and value pairs.

Return type:

dict

Examples

IMGT header:

>X60503|IGHV1-18*02|Homo sapiens|F|V-REGION|142..417|276 nt|1| | | | |276+24=300|partial in 3'| |

Header contains 15 fields separated by | (http://imgt.org/genedb):

  1. IMGT/LIGM-DB accession number(s).

  2. Gene and allele name.

  3. Species.

  4. Functionality.

  5. Exon(s), region name(s), or extracted label(s).

  6. Start and end positions in the IMGT/LIGM-DB accession number(s).

  7. Number of nucleotides in the IMGT/LIGM-DB accession number(s).

  8. Codon start, or ‘NR’ (not relevant) for non coding labels and out-of-frame pseudogenes.

  9. Number of nucleotides added in 5' compared to the corresponding label extracted from IMGT/LIGM-DB.

  10. Number of nucleotides added or removed in 3' compared to the corresponding label extracted from IMGT/LIGM-DB.

  11. Number of added, deleted, and/or substituted nucleotides to correct sequencing errors, or ‘not corrected’ if non corrected sequencing errors.

  12. Number of amino acids (AA). This field indicates that the sequence is in amino acids.

  13. Number of characters in the sequence. Nucleotides (or AA) plus IMGT gaps.

  14. Partial (if it is).

  15. Reverse complementary (if it is).
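
The 15-field layout listed above can be split positionally. The sketch below is illustrative only: `split_imgt` and the short field names in `IMGT_FIELDS` are hypothetical labels chosen for this example, not the keys returned by convertIMGTHeader.

```python
# Hypothetical short names for the 15 IMGT fields, in order
IMGT_FIELDS = [
    "accession", "allele", "species", "functionality", "region",
    "positions", "nucleotides", "codon_start", "added_5prime",
    "added_3prime", "corrections", "amino_acids", "characters",
    "partial", "reverse_complement",
]


def split_imgt(desc):
    """Hypothetical helper: map the pipe-separated IMGT fields to names."""
    parts = [part.strip() for part in desc.lstrip(">").split("|")]
    # zip stops at 15 names, so the empty piece after the final '|' is ignored
    return dict(zip(IMGT_FIELDS, parts))


imgt = split_imgt(
    ">X60503|IGHV1-18*02|Homo sapiens|F|V-REGION|142..417|276 nt|1| | | | |276+24=300|partial in 3'| |"
)
print(imgt["allele"], imgt["species"])
```

Taking only the `allele` entry corresponds in spirit to the simple=True option, which reduces the header to the allele name.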

presto.Annotation.convertIlluminaHeader(desc)

Converts Illumina headers into the pRESTO format

Parameters:

desc (str) – a sequence description string.

Returns:

a dictionary of header field and value pairs.

Return type:

dict

Examples

New style Illumina header:

@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read number>:<is filtered>:<control number>:<index sequence>
@MISEQ:132:000000000-A2F3U:1:1101:14340:1555 2:N:0:ATCACG

Old style Illumina header:

@<instrument>:<flowcell lane>:<tile>:<x-pos>:<y-pos>#<index sequence>/<read number>
@HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
@MS6_33112:1:1101:18371:1066/1
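
Both Illumina layouts above follow a fixed colon-separated structure, which the following sketch captures with two regular expressions. The helper `parse_illumina` and the group names are hypothetical, and the patterns are written only against the example headers shown here.

```python
import re

# New style: 7 colon-separated coordinates, a space, then 4 read-level fields
NEW_ILLUMINA = re.compile(
    r"^@?(?P<instrument>[^:\s]+):(?P<run>\d+):(?P<flowcell>[^:\s]+):"
    r"(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"\s+(?P<read>\d+):(?P<filtered>[YN]):(?P<control>\d+):(?P<index>\S*)$"
)
# Old style: 5 colon-separated fields with optional #index and /read suffixes
OLD_ILLUMINA = re.compile(
    r"^@?(?P<instrument>[^:\s]+):(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"(?:#(?P<index>[^/\s]+))?(?:/(?P<read>\d+))?$"
)


def parse_illumina(desc):
    """Hypothetical helper: return (style, fields), or (None, None) if unmatched."""
    for style, pattern in (("new", NEW_ILLUMINA), ("old", OLD_ILLUMINA)):
        match = pattern.match(desc)
        if match:
            return style, match.groupdict()
    return None, None


style_a, fields_a = parse_illumina("@MISEQ:132:000000000-A2F3U:1:1101:14340:1555 2:N:0:ATCACG")
style_b, fields_b = parse_illumina("@HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1")
style_c, fields_c = parse_illumina("@MS6_33112:1:1101:18371:1066/1")
print(style_a, fields_a["index"], style_b, fields_b["read"])
```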

presto.Annotation.convertMIGECHeader(desc)

Parses headers from the MIGEC tool into the pRESTO format

Parameters:

desc (str) – a sequence description string.

Returns:

a dictionary of header field and value pairs.

Return type:

dict

Examples

MIGEC header:

@MIG UMI:<UMI sequence>:<consensus read count>
@MIG UMI:TCGGCCAACAAA:8
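
The MIGEC layout above carries just a UMI and a read count, so a single pattern is enough to pull both out. This is a sketch, not the library's parser; `parse_migec` and the returned key names are hypothetical.

```python
import re

# @MIG UMI:<UMI sequence>:<consensus read count>
MIGEC = re.compile(r"^@?MIG\s+UMI:(?P<umi>[ACGTN]+):(?P<count>\d+)$")


def parse_migec(desc):
    """Hypothetical helper: extract the UMI and consensus read count."""
    match = MIGEC.match(desc)
    if match is None:
        return None
    return {"umi": match.group("umi"), "count": int(match.group("count"))}


migec = parse_migec("@MIG UMI:TCGGCCAACAAA:8")
print(migec)
```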

presto.Annotation.convertSRAHeader(desc)

Parses NCBI SRA or EMBL-EBI ENA headers into the pRESTO format

Parameters:

desc (str) – a sequence description string.

Returns:

a dictionary of header field and value pairs.

Return type:

dict

Examples

Header from fastq-dump --split-files:

@<accession>.<spot> <original sequence description> <length=#>
@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
@SRR1383326.1 1 length=250

Header from fastq-dump --split-files -I:

@<accession>.<spot>.<read number> <original sequence description> <length=#>
@SRR1383326.1.1 1 length=250

Header from ENA:

@<accession>.<spot> <original sequence description>
@ERR220397.1 HKSQ1MM01DXT2W/3
@ERR346596.1 BS-DSFCONTROL04:4:000000000-A3F0Y:1:1101:12758:1640/1
@ERR346596.1 BS-DSFCONTROL04:4:000000000-A3F0Y:1:1101:12758:1640/2
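
The three SRA and ENA layouts above differ only in the optional read number and the optional trailing length, so one pattern with optional groups can cover them. The following is a minimal sketch with hypothetical names (`parse_sra` and the group names), checked only against the example headers listed here.

```python
import re

SRA = re.compile(
    r"^@?(?P<accession>[A-Z]{3}\d+)\.(?P<spot>\d+)"  # <accession>.<spot>
    r"(?:\.(?P<read>\d+))?"                          # optional .<read number>
    r"(?:\s+(?P<desc>.*?))?"                         # optional original description
    r"(?:\s+length=(?P<length>\d+))?$"               # optional length=# suffix
)


def parse_sra(desc):
    """Hypothetical helper: return the matched groups, or None if unmatched."""
    match = SRA.match(desc)
    return match.groupdict() if match else None


plain = parse_sra("@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36")
with_read = parse_sra("@SRR1383326.1.1 1 length=250")
ena = parse_sra("@ERR220397.1 HKSQ1MM01DXT2W/3")
print(plain["accession"], with_read["read"], ena["desc"])
```

Groups that are absent from a given header come back as None, which makes it easy to tell the layouts apart afterwards.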

presto.Annotation.copyHeader(header, fields, names, actions=None, delimiter=('|', '=', ','))

Copies fields in a sequence header

Parameters:
  • header – an annotation dictionary returned by parseAnnotation.

  • fields – a list of the field names to copy.

  • names – a list of the new field names.

  • actions – the list of collapse actions to take after the copy; one of (max, min, sum, first, last, set, cat) for each field.

  • delimiter – a tuple of delimiters for (fields, values, value lists).

Returns:

modified header dictionary.

Return type:

dict

presto.Annotation.deleteHeader(header, fields, delimiter=('|', '=', ','))

Deletes fields from a sequence header

Parameters:
  • header – an annotation dictionary returned by parseAnnotation.

  • fields – the list of fields to delete.

  • delimiter – a tuple of delimiters for (fields, values, value lists).

Returns:

modified header dictionary

Return type:

dict

presto.Annotation.expandHeader(header, fields, separator=',', delimiter=('|', '=', ','))

Splits an annotation value into separate fields in a sequence header

Parameters:
  • header – an annotation dictionary returned by parseAnnotation.

  • fields – the field to split.

  • separator – the delimiter to split the values by.

  • delimiter – a tuple of delimiters for (fields, values, value lists).

Returns:

modified header dictionary.

Return type:

dict
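
One way to picture the split is shown below. This sketch is not taken from the library: `expand_fields` is a hypothetical helper, it assumes `fields` is a list of field names, and naming the new fields with numbered suffixes (PRIMER1, PRIMER2, and so on) is an assumption.

```python
from collections import OrderedDict


def expand_fields(header, fields, separator=","):
    """Hypothetical helper: split each listed field into numbered fields."""
    result = OrderedDict()
    for field, value in header.items():
        if field in fields:
            # Assumed naming scheme: FIELD1, FIELD2, ... in value order
            for i, item in enumerate(str(value).split(separator), start=1):
                result[f"{field}{i}"] = item
        else:
            result[field] = value
    return result


header = OrderedDict([("ID", "SEQ1"), ("PRIMER", "IGHC,IGHV"), ("COUNT", "5")])
expanded = expand_fields(header, ["PRIMER"])
print(expanded)
```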

presto.Annotation.flattenAnnotation(ann_dict, delimiter=('|', '=', ','))

Converts annotations from a dictionary to a FASTA/FASTQ sequence description

Parameters:
  • ann_dict – Dictionary of field/value pairs

  • delimiter – Tuple of delimiters for (fields, values, value lists)

Returns:

Formatted sequence description string

Return type:

str
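
The role of the three delimiters can be seen in a small stand-alone sketch. It is an approximation for illustration: `flatten` is a hypothetical helper, and storing the sequence identifier under the key "ID" is an assumption.

```python
from collections import OrderedDict


def flatten(ann_dict, delimiter=("|", "=", ",")):
    """Hypothetical helper: build an ID|FIELD=VALUE|... description string."""
    field_delim, value_delim, list_delim = delimiter
    parts = []
    for field, value in ann_dict.items():
        if isinstance(value, (list, tuple)):
            # Value lists are joined with the third delimiter
            value = list_delim.join(str(v) for v in value)
        if field == "ID":
            parts.append(str(value))  # the identifier is written without a name
        else:
            parts.append(f"{field}{value_delim}{value}")
    return field_delim.join(parts)


annotation = OrderedDict([("ID", "SEQ1"), ("PRIMER", ["IGHC", "IGHV"]), ("COUNT", 5)])
description = flatten(annotation)
print(description)
```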

presto.Annotation.getAnnotationValues(seq_iter, field, unique=False, delimiter=('|', '=', ','))

Gets the annotation values for a field in a sequence set, optionally restricted to unique values

Parameters:
  • seq_iter – Iterator or list of SeqRecord objects

  • field – Annotation field to retrieve values for

  • unique – If True return a list of only the unique values; if False return a list of all values

  • delimiter – Tuple of delimiters for (fields, values, value lists)

Returns:

List of values for the field

Return type:

list

presto.Annotation.getCoordKey(header, coord_type='presto', delimiter=('|', '=', ','))

Return the coordinate identifier for a sequence description

Parameters:
  • header (str) – Sequence header string

  • coord_type (str) – Sequence header format; one of ‘illumina’, ‘solexa’, ‘sra’, ‘ena’, ‘454’, or ‘presto’; if unrecognized type or None, then return the input header.

  • delimiter (tuple) – Tuple of delimiters for (fields, values, value lists)

Returns:

Coordinate identifier as a string.

Return type:

str

presto.Annotation.mergeAnnotation(ann_dict_1, ann_dict_2, prepend=False, delimiter=('|', '=', ','))

Merges non-ID field annotations from one field dictionary into another

Parameters:
  • ann_dict_1 – Dictionary of field/value pairs to append to

  • ann_dict_2 – Dictionary of field/value pairs to merge with ann_dict_1

  • prepend – If True then add ann_dict_2 values to the front of any ann_dict_1 values that are already present, rather than the default behavior of appending ann_dict_2 values.

  • delimiter – Tuple of delimiters for (fields, values, value lists)

Returns:

Modified ann_dict_1 dictionary of field/value pairs

Return type:

OrderedDict
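
The effect of `prepend` is easiest to see side by side. The sketch below is a simplified stand-in, not the library's code: `merge` is a hypothetical helper, it assumes the identifier lives under the key "ID", and it returns a copy instead of modifying its first argument.

```python
from collections import OrderedDict


def merge(ann_dict_1, ann_dict_2, prepend=False, list_delim=","):
    """Hypothetical helper: merge non-ID fields of ann_dict_2 into a copy of ann_dict_1."""
    merged = OrderedDict(ann_dict_1)
    for field, value in ann_dict_2.items():
        if field == "ID":
            continue  # the ID of the first dictionary is kept
        if field in merged:
            pair = [value, merged[field]] if prepend else [merged[field], value]
            merged[field] = list_delim.join(str(x) for x in pair)
        else:
            merged[field] = str(value)
    return merged


first = OrderedDict([("ID", "SEQ1"), ("PRIMER", "IGHC")])
second = OrderedDict([("ID", "SEQ2"), ("PRIMER", "IGHV"), ("SAMPLE", "S1")])
appended = merge(first, second)
prepended = merge(first, second, prepend=True)
print(appended["PRIMER"], prepended["PRIMER"])
```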

presto.Annotation.mergeHeader(header, fields, name, action=None, delete=False, delimiter=('|', '=', ','))

Merges fields in a sequence header

Parameters:
  • header – an annotation dictionary returned by parseAnnotation.

  • fields – a list of the field names to merge.

  • name – the name of the new field.

  • delete – if True delete the merged fields.

  • action – the collapse action to take after the merge; one of (max, min, sum, first, last, set, cat).

  • delimiter – a tuple of delimiters for (fields, values, value lists)

Returns:

modified header dictionary.

Return type:

dict

presto.Annotation.parseAnnotation(record, fields=None, delimiter=('|', '=', ','))

Extracts annotations from a FASTA/FASTQ sequence description

Parameters:
  • record – Description string to extract annotations from

  • fields – List of fields to subset the return dictionary to; if None return all fields

  • delimiter – a tuple of delimiters for (fields, values, value lists)

Returns:

An OrderedDict of field/value pairs

Return type:

OrderedDict
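
The reverse of flattening can be sketched in a few lines. This is an illustration under stated assumptions and not the library's parser: `parse` is a hypothetical helper, the first element of the description is assumed to be the identifier stored under "ID", and values are left as unsplit strings.

```python
from collections import OrderedDict


def parse(record, fields=None, delimiter=("|", "=", ",")):
    """Hypothetical helper: parse an ID|FIELD=VALUE|... description string."""
    field_delim, value_delim, _ = delimiter
    parts = record.split(field_delim)
    annotation = OrderedDict([("ID", parts[0])])
    for part in parts[1:]:
        name, _, value = part.partition(value_delim)
        annotation[name] = value
    if fields is not None:
        # Choice made for this sketch: keep only the requested fields,
        # which drops "ID" unless it is requested explicitly
        annotation = OrderedDict(
            (k, v) for k, v in annotation.items() if k in fields
        )
    return annotation


full = parse("SEQ1|PRIMER=IGHC,IGHV|COUNT=5")
subset = parse("SEQ1|PRIMER=IGHC,IGHV|COUNT=5", fields=["PRIMER"])
print(full)
print(subset)
```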

presto.Annotation.parseLog(record)

Parses a pRESTO log record

Parameters:

record (str) – a string of lines representing a log record including newline characters.

Returns:

parsed log containing field and value pairs as a dictionary.

Return type:

collections.OrderedDict

presto.Annotation.renameAnnotation(ann_dict, old_field, new_field, delimiter=('|', '=', ','))

Renames an annotation and merges annotations if the new name already exists

Parameters:
  • ann_dict – Dictionary of field/value pairs

  • old_field – Old field name

  • new_field – New field name

  • delimiter – Tuple of delimiters for (fields, values, value lists)

Returns:

Modified fields dictionary

Return type:

OrderedDict

presto.Annotation.renameHeader(header, fields, names, actions=None, delimiter=('|', '=', ','))

Renames fields in a sequence header

Parameters:
  • header – an annotation dictionary returned by parseAnnotation.

  • fields – a list of the current field names.

  • names – a list of the new field names.

  • actions – the list of collapse actions to take after the rename; one of (max, min, sum, first, last, set, cat) for each field.

  • delimiter – a tuple of delimiters for (fields, values, value lists).

Returns:

modified header dictionary.

Return type:

dict