presto.Sequence

Sequence processing functions

presto.Sequence.calculateDiversity(seq_list, score_dict=getDNAScoreDict())

Determine the average pairwise error rate for a list of sequences

Parameters:
  • seq_list – List of SeqRecord objects to score
  • score_dict – Optional dictionary of alignment scores as {(char1, char2): score}
Returns:

Average pairwise error rate for the list of sequences

Return type:

float
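
A minimal sketch of an average pairwise error rate, assuming plain aligned strings and exact character matching rather than SeqRecord objects and a score_dict. The names pair_error_rate and average_pairwise_error are hypothetical and are not part of presto.

```python
from itertools import combinations


def pair_error_rate(seq_a, seq_b):
    """Fraction of aligned positions at which two equal-length strings differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        return 0.0
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


def average_pairwise_error(seqs):
    """Mean error rate over all unordered pairs; 0.0 for fewer than two sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        return 0.0
    return sum(pair_error_rate(a, b) for a, b in pairs) / len(pairs)


reads = ["ACGT", "ACGA", "ACGT"]
diversity = average_pairwise_error(reads)
```

Two of the three pairs differ at one of four positions, so diversity here is (0.25 + 0.0 + 0.25) / 3.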

presto.Sequence.calculateSetError(seq_list, ref_seq, ignore_chars=['n', 'N'], score_dict=getDNAScoreDict())

Counts the occurrence of nucleotide mismatches from a reference in a set of sequences

Parameters:
  • seq_list – list of SeqRecord objects with aligned sequences.
  • ref_seq – SeqRecord object containing the reference sequence to match against.
  • ignore_chars – list of characters to exclude from mismatch counts.
  • score_dict – optional dictionary of alignment scores as {(char1, char2): score}.
Returns:

error rate for the set.

Return type:

float

presto.Sequence.checkSeqEqual(seq1, seq2, ignore_chars={'n', '-', 'N', '.'})

Determine if two sequences are equal, excluding missing positions

Parameters:
  • seq1 – SeqRecord object
  • seq2 – SeqRecord object
  • ignore_chars – Set of characters to ignore
Returns:

True if the sequences are equal

Return type:

bool
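
The comparison can be sketched on plain strings as follows. The function name seqs_equal is hypothetical, and treating sequences of different lengths as unequal is an assumption of this sketch, not documented behavior.

```python
def seqs_equal(seq_a, seq_b, ignore_chars=frozenset("nN-.")):
    """Return True if the strings agree at every position where neither is masked."""
    # Assumption of this sketch: unequal lengths count as unequal.
    if len(seq_a) != len(seq_b):
        return False
    for a, b in zip(seq_a, seq_b):
        # Skip positions where either character is missing or masked.
        if a in ignore_chars or b in ignore_chars:
            continue
        if a != b:
            return False
    return True


same = seqs_equal("ACGT", "ACNT")
different = seqs_equal("ACGT", "ACTT")
```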

presto.Sequence.compilePrimers(primers)

Translates IUPAC Ambiguous Nucleotide characters to regular expressions and compiles them

Parameters:
  • primers – Dictionary of sequences to translate
Returns:

Dictionary of compiled regular expressions

Return type:

dict

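
A rough sketch of the translation step, assuming the standard IUPAC nucleotide codes and a dictionary of primer name to primer string. The table IUPAC_TO_REGEX and the function compile_primers are hypothetical names for illustration only.

```python
import re

# Standard IUPAC ambiguous nucleotide codes written as regex character classes.
IUPAC_TO_REGEX = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def compile_primers(primers):
    """Translate each primer to a regex pattern and compile it."""
    compiled = {}
    for name, seq in primers.items():
        pattern = "".join(IUPAC_TO_REGEX[char] for char in seq.upper())
        compiled[name] = re.compile(pattern)
    return compiled


regexes = compile_primers({"fwd": "ACRYN"})
hit = regexes["fwd"].match("ACGTA")
miss = regexes["fwd"].match("ACCTA")
```

The second read fails to match because R stands for A or G, not C.
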
presto.Sequence.deleteSeqPositions(seq, positions)

Deletes a list of positions from a SeqRecord

Parameters:
  • seq – SeqRecord object
  • positions – Set of positions (indices) to delete
Returns:

Modified SeqRecord with the specified positions removed

Return type:

SeqRecord
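
On a plain string the deletion reduces to a filter over indices, as in the sketch below. The name delete_positions is hypothetical. A real SeqRecord also carries per-letter annotations, such as quality scores, that would need the same filtering; the sketch ignores them.

```python
def delete_positions(seq, positions):
    """Return seq with the characters at the given 0-based indices removed."""
    drop = set(positions)
    return "".join(char for i, char in enumerate(seq) if i not in drop)


trimmed = delete_positions("ACGT", {1, 3})
```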

presto.Sequence.findGapPositions(seq_list, max_gap, gap_chars={'-', '.'})

Finds positions in a set of aligned sequences with a high number of gap characters.

Parameters:
  • seq_list – List of SeqRecord objects with aligned sequences
  • max_gap – Float of the maximum gap frequency to consider a position as non-gapped
  • gap_chars – Set of characters to consider as gaps
Returns:

Positions (indices) with gap frequency greater than max_gap

Return type:

list
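
A minimal sketch of the gap scan, assuming aligned plain strings of equal length and a gap frequency computed as gap count divided by the number of sequences. The name find_gap_positions is hypothetical.

```python
def find_gap_positions(seqs, max_gap, gap_chars=frozenset("-.")):
    """Return 0-based column indices whose gap frequency exceeds max_gap."""
    if not seqs:
        return []
    total = len(seqs)
    positions = []
    # zip(*seqs) walks the alignment one column at a time.
    for i, column in enumerate(zip(*seqs)):
        gaps = sum(1 for char in column if char in gap_chars)
        if gaps / total > max_gap:
            positions.append(i)
    return positions


aligned = ["AC-T", "A--T", "ACGT"]
gapped = find_gap_positions(aligned, max_gap=0.5)
```

Column 2 is gapped in two of three sequences, so it is the only index above the 0.5 cutoff.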

presto.Sequence.frequencyConsensus(seq_list, min_freq=0.6, ignore_chars={'n', '-', 'N', '.'})

Builds a consensus sequence from a set of sequences based on character frequency

Parameters:
  • seq_list – List of SeqRecord objects
  • min_freq – Frequency cutoff to assign a base
  • ignore_chars – Set of characters to exclude when building a consensus sequence
Returns:

Consensus SeqRecord object

Return type:

SeqRecord
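
A sketch of a frequency-based consensus on aligned plain strings. Two choices here are assumptions rather than documented behavior: frequencies are computed over all sequences in the column, and positions with no base reaching min_freq are written as N. The name frequency_consensus is hypothetical.

```python
from collections import Counter


def frequency_consensus(seqs, min_freq=0.6, ignore_chars=frozenset("nN-.")):
    """Pick the most frequent non-ignored character per column, or N below the cutoff."""
    consensus = []
    for column in zip(*seqs):
        counts = Counter(char for char in column if char not in ignore_chars)
        if not counts:
            # Every character in the column was ignored.
            consensus.append("N")
            continue
        base, count = counts.most_common(1)[0]
        # Assumption: the denominator is the full column height.
        freq = count / len(column)
        consensus.append(base if freq >= min_freq else "N")
    return "".join(consensus)


reads = ["ACGT", "ACGA", "ACTA", "ANGT"]
cons = frequency_consensus(reads)
```

The last column is split evenly between T and A (0.5 each), so it falls below the 0.6 cutoff and is masked.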

presto.Sequence.getAAScoreDict(mask_score=None, gap_score=None)

Generates a score dictionary for IUPAC Extended Protein characters

Parameters:
  • mask_score – Tuple of length two defining scores for all matches against an X character for (a, b), with the score for character (a) taking precedence; if None, score symmetrically according to IUPAC character identity
  • gap_score – Tuple of length two defining scores for all matches against a gap (-, .) character for (a, b), with the score for character (a) taking precedence; if None, score symmetrically according to IUPAC character identity
Returns:

Score dictionary with keys (char1, char2) mapping to scores

Return type:

dict

presto.Sequence.getDNAScoreDict(mask_score=None, gap_score=None)

Generates a score dictionary for IUPAC Ambiguous Nucleotide characters

Parameters:
  • mask_score – Tuple of length two defining scores for all matches against an N character for (a, b), with the score for character (a) taking precedence; if None, score symmetrically according to IUPAC character identity
  • gap_score – Tuple of length two defining scores for all matches against a gap (-, .) character for (a, b), with the score for character (a) taking precedence; if None, score symmetrically according to IUPAC character identity
Returns:

Score dictionary with keys (char1, char2) mapping to scores

Return type:

dict
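
One plausible reading of symmetric IUPAC scoring, sketched without gap handling: two codes score 1 when the sets of bases they represent overlap, and 0 otherwise. The overlap rule, the handling of mask_score, and the names score_pair_chars and build_score_dict are assumptions for illustration and may differ from the presto implementation.

```python
from itertools import product

# IUPAC ambiguous nucleotide codes and the bases each one represents.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def score_pair_chars(a, b, mask_score=None):
    """Score one character pair: 1 if the codes share a base, otherwise 0.

    Assumption: when mask_score is given, an N in position a returns
    mask_score[0] and an N in position b returns mask_score[1], with
    position a checked first.
    """
    if mask_score is not None:
        if a == "N":
            return mask_score[0]
        if b == "N":
            return mask_score[1]
    return int(bool(set(IUPAC_DNA[a]) & set(IUPAC_DNA[b])))


def build_score_dict(mask_score=None):
    """Map every ordered pair of codes to its score."""
    return {(a, b): score_pair_chars(a, b, mask_score)
            for a, b in product(IUPAC_DNA, repeat=2)}


scores = build_score_dict()
masked = build_score_dict(mask_score=(0, 1))
```

With mask_score=(0, 1), the pair ("N", "A") scores 0 while ("A", "N") scores 1, which illustrates the precedence of the first character.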

presto.Sequence.indexSeqSets(seq_dict, field='BARCODE', delimiter=('|', '=', ', '))

Identifies sets of sequences with the same ID field

Parameters:
  • seq_dict – a dictionary index of sequences returned from SeqIO.index()
  • field – the annotation field containing set IDs
  • delimiter – a tuple of delimiters for (fields, values, value lists)
Returns:

Dictionary mapping set name to a list of record names

Return type:

dict
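
A sketch of grouping records by an annotation field, assuming headers of the form ID|FIELD=value|FIELD=value and a plain list of header strings in place of a SeqIO.index() dictionary. The header layout and the names parse_annotation and index_sets are assumptions for illustration; only the first two delimiters are used.

```python
from collections import defaultdict


def parse_annotation(header, delimiter=("|", "=", ", ")):
    """Split a header into a dict with the record ID and its FIELD=value pairs."""
    fields = header.split(delimiter[0])
    annotation = {"ID": fields[0]}
    for field in fields[1:]:
        key, _, value = field.partition(delimiter[1])
        annotation[key] = value
    return annotation


def index_sets(headers, field="BARCODE", delimiter=("|", "=", ", ")):
    """Map each value of the chosen field to the list of headers carrying it."""
    index = defaultdict(list)
    for header in headers:
        annotation = parse_annotation(header, delimiter)
        if field in annotation:
            index[annotation[field]].append(header)
    return dict(index)


headers = ["r1|BARCODE=AAA", "r2|BARCODE=CCC", "r3|BARCODE=AAA"]
sets = index_sets(headers)
```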

presto.Sequence.qualityConsensus(seq_list, min_qual=20, min_freq=0.6, dependent=False, ignore_chars={'n', '-', 'N', '.'})

Builds a consensus sequence from a set of sequences using quality scores

Parameters:
  • seq_list – List of SeqRecord objects
  • min_qual – Quality cutoff to assign a base
  • min_freq – Frequency cutoff to assign a base
  • dependent – If False assume sequences are independent for quality calculation
  • ignore_chars – Set of characters to exclude when building a consensus sequence
Returns:

Consensus SeqRecord object

Return type:

SeqRecord

presto.Sequence.reverseComplement(seq)

Takes the reverse complement of a sequence

Parameters:
  • seq – SeqRecord object, Seq object or string to reverse complement
Returns:

Object of the same type as the input with the reverse complement sequence

Return type:

SeqRecord, Seq or str, matching the input

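
For the string case, a reverse complement can be sketched with a translation table. The table below covers the standard IUPAC nucleotide codes (S, W and N are their own complements and pass through unchanged); the names COMPLEMENT and reverse_complement are hypothetical.

```python
# Complement each IUPAC code, in upper and lower case.
COMPLEMENT = str.maketrans(
    "ACGTRYKMBDHVacgtrykmbdhv",
    "TGCAYRMKVHDBtgcayrmkvhdb",
)


def reverse_complement(seq):
    """Complement every character, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


rc = reverse_complement("AACG")
```
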
presto.Sequence.scoreAA(a, b, mask_score=None, gap_score=None)

Returns the score for a pair of IUPAC Extended Protein characters

Parameters:
  • a – First character
  • b – Second character
  • mask_score – Tuple of length two defining scores for all matches against an X character for (a, b), with the score for character (a) taking precedence; if None, score symmetrically according to IUPAC character identity
  • gap_score – Tuple of length two defining scores for all matches against a gap (-, .) character for (a, b), with the score for character (a) taking precedence; if None, score symmetrically according to IUPAC character identity
Returns:

Score for the character pair

Return type:

int

presto.Sequence.scoreDNA(a, b, mask_score=None, gap_score=None)

Returns the score for a pair of IUPAC Ambiguous Nucleotide characters

Parameters:
  • a – First character
  • b – Second character
  • mask_score – Tuple of length two defining scores for all matches against an N character for (a, b), with the score for character (a) taking precedence; if None, score symmetrically according to IUPAC character identity
  • gap_score – Tuple of length two defining scores for all matches against a gap (-, .) character for (a, b), with the score for character (a) taking precedence; if None, score symmetrically according to IUPAC character identity
Returns:

Score for the character pair

Return type:

int

presto.Sequence.scoreSeqPair(seq1, seq2, ignore_chars=set(), score_dict=getDNAScoreDict())

Determine the error rate for a pair of sequences

Parameters:
  • seq1 – SeqRecord object
  • seq2 – SeqRecord object
  • ignore_chars – Set of characters to ignore when scoring and counting the weight
  • score_dict – Optional dictionary of alignment scores
Returns:

Tuple of the (score, minimum weight, error rate) for the pair of sequences

Return type:

tuple
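
A sketch of how the three returned values could relate, assuming plain strings, exact matching in place of a score_dict, and an error rate defined as 1 - score / weight. That formula, the zero-weight result, and the names weight and score_pair are assumptions of this sketch.

```python
def weight(seq, ignore_chars=frozenset()):
    """Count the characters in seq that are not ignored."""
    return sum(1 for char in seq if char not in ignore_chars)


def score_pair(seq_a, seq_b, ignore_chars=frozenset()):
    """Return (score, minimum weight, error rate) for two aligned strings."""
    min_weight = min(weight(seq_a, ignore_chars), weight(seq_b, ignore_chars))
    if min_weight == 0:
        # Assumption: nothing informative to compare counts as maximal error.
        return 0, 0, 1.0
    # Count matching positions where neither character is ignored.
    score = sum(
        1 for a, b in zip(seq_a, seq_b)
        if a == b and a not in ignore_chars and b not in ignore_chars
    )
    return score, min_weight, 1.0 - score / min_weight


plain = score_pair("ACGT", "ACGA")
masked = score_pair("ACNT", "ACGT", ignore_chars={"N"})
```

In the masked case the N position contributes neither to the score nor to the weight, so three matches out of a weight of three give an error rate of 0.0.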

presto.Sequence.subsetSeqIndex(seq_dict, field, values, delimiter=('|', '=', ', '))

Subsets a sequence set by annotation value

Parameters:
  • seq_dict – Dictionary index of sequences returned from SeqIO.index()
  • field – Annotation field to select keys by
  • values – List of annotation values that define the retained keys
  • delimiter – Tuple of delimiters for (annotations, field/values, value lists)
Returns:

List of keys

Return type:

list

presto.Sequence.subsetSeqSet(seq_iter, field, values, delimiter=('|', '=', ', '))

Subsets a sequence set by annotation value

Parameters:
  • seq_iter – Iterator or list of SeqRecord objects
  • field – Annotation field to select by
  • values – List of annotation values that define the retained sequences
  • delimiter – Tuple of delimiters for (annotations, field/values, value lists)
Returns:

Modified list of SeqRecord objects

Return type:

list
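
A sketch of subsetting by annotation value, assuming records are (header, sequence) tuples and headers follow an ID|FIELD=value layout. The record representation, the header layout, and the names get_field and subset_records are assumptions for illustration; only the first two delimiters are used.

```python
def get_field(header, field, delimiter=("|", "=", ", ")):
    """Return the value of one annotation field, or None if it is absent."""
    for item in header.split(delimiter[0])[1:]:
        key, _, value = item.partition(delimiter[1])
        if key == field:
            return value
    return None


def subset_records(records, field, values, delimiter=("|", "=", ", ")):
    """Keep the records whose annotation value for field is in values."""
    keep = set(values)
    return [rec for rec in records if get_field(rec[0], field, delimiter) in keep]


records = [
    ("r1|PRIMER=V1", "ACGT"),
    ("r2|PRIMER=V2", "AAAA"),
    ("r3|PRIMER=V1", "CCCC"),
]
selected = subset_records(records, "PRIMER", ["V1"])
```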

presto.Sequence.translateAmbigDNA(key)

Translates IUPAC Ambiguous Nucleotide characters to or from character sets

Parameters:
  • key – String or regular expression match object (as returned by re.search) containing the character set to translate
Returns:

Character translation

Return type:

str

presto.Sequence.weightSeq(seq, ignore_chars=set())

Returns the length of a sequence excluding ignored characters

Parameters:
  • seq – SeqRecord or Seq object
  • ignore_chars – Set of characters to ignore when counting sequence length
Returns:

Sum of the character scores for the sequence

Return type:

int