EstimateError.py

Calculates annotation set error rates

usage: EstimateError.py [--version] [-h]  ...
--version

show program’s version number and exit

-h, --help

show this help message and exit

output files:
error-position

estimated error by read position.

error-quality

estimated error by the quality score assigned within the input file.

error-nucleotide

estimated error by nucleotide.

error-set

estimated error by annotation set size.

distance-set

pairwise Hamming distances by annotation set.

threshold-set

thresholds from pairwise Hamming distances for annotation sets.

distance-barcode

estimated error by pairwise Hamming distances.

threshold-barcode

thresholds from pairwise Hamming distances for clustering barcodes.

output fields:
POSITION

read position, using zero-based indexing.

Q

Phred quality score.

OBSERVED

observed nucleotide value.

REFERENCE

consensus nucleotide for the barcode read group.

SET_COUNT

barcode read group size.

REPORTED_Q

mean Phred quality score reported within the input file for the given position, quality score, nucleotide or read group.

MISMATCHES

count of observed mismatches from consensus for the given position, quality score, nucleotide or read group.

OBSERVATIONS

total count of observed values for each position, quality score, nucleotide or read group size.

ERROR

estimated error rate.

EMPIRICAL_Q

estimated error rate converted to a Phred quality score.

ALL

histogram (count) of the distribution of all pairwise distances.

DTN

histogram (count) of the distance-to-nearest distribution.

DISTANCE

length-normalized Hamming distance.
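The relationship between the MISMATCHES, OBSERVATIONS, ERROR and EMPIRICAL_Q fields can be illustrated with the standard Phred definition, Q = -10 * log10(P). The sketch below is not taken from the tool's source: the function names are hypothetical, and it assumes that ERROR is simply the ratio of mismatches to observations.

```python
import math


def error_rate(mismatches, observations):
    """Assumed definition of ERROR: mismatches divided by observations."""
    if observations == 0:
        return 0.0
    return mismatches / observations


def error_to_phred(error):
    """Convert an error probability to a Phred score, Q = -10 * log10(P)."""
    if error <= 0:
        # No observed errors: the score is unbounded. A real tool would
        # likely cap this value; the cap is not specified here.
        return math.inf
    return -10.0 * math.log10(error)


def phred_to_error(q):
    """Inverse conversion, P = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    # 5 mismatches in 500 observations is a 1% error rate.
    rate = error_rate(5, 500)
    print(rate, error_to_phred(rate))
```

Comparing a value computed this way against REPORTED_Q shows whether the quality scores in the input file over- or under-state the observed error.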

EstimateError.py barcode

Calculates pairwise distance metrics of barcode sequences.

usage: EstimateError.py barcode [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                                [--outdir OUT_DIR] [--outname OUT_NAME]
                                [--delim DELIMITER DELIMITER DELIMITER]
                                [-f BARCODE_FIELD] [--pad {none,head,tail}]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

--outdir <out_dir>

Changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-f <barcode_field>

The name of the barcode field. Note, barcodes are expected to all be of identical length. Barcode sequences shorter than the maximum barcode length will be excluded from the distance calculations unless they are padded using the --pad argument.

--pad {none,head,tail}

Specifies the action to take for barcode sequences shorter than the maximum barcode length. The “none” action will exclude truncated barcodes from the distance calculations. The “head” and “tail” actions will add N characters to the front or back, respectively, of each truncated barcode sequence to give all barcodes identical length. N characters will be treated as mismatches in the distance calculation.
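The effect of the --pad actions on the distance calculation can be sketched as follows. This is an illustration of the behavior described above, not the tool's implementation: pad_barcode and normalized_hamming are hypothetical helper names, and the sketch assumes that any position containing an N counts as a mismatch, including when both barcodes carry an N there.

```python
def pad_barcode(barcode, length, mode="none"):
    """Pad a truncated barcode with N characters, or return None to exclude it."""
    missing = length - len(barcode)
    if missing <= 0:
        return barcode
    if mode == "head":
        return "N" * missing + barcode
    if mode == "tail":
        return barcode + "N" * missing
    # mode == "none": truncated barcodes are left out of the calculation
    return None


def normalized_hamming(a, b):
    """Length-normalized Hamming distance, counting any N as a mismatch."""
    if len(a) != len(b):
        raise ValueError("barcodes must have identical length")
    mismatches = sum(
        1 for x, y in zip(a, b) if x != y or x == "N" or y == "N"
    )
    return mismatches / len(a)


if __name__ == "__main__":
    barcodes = ["ACGTA", "ACGTT", "CGT"]
    max_len = max(len(bc) for bc in barcodes)
    padded = [pad_barcode(bc, max_len, mode="tail") for bc in barcodes]
    kept = [bc for bc in padded if bc is not None]
    print(kept)
    print(normalized_hamming(kept[0], kept[1]))
```

Because padded positions always count as mismatches under this assumption, padding inflates the distances of truncated barcodes rather than hiding the truncation.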

EstimateError.py set

Estimates error statistics within annotation sets.

usage: EstimateError.py set [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                            [--outdir OUT_DIR] [--outname OUT_NAME]
                            [--log LOG_FILE]
                            [--delim DELIMITER DELIMITER DELIMITER]
                            [--nproc NPROC] [-f SET_FIELD] [-n MIN_COUNT]
                            [--mode {freq,qual}] [-q MIN_QUAL]
                            [--freq MIN_FREQ] [--maxdiv MAX_DIVERSITY]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

--outdir <out_dir>

Changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilize).

-f <set_field>

The name of the annotation field to group sequences by.

-n <min_count>

The minimum number of sequences needed to consider a set.

--mode {freq,qual}

Specifies which method to use to determine the consensus sequence. The “freq” method will determine the consensus by nucleotide frequency at each position and assign the most common value. The “qual” method will weight values by their quality scores to determine the consensus nucleotide at each position.

-q <min_qual>

Consensus quality score cut-off under which an ambiguous character is assigned.

--freq <min_freq>

Fraction of character occurrences under which an ambiguous character is assigned.

--maxdiv <max_diversity>

Specify to calculate the nucleotide diversity of each read group (average pairwise error rate) and exclude groups which exceed the given diversity threshold.
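As a rough illustration of the “freq” consensus mode, the --freq cut-off and the --maxdiv filter, the sketch below builds a frequency-based consensus for one read group and computes its average pairwise error rate. It is a simplified sketch under stated assumptions, not the tool's code: the function names are hypothetical, the threshold values are arbitrary, all reads are assumed to be of equal length, and N is used as the ambiguous character.

```python
from collections import Counter
from itertools import combinations


def frequency_consensus(seqs, min_freq):
    """Most common nucleotide per position; N where its fraction is below min_freq."""
    consensus = []
    for column in zip(*seqs):
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base if count / len(column) >= min_freq else "N")
    return "".join(consensus)


def nucleotide_diversity(seqs):
    """Average pairwise error rate (mean length-normalized mismatch count)."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        return 0.0
    total = sum(
        sum(a != b for a, b in zip(x, y)) / len(x) for x, y in pairs
    )
    return total / len(pairs)


if __name__ == "__main__":
    group = ["ACGT", "ACGA", "ACGT"]
    max_diversity = 0.2  # illustrative threshold only
    if nucleotide_diversity(group) <= max_diversity:
        print(frequency_consensus(group, min_freq=0.6))
    else:
        print("group excluded")
```

The “qual” mode would weight each observed nucleotide by its quality score instead of counting occurrences; that weighting is not shown here.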