Removes duplicate sequences from FASTA/FASTQ files

usage: CollapseSeq.py [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                      [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                      [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                      [--fasta] [--delim DELIMITER DELIMITER DELIMITER]
                      [-n MAX_MISSING] [--uf UNIQ_FIELDS [UNIQ_FIELDS ...]]
                      [--cf COPY_FIELDS [COPY_FIELDS ...]]
                      [--act {min,max,sum,set} [{min,max,sum,set} ...]]
                      [--inner] [--keepmiss]
                      [--maxf MAX_FIELD | --minf MIN_FIELD]

--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note that this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, the output filename will be based on the input filename(s).

--outdir <out_dir>

Changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.


--failed

If specified, create files containing records that fail processing.


--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
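
As an illustration of how the three delimiters divide a sequence header, the sketch below parses an annotated description line. The delimiter values `|`, `=`, and `,`, the example header, and the helper name `parse_header` are assumptions made for this example; they are not taken from the tool's source.

```python
def parse_header(header, delims=("|", "=", ",")):
    """Split a header into a sequence ID and a dict of field -> list of values.

    delims holds, in order, the delimiter between annotation blocks, between
    a field name and its value, and between values within a field.
    The defaults are assumed for illustration only.
    """
    block_delim, field_delim, value_delim = delims
    blocks = header.split(block_delim)
    seq_id = blocks[0]  # the first block is treated as the sequence ID
    fields = {}
    for block in blocks[1:]:
        name, _, value = block.partition(field_delim)
        fields[name] = value.split(value_delim)
    return seq_id, fields


seq_id, fields = parse_header("SEQ1|PRIMER=IGHG,IGHM|DUPCOUNT=3")
print(seq_id, fields)
```

Passing a different three-element tuple as `delims` mirrors supplying three values to --delim.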

-n <max_missing>

Maximum number of missing nucleotides to consider for collapsing sequences. A sequence will be considered undetermined if it contains too many missing nucleotides. Note, setting a value above 0 will consider ambiguous/missing nucleotides via a distance calculation, but is considerably more computationally expensive, especially on large data sets.
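
The effect of the missing-nucleotide threshold can be sketched roughly as follows. This is a simplified illustration and not the tool's actual algorithm: it assumes `N` is the only missing character, compares only equal-length sequences, and uses hypothetical function names.

```python
def count_missing(seq, missing="N"):
    """Count characters in seq that are treated as missing."""
    return sum(1 for base in seq if base in missing)


def is_duplicate(seq_a, seq_b, max_missing=0, missing="N"):
    """Return True if the sequences agree wherever neither is missing.

    A sequence with more than max_missing missing characters is treated as
    undetermined here and never matches anything.
    """
    if len(seq_a) != len(seq_b):
        return False
    if count_missing(seq_a, missing) > max_missing:
        return False
    if count_missing(seq_b, missing) > max_missing:
        return False
    # Positions where either sequence is missing are ignored in the comparison.
    return all(a == b or a in missing or b in missing
               for a, b in zip(seq_a, seq_b))


print(is_duplicate("ACNT", "ACGT", max_missing=0))
print(is_duplicate("ACNT", "ACGT", max_missing=1))
```

With a threshold of 0, exact string comparison suffices. Any higher value requires position-by-position comparison between candidate pairs, which is why the cost grows on large data sets.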

--uf <uniq_fields>

Specifies a set of annotation fields that must match for sequences to be considered duplicates.

--cf <copy_fields>

Specifies a set of annotation fields to copy into the unique sequence output.

--act {min,max,sum,set}

List of actions, one per copy field, defining how the annotations for that field will be combined into a single value. The actions “min”, “max”, and “sum” perform the corresponding mathematical operation on numeric annotations. The action “set” collapses annotations into a comma-delimited list of unique values.
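
The four actions can be pictured with a small sketch. The function name `combine` is hypothetical, and the choices to return floats for numeric actions and to keep first-seen order for “set” are assumptions of this example rather than documented behaviour.

```python
def combine(values, action):
    """Combine a list of annotation strings into one value.

    'min', 'max', and 'sum' treat the values as numbers; 'set' joins the
    unique values with commas, in first-seen order.
    """
    if action == "set":
        return ",".join(dict.fromkeys(values))
    numbers = [float(v) for v in values]
    operations = {"min": min, "max": max, "sum": sum}
    return operations[action](numbers)


print(combine(["3", "1", "2"], "sum"))
print(combine(["IGHG", "IGHM", "IGHG"], "set"))
```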


--inner

If specified, exclude consecutive missing characters at either end of the sequence.
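
One way to picture the end-exclusion behaviour is to strip runs of missing characters from both ends before counting. The set of missing characters (`N`, `-`, `.`) and the helper name are assumptions for this sketch.

```python
def count_inner_missing(seq, missing="N-."):
    """Count missing characters after dropping runs of them at both ends."""
    inner = seq.strip(missing)  # strip() removes any leading/trailing chars in the set
    return sum(1 for base in inner if base in missing)


print(count_inner_missing("NNACNGTNN"))
```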


--keepmiss

If specified, sequences with more missing characters than the threshold set by the -n parameter will be written to the unique sequence output file with a DUPCOUNT=1 annotation. If not specified, such sequences will be written to a separate file.

--maxf <max_field>

Specify the field whose maximum value determines the retained sequence; mutually exclusive with --minf.

--minf <min_field>

Specify the field whose minimum value determines the retained sequence; mutually exclusive with --maxf.
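
Choosing the retained sequence by a field's maximum or minimum value can be sketched as below. The record layout (a dict of annotation strings per sequence), the field name `CONSCOUNT`, and the function name are hypothetical.

```python
def pick_representative(records, field, use_max=True):
    """Return the record whose numeric value for `field` is largest or smallest."""
    chooser = max if use_max else min
    return chooser(records, key=lambda rec: float(rec[field]))


duplicates = [
    {"ID": "read1", "CONSCOUNT": "4"},
    {"ID": "read2", "CONSCOUNT": "11"},
    {"ID": "read3", "CONSCOUNT": "7"},
]
print(pick_representative(duplicates, "CONSCOUNT", use_max=True)["ID"])
print(pick_representative(duplicates, "CONSCOUNT", use_max=False)["ID"])
```

Only one criterion can apply to a given set of duplicates, which is why the two options cannot be combined.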

output files:

collapse-unique

unique sequences. Contains one representative from each set of duplicate sequences. The retained representative is determined by user-defined criteria.


collapse-duplicate

raw reads which are duplicates of the sequences retained in the collapse-unique file.


collapse-undetermined

raw reads which were excluded from consideration due to having too many N characters in the sequence.

output annotation fields:

DUPCOUNT

total number of sequences within the set of duplicates for each retained unique sequence; that is, the copy number of each unique sequence within the data file.
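
The copy-number annotation can be illustrated with a minimal exact-match collapse. This sketch assumes no missing characters and no grouping by annotation fields, keeps the first read seen as the representative, and uses a hypothetical function name and input layout.

```python
def collapse_exact(records):
    """Collapse identical sequences and count the copies of each.

    records is an iterable of (seq_id, sequence) pairs. Returns a list of
    (seq_id, sequence, dupcount) tuples, one per unique sequence, keeping
    the ID of the first read seen for each sequence.
    """
    groups = {}
    for seq_id, seq in records:
        if seq in groups:
            groups[seq][1] += 1
        else:
            groups[seq] = [seq_id, 1]
    return [(first_id, seq, count) for seq, (first_id, count) in groups.items()]


reads = [("r1", "ACGT"), ("r2", "ACGT"), ("r3", "TTTT")]
for seq_id, seq, dupcount in collapse_exact(reads):
    print(f"{seq_id}|DUPCOUNT={dupcount} {seq}")
```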

<user defined>

annotation fields specified by the --cf parameter.