Removes duplicate sequences from FASTA/FASTQ files

usage: CollapseSeq [--version] [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
                       [--failed] [--log LOG_FILE]
                       [--delim DELIMITER DELIMITER DELIMITER] [--outdir OUT_DIR]
                       [--outname OUT_NAME] [-n MAX_MISSING]
                       [--uf UNIQ_FIELDS [UNIQ_FIELDS ...]]
                       [--cf COPY_FIELDS [COPY_FIELDS ...]]
                       [--act {min,max,sum,set} [{min,max,sum,set} ...]] [--inner]
                       [--keepmiss] [--maxf MAX_FIELD | --minf MIN_FIELD]

--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.


--fasta

Specify to force output as FASTA rather than FASTQ.


--failed

If specified, create files containing records that fail processing.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
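To make the three delimiter roles concrete, here is a minimal Python sketch of how a header could be split. The delimiters `|`, `=`, and `,`, the sample header, and the convention that the first block is the sequence ID are assumptions for illustration (this section does not state the defaults), and `parse_header` is a hypothetical helper, not part of the tool.

```python
def parse_header(header, delims=("|", "=", ",")):
    """Split a sequence header into an ID and a dict of annotation fields.

    delims is (block separator, field/value separator, value-list separator).
    The default values here are assumed for illustration only.
    """
    block_sep, field_sep, value_sep = delims
    blocks = header.split(block_sep)
    # Assumption: the first block is the sequence ID, the rest are fields.
    seq_id, fields = blocks[0], {}
    for block in blocks[1:]:
        name, _, value = block.partition(field_sep)
        # A field may hold several values, separated by the third delimiter.
        fields[name] = value.split(value_sep)
    return seq_id, fields


seq_id, fields = parse_header("SEQ1|PRIMER=IGHM,IGHG|COUNT=5")
print(seq_id)   # SEQ1
print(fields)   # {'PRIMER': ['IGHM', 'IGHG'], 'COUNT': ['5']}
```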

--outdir <out_dir>

Changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

-n <max_missing>

Maximum number of missing nucleotides to consider for collapsing sequences. A sequence will be considered undetermined if it contains more missing nucleotides than this threshold.
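As a rough illustration of the threshold, the following Python sketch flags a sequence as undetermined when its count of missing characters exceeds the limit. Treating only `N` as missing and ignoring case are assumptions of this sketch, and `is_undetermined` is a hypothetical name rather than a function of the tool.

```python
def is_undetermined(seq, max_missing, missing_chars="N"):
    """Return True if seq has more missing characters than max_missing."""
    upper = seq.upper()
    n_missing = sum(upper.count(c) for c in missing_chars)
    return n_missing > max_missing


print(is_undetermined("ACGTNNACGT", max_missing=2))   # False
print(is_undetermined("ACNNNTNACG", max_missing=2))   # True
```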

--uf <uniq_fields>

Specifies a set of annotation fields that must match for sequences to be considered duplicates.
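The effect of requiring matching fields can be sketched in Python as a grouping key that combines the sequence with the chosen field values. This is a simplified model, not the tool's implementation: it uses exact string matching and does not model how missing nucleotides are handled. The field name `PRIMER` and the helper `group_duplicates` are hypothetical.

```python
from collections import defaultdict


def group_duplicates(records, uniq_fields=()):
    """Group records that share a sequence and the same values in uniq_fields.

    records is an iterable of (sequence, annotations) pairs, where
    annotations is a dict mapping field name to value.
    """
    groups = defaultdict(list)
    for seq, ann in records:
        # Records are duplicates only if the whole key matches.
        key = (seq,) + tuple(ann.get(f) for f in uniq_fields)
        groups[key].append((seq, ann))
    return dict(groups)


records = [
    ("ACGT", {"ID": "r1", "PRIMER": "IGHM"}),
    ("ACGT", {"ID": "r2", "PRIMER": "IGHM"}),
    ("ACGT", {"ID": "r3", "PRIMER": "IGHG"}),
]
print(len(group_duplicates(records)))                          # 1
print(len(group_duplicates(records, uniq_fields=["PRIMER"])))  # 2
```

In this model, the size of each group corresponds to the copy number reported for its retained sequence.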

--cf <copy_fields>

Specifies a set of annotation fields to copy into the unique sequence output.

--act {min,max,sum,set}

List of actions to take for each copy field, defining how the annotations of that field are combined into a single value. The actions “min”, “max”, and “sum” perform the corresponding mathematical operation on numeric annotations. The action “set” collapses annotations into a comma-delimited list of unique values.
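The four actions can be illustrated with a short Python sketch. The helper `combine` is hypothetical, and two details are assumptions of the sketch rather than documented behavior: values are parsed as floats for the numeric actions, and the "set" output is sorted to make it deterministic.

```python
def combine(values, action):
    """Combine the annotation values of one copy field into a single value."""
    if action in ("min", "max", "sum"):
        numbers = [float(v) for v in values]
        result = {"min": min, "max": max, "sum": sum}[action](numbers)
        # Report whole numbers without a trailing ".0".
        return str(int(result)) if result.is_integer() else str(result)
    if action == "set":
        # Unique values, sorted here only to make the output deterministic.
        return ",".join(sorted(set(values)))
    raise ValueError(f"unknown action: {action}")


print(combine(["3", "1", "2"], "sum"))            # 6
print(combine(["3", "1", "2"], "min"))            # 1
print(combine(["IGHM", "IGHG", "IGHM"], "set"))   # IGHG,IGHM
```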


--inner

If specified, exclude consecutive missing characters at either end of the sequence.
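A small Python sketch shows how ignoring runs of missing characters at the ends changes the count. Treating only `N` as missing is an assumption, and `count_missing` is a hypothetical helper, not the tool's code.

```python
def count_missing(seq, inner=False, missing_chars="N"):
    """Count missing characters, optionally ignoring runs at both ends."""
    if inner:
        # Drop consecutive missing characters from the start and the end.
        seq = seq.strip(missing_chars)
    return sum(seq.count(c) for c in missing_chars)


seq = "NNNACGTNACGNN"
print(count_missing(seq))               # 6
print(count_missing(seq, inner=True))   # 1
```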


--keepmiss

If specified, sequences with more missing characters than the threshold set by the -n parameter will be written to the unique sequence output file with a DUPCOUNT=1 annotation. If not specified, such sequences will be written to a separate file.

--maxf <max_field>

Specify the field whose maximum value determines the retained sequence; mutually exclusive with --minf.

--minf <min_field>

Specify the field whose minimum value determines the retained sequence; mutually exclusive with --maxf.
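The choice of a representative by field value can be sketched in Python as follows. The helper `pick_representative` and the field name `COUNT` are hypothetical, values are assumed to be numeric, and keeping the first record when neither field is given is an assumption of this sketch rather than documented behavior.

```python
def pick_representative(group, max_field=None, min_field=None):
    """Pick one record from a group of duplicates.

    group is a list of (sequence, annotations) pairs.
    """
    if max_field and min_field:
        raise ValueError("max_field and min_field are mutually exclusive")
    if max_field:
        return max(group, key=lambda rec: float(rec[1][max_field]))
    if min_field:
        return min(group, key=lambda rec: float(rec[1][min_field]))
    # Assumption: with no field given, keep the first record seen.
    return group[0]


group = [
    ("ACGT", {"ID": "r1", "COUNT": "2"}),
    ("ACGT", {"ID": "r2", "COUNT": "7"}),
]
print(pick_representative(group, max_field="COUNT")[1]["ID"])   # r2
print(pick_representative(group, min_field="COUNT")[1]["ID"])   # r1
```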

output files:
collapse-unique
unique sequences. Contains one representative from each set of duplicate sequences. The retained representative is determined by user-defined criteria.
collapse-duplicate
raw reads which are duplicates of the sequences retained in the collapse-unique file.
collapse-undetermined
raw reads which were excluded from consideration due to having too many N characters in the sequence.
output annotation fields:
DUPCOUNT
total number of sequences within the set of duplicates for each retained unique sequence, that is, the copy number of each unique sequence within the data file.
<user defined>
annotation fields specified by the --cf parameter.