CollapseSeq

Removes duplicate sequences from FASTA/FASTQ files

usage: CollapseSeq [--version] [-h] -s SEQ_FILES [SEQ_FILES ...] [--fasta]
                       [--failed] [--log LOG_FILE]
                       [--delim DELIMITER DELIMITER DELIMITER] [--outdir OUT_DIR]
                       [--outname OUT_NAME] [-n MAX_MISSING]
                       [--uf UNIQ_FIELDS [UNIQ_FIELDS ...]]
                       [--cf COPY_FIELDS [COPY_FIELDS ...]]
                       [--act {min,max,sum,set} [{min,max,sum,set} ...]] [--inner]
                       [--keepmiss] [--maxf MAX_FIELD | --minf MIN_FIELD]

--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

--fasta

Specify to force output as FASTA rather than FASTQ.

--failed

If specified, create files containing records that fail processing.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
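
To illustrate how the three delimiters divide a sequence header, here is a minimal Python sketch. It is not the tool's own parser: the function name parse_header, the example header, and the delimiter values "|", "=" and "," are assumptions made for the illustration.

```python
def parse_header(header, delims=("|", "=", ",")):
    """Split an annotated header into a sequence ID and a dict of fields.

    delims holds, in order, the separators between annotation blocks,
    between a field name and its value, and between values within a field.
    The defaults here are assumed for the example only.
    """
    block_sep, field_sep, value_sep = delims
    blocks = header.split(block_sep)
    seq_id, fields = blocks[0], {}
    for block in blocks[1:]:
        # Everything before the first field separator is the field name.
        name, _, value = block.partition(field_sep)
        fields[name] = value.split(value_sep)
    return seq_id, fields


# Hypothetical header: an ID followed by two annotation blocks.
seq_id, fields = parse_header("SEQ1|PRIMER=IgG,IgM|DUPCOUNT=2")
print(seq_id, fields)
```

Passing a different triple, for example delims=(" ", ":", ";"), would apply the same logic to headers that use other separators.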

--outdir <out_dir>

Specify to change the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

-n <max_missing>

Maximum number of missing nucleotides to consider for collapsing sequences. A sequence will be considered undetermined if it contains too many missing nucleotides.

--uf <uniq_fields>

Specifies a set of annotation fields that must match for sequences to be considered duplicates.
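
The effect of requiring annotation fields to match can be sketched as a grouping key. The Python below is a simplified illustration, not the tool's implementation: it compares sequences by exact string equality (it does not attempt any special handling of missing characters), and the record layout, the collapse function, and the PRIMER field are hypothetical.

```python
from collections import defaultdict


def collapse(records, uniq_fields=()):
    """Group records that share a sequence and all uniq_fields values.

    Each record is a dict with "id", "seq" and "fields" keys (an assumed
    layout). The first record of each group is kept and annotated with
    a DUPCOUNT equal to the size of the group.
    """
    groups = defaultdict(list)
    for rec in records:
        key = (rec["seq"].upper(),) + tuple(
            rec["fields"].get(name) for name in uniq_fields
        )
        groups[key].append(rec)

    unique = []
    for members in groups.values():
        first = members[0]
        annotated = dict(first["fields"], DUPCOUNT=len(members))
        unique.append(dict(first, fields=annotated))
    return unique


reads = [
    {"id": "r1", "seq": "ACGT", "fields": {"PRIMER": "IgG"}},
    {"id": "r2", "seq": "ACGT", "fields": {"PRIMER": "IgG"}},
    {"id": "r3", "seq": "ACGT", "fields": {"PRIMER": "IgM"}},
]
print(len(collapse(reads)))              # sequence only
print(len(collapse(reads, ["PRIMER"])))  # sequence plus PRIMER
```

Without a unique field all three reads fall into one group; with PRIMER as a unique field, r3 forms its own group because its annotation differs.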

--cf <copy_fields>

Specifies a set of annotation fields to copy into the unique sequence output.

--act {min,max,sum,set}

List of actions to take for each copy field, defining how the annotations of a set of duplicates are combined into a single value. The actions “min”, “max” and “sum” perform the corresponding mathematical operation on numeric annotations. The action “set” collapses annotations into a comma-delimited list of unique values.
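
The four actions can be sketched in a few lines of Python. This is an illustration of the behavior described above, not the tool's code: the combine and to_number helpers are hypothetical, and sorting the "set" output is a choice made here only to keep the result deterministic.

```python
def to_number(text):
    """Convert a numeric annotation to int when it is whole, else float."""
    value = float(text)
    return int(value) if value.is_integer() else value


def combine(values, action):
    """Combine the annotation values of one copy field into a single value."""
    if action == "set":
        # Unique values joined with commas; sorted for a stable order.
        return ",".join(sorted(set(values)))
    if action not in ("min", "max", "sum"):
        raise ValueError("unknown action: %s" % action)
    numbers = [to_number(v) for v in values]
    return {"min": min, "max": max, "sum": sum}[action](numbers)


# Hypothetical annotations gathered from three duplicate reads.
print(combine(["2", "5", "1"], "sum"))
print(combine(["IgG", "IgM", "IgG"], "set"))
```

Each entry passed to --act pairs with the copy field in the same position of --cf, so a call such as combine would be applied once per field.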

--inner

If specified, exclude consecutive missing characters at either end of the sequence.
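
How -n and --inner might interact can be sketched as follows. This is a rough Python illustration under stated assumptions: the set of missing characters ("N", "-" and "."), the helper names, and the default threshold of 0 are all assumed for the example rather than taken from the tool.

```python
MISSING = "N-."  # assumed set of characters treated as missing


def count_missing(seq, inner=False):
    """Count missing characters, optionally ignoring runs at either end."""
    seq = seq.upper()
    if inner:
        # Drop consecutive missing characters from both ends first.
        seq = seq.strip(MISSING)
    return sum(ch in MISSING for ch in seq)


def is_undetermined(seq, max_missing=0, inner=False):
    """True if the sequence has more missing characters than allowed."""
    return count_missing(seq, inner) > max_missing


read = "NNACGTNACGT--"
print(count_missing(read), count_missing(read, inner=True))
print(is_undetermined(read, max_missing=1))
print(is_undetermined(read, max_missing=1, inner=True))
```

In this example the read exceeds a threshold of 1 when every missing character counts, but passes once the leading and trailing runs are excluded.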

--keepmiss

If specified, sequences with more missing characters than the threshold set by the -n parameter will be written to the unique sequence output file with a DUPCOUNT=1 annotation. If not specified, such sequences will be written to a separate file.

--maxf <max_field>

Specify the field whose maximum value determines the retained sequence; mutually exclusive with --minf.

--minf <min_field>

Specify the field whose minimum value determines the retained sequence; mutually exclusive with --maxf.
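
Choosing the retained sequence from a set of duplicates by a field's maximum or minimum can be sketched like this. The Python below is a hypothetical illustration: the record layout, the pick_representative function, and the CONSCOUNT field are assumptions, and ties resolve to the first record encountered only because that is how Python's max and min behave.

```python
def pick_representative(duplicates, field, use_max=True):
    """Return the duplicate whose numeric annotation in field is largest
    (use_max=True, as with --maxf) or smallest (use_max=False, as with --minf).
    """
    chooser = max if use_max else min
    return chooser(duplicates, key=lambda rec: float(rec["fields"][field]))


# Hypothetical set of duplicate reads with a numeric annotation.
duplicates = [
    {"id": "A", "fields": {"CONSCOUNT": "3"}},
    {"id": "B", "fields": {"CONSCOUNT": "10"}},
    {"id": "C", "fields": {"CONSCOUNT": "1"}},
]
print(pick_representative(duplicates, "CONSCOUNT")["id"])
print(pick_representative(duplicates, "CONSCOUNT", use_max=False)["id"])
```

The values are converted with float so that "10" ranks above "3"; comparing the raw strings would order them lexically instead.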

output files:
collapse-unique
unique sequences. Contains one representative from each set of duplicate sequences. The retained representative is determined by user-defined criteria (see --maxf and --minf).
collapse-duplicate
raw reads which are duplicates of the sequences retained in the collapse-unique file.
collapse-undetermined
raw reads which were excluded from consideration due to having too many N characters in the sequence.
output annotation fields:
DUPCOUNT
total number of sequences within the set of duplicates for each retained unique sequence, i.e. the copy number of each unique sequence within the data file.
<user defined>
annotation fields specified by the --cf parameter.