CollapseSeq.py
Removes duplicate sequences from FASTA/FASTQ files
usage: CollapseSeq.py [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--log LOG_FILE] [--failed]
[--fasta] [--delim DELIMITER DELIMITER DELIMITER]
[-n MAX_MISSING] [--uf UNIQ_FIELDS [UNIQ_FIELDS ...]]
[--cf COPY_FIELDS [COPY_FIELDS ...]]
[--act {min,max,sum,set} [{min,max,sum,set} ...]]
[--inner] [--keepmiss]
[--maxf MAX_FIELD | --minf MIN_FIELD]
- --version
show program’s version number and exit
- -h, --help
show this help message and exit
- -s <seq_files>
A list of FASTA/FASTQ files containing sequences to process.
- -o <out_files>
Explicit output file name(s). Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).
- --outdir <out_dir>
Specify to change the output directory to the given location. The input file directory is used if this is not specified.
- --outname <out_name>
Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
- --log <log_file>
Specify to write verbose logging to a file. May not be specified with multiple input files.
- --failed
If specified, create files containing records that fail processing.
- --fasta
Specify to force output as FASTA rather than FASTQ.
- --delim <delimiter>
A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
- -n <max_missing>
Maximum number of missing nucleotides to consider for collapsing sequences. A sequence will be considered undetermined if it contains too many missing nucleotides. Note, setting a value above 0 will consider ambiguous/missing nucleotides via a distance calculation, but is considerably more computationally expensive, especially on large data sets.
- --uf <uniq_fields>
Specifies a set of annotation fields that must match for sequences to be considered duplicates.
- --cf <copy_fields>
Specifies a set of annotation fields to copy into the unique sequence output.
- --act {min,max,sum,set}
List of actions to take for each copy field, defining how the annotations from a set of duplicates are combined into a single value. The actions “min”, “max”, and “sum” perform the corresponding mathematical operation on numeric annotations. The action “set” collapses annotations into a comma-delimited list of unique values.
- --inner
If specified, exclude consecutive missing characters at either end of the sequence.
- --keepmiss
If specified, sequences with more missing characters than the threshold set by the -n parameter will be written to the unique sequence output file with a DUPCOUNT=1 annotation. If not specified, such sequences will be written to a separate file.
- --maxf <max_field>
Specify the field whose maximum value determines the retained sequence; mutually exclusive with --minf.
- --minf <min_field>
Specify the field whose minimum value determines the retained sequence; mutually exclusive with --maxf.
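The behavior described for --act can be illustrated with a minimal Python sketch. This is not the CollapseSeq.py implementation: the function name `combine_annotations` is hypothetical, and the choices to preserve first-seen order for “set” and to print whole numbers without a decimal point are assumptions made for the example.

```python
def combine_annotations(values, action):
    """Combine one copy field's values across a set of duplicates.

    Hypothetical helper for illustration only; `values` is a list of
    annotation strings and `action` is one of min, max, sum, set.
    """
    if action == "set":
        # Keep each distinct value once, in first-seen order (assumed).
        seen = []
        for value in values:
            if value not in seen:
                seen.append(value)
        return ",".join(seen)

    # The remaining actions only make sense for numeric annotations.
    numbers = [float(value) for value in values]
    if action == "min":
        result = min(numbers)
    elif action == "max":
        result = max(numbers)
    elif action == "sum":
        result = sum(numbers)
    else:
        raise ValueError(f"unknown action: {action}")

    # Render whole numbers without a trailing ".0" (assumed formatting).
    return str(int(result)) if result.is_integer() else str(result)


print(combine_annotations(["2", "5", "3"], "sum"))
print(combine_annotations(["A", "B", "A"], "set"))
```

In this sketch, each field given to --cf would be paired with one action from --act, so that a numeric count field might use “sum” while a categorical field might use “set”.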
- output files:
- collapse-unique
unique sequences. Contains one representative from each set of duplicate sequences. The retained representative is determined by user-defined criteria.
- collapse-duplicate
raw reads which are duplicates of the sequences retained in the collapse-unique file.
- collapse-undetermined
raw reads which were excluded from consideration due to having too many N characters in the sequence.
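How the -n threshold and the --inner flag interact can be sketched as follows. This is an illustration under stated assumptions, not the tool's code: the helper names are hypothetical, and the set of characters treated as missing (`N`, `n`, `-`, `.`) is an assumption, since the text above only mentions N characters.

```python
def count_missing(seq, missing_chars="Nn-.", inner=False):
    """Count missing characters in a sequence.

    `missing_chars` is an assumed set. With inner=True, runs of missing
    characters at either end of the sequence are ignored.
    """
    if inner:
        # str.strip removes any of the given characters from both ends.
        seq = seq.strip(missing_chars)
    return sum(1 for char in seq if char in missing_chars)


def is_undetermined(seq, max_missing=0, inner=False):
    """True when the sequence has more missing characters than allowed."""
    return count_missing(seq, inner=inner) > max_missing


read = "NNACGNTNN"
print(count_missing(read))              # 5
print(count_missing(read, inner=True))  # 1
```

With a threshold of 1, the example read would count as undetermined unless the leading and trailing runs are excluded.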
- output annotation fields:
- DUPCOUNT
total number of sequences within the set of duplicates for each retained unique sequence; that is, the copy number of each unique sequence within the data file.
- <user defined>
annotation fields specified by the --cf parameter.
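To make the DUPCOUNT annotation and the role of --uf concrete, here is a simplified Python sketch of exact-match collapsing. It is not the CollapseSeq.py algorithm: the function names are hypothetical, the delimiters `|`, `=`, and `,` are assumed values for --delim, the first record seen is kept as the representative, and handling of missing characters, --maxf/--minf, and copy fields is left out.

```python
def parse_header(header, delimiter=("|", "=", ",")):
    """Split a header such as 'ID|FIELD=value' into (id, fields).

    The three delimiters are assumed values for illustration.
    """
    block_sep, field_sep, _value_sep = delimiter
    blocks = header.split(block_sep)
    fields = {}
    for block in blocks[1:]:
        name, _, value = block.partition(field_sep)
        fields[name] = value
    return blocks[0], fields


def collapse_exact(records, uniq_fields=()):
    """Group identical sequences and annotate each group with DUPCOUNT.

    `records` is an iterable of (header, sequence) pairs. Sequences are
    duplicates only if every field in `uniq_fields` also matches.
    """
    groups = {}  # key -> [representative header, sequence, count]
    for header, seq in records:
        _, fields = parse_header(header)
        key = (seq, tuple(fields.get(name) for name in uniq_fields))
        if key in groups:
            groups[key][2] += 1
        else:
            groups[key] = [header, seq, 1]
    return [
        (f"{header}|DUPCOUNT={count}", seq)
        for header, seq, count in groups.values()
    ]


records = [
    ("r1|PRIMER=A", "ACGT"),
    ("r2|PRIMER=A", "ACGT"),
    ("r3|PRIMER=B", "ACGT"),
    ("r4|PRIMER=A", "TTTT"),
]
for header, seq in collapse_exact(records, uniq_fields=("PRIMER",)):
    print(header, seq)
```

Without `uniq_fields`, the three ACGT reads form one group with DUPCOUNT=3; requiring PRIMER to match splits them into groups of two and one.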