ClusterSets.py
Cluster sequences by group
usage: ClusterSets.py [--version] [-h] ...
- --version
show program’s version number and exit
- -h, --help
show this help message and exit
- output files:
- cluster-pass
clustered reads.
- cluster-fail
raw reads failing clustering.
- output annotation fields:
- CLUSTER
a numeric cluster identifier defining the within-group cluster.
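As a rough illustration of the CLUSTER output field, the sketch below reads annotations from a sequence header. It assumes a header layout of `ID|FIELD=value|FIELD=value`, with `|`, `=` and `,` as the three delimiters. The layout, the sample header and the helper `parse_header` are assumptions made for illustration and are not part of ClusterSets.py.

```python
def parse_header(header, delim=("|", "=", ",")):
    """Split a header into its ID and a dict of annotation fields.

    Assumed layout: ID|FIELD=value|FIELD=value (hypothetical, for illustration).
    """
    block_sep, field_sep, _value_sep = delim
    # Drop a leading FASTA/FASTQ marker, then split into annotation blocks.
    blocks = header.lstrip(">@").split(block_sep)
    fields = {"ID": blocks[0]}
    for block in blocks[1:]:
        name, _, value = block.partition(field_sep)
        fields[name] = value
    return fields


# Hypothetical header as it might look after clustering.
record = parse_header(">SEQ1|BARCODE=ACGTACGT|CLUSTER=3")
print(record["CLUSTER"])
```

The cluster identifier is kept as a string here, since a prefix may make it non-numeric.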
ClusterSets.py all
Cluster all sequences regardless of annotation.
usage: ClusterSets.py all [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER]
[--nproc NPROC] [-k CLUSTER_FIELD] [--ident IDENT]
[--length LENGTH_RATIO] [--prefix CLUSTER_PREFIX]
[--cluster {usearch,vsearch,cd-hit-est}]
[--mem CLUSTER_MEMORY] [--exec CLUSTER_EXEC]
[--start SEQ_START] [--end SEQ_END]
- --version
show program’s version number and exit
- -h, --help
show this help message and exit
- -s <seq_files>
A list of FASTA/FASTQ files containing sequences to process.
- -o <out_files>
Explicit output file name(s). Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).
- --outdir <out_dir>
Changes the output directory to the location specified. The input file directory is used if this is not specified.
- --outname <out_name>
Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
- --fasta
Specify to force output as FASTA rather than FASTQ.
- --delim <delimiter>
A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
- --nproc <nproc>
The number of simultaneous computational processes to execute (CPU cores to utilize).
- -k <cluster_field>
The name of the output annotation field to add with the cluster information for each sequence.
- --ident <ident>
The sequence identity threshold to use for clustering. Note, how identity is calculated is specific to the clustering application used.
- --length <length_ratio>
The minimum shorter/longer sequence length ratio allowed within a cluster. Setting this value to 1.0 will require identical length matches within clusters. A value of 0.0 will allow clusters containing any length of substring.
- --prefix <cluster_prefix>
A string to use as the prefix for each cluster identifier. By default, cluster identifiers will be numeric values only.
- --cluster {usearch,vsearch,cd-hit-est}
The clustering tool to use for assigning clusters. Must be one of usearch, vsearch or cd-hit-est. Note, for cd-hit-est the default maximum memory limit is set to 3GB.
- --mem <cluster_memory>
The maximum memory limit for cd-hit-est in MB. Ignored if using usearch or vsearch.
- --exec <cluster_exec>
The name or path of the usearch, vsearch or cd-hit-est executable.
- --start <seq_start>
The start of the region to be used for clustering. Together with --end, this parameter can be used to specify a subsequence of each read to use in the clustering algorithm.
- --end <seq_end>
The end of the region to be used for clustering.
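The length ratio and region options above can be pictured with a small sketch. The helpers `length_ratio` and `clustering_region` are hypothetical and are not taken from ClusterSets.py. The sketch also assumes that the start and end positions behave like 0-based Python slice bounds, which the option descriptions do not state.

```python
def length_ratio(seq_a, seq_b):
    """Return the shorter/longer length ratio of two sequences."""
    shorter, longer = sorted((len(seq_a), len(seq_b)))
    return shorter / longer if longer else 1.0


def clustering_region(seq, start=None, end=None):
    """Return the part of a read used for clustering.

    Assumption: start and end act as 0-based Python slice bounds.
    """
    return seq[start:end]


read_a = "ACGTACGTAC"  # 10 bases
read_b = "ACGTACGT"    # 8 bases
min_ratio = 0.9        # hypothetical --length value

ratio = length_ratio(read_a, read_b)
# The pair may share a cluster only if the ratio reaches the threshold.
print(ratio, ratio >= min_ratio)
print(clustering_region(read_a, start=2, end=6))
```

With a threshold of 1.0 only equal-length reads would pass this check, and with 0.0 every pair would pass, matching the two extremes described for --length.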
ClusterSets.py barcode
Cluster reads by clustering barcode sequences.
usage: ClusterSets.py barcode [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]]
[--outdir OUT_DIR] [--outname OUT_NAME]
[--fasta]
[--delim DELIMITER DELIMITER DELIMITER]
[--nproc NPROC] [-k CLUSTER_FIELD]
[--ident IDENT] [--length LENGTH_RATIO]
[--prefix CLUSTER_PREFIX]
[--cluster {usearch,vsearch,cd-hit-est}]
[--mem CLUSTER_MEMORY] [--exec CLUSTER_EXEC]
[-f BARCODE_FIELD]
- --version
show program’s version number and exit
- -h, --help
show this help message and exit
- -s <seq_files>
A list of FASTA/FASTQ files containing sequences to process.
- -o <out_files>
Explicit output file name(s). Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).
- --outdir <out_dir>
Changes the output directory to the location specified. The input file directory is used if this is not specified.
- --outname <out_name>
Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
- --fasta
Specify to force output as FASTA rather than FASTQ.
- --delim <delimiter>
A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
- --nproc <nproc>
The number of simultaneous computational processes to execute (CPU cores to utilize).
- -k <cluster_field>
The name of the output annotation field to add with the cluster information for each sequence.
- --ident <ident>
The sequence identity threshold to use for clustering. Note, how identity is calculated is specific to the clustering application used.
- --length <length_ratio>
The minimum shorter/longer sequence length ratio allowed within a cluster. Setting this value to 1.0 will require identical length matches within clusters. A value of 0.0 will allow clusters containing any length of substring.
- --prefix <cluster_prefix>
A string to use as the prefix for each cluster identifier. By default, cluster identifiers will be numeric values only.
- --cluster {usearch,vsearch,cd-hit-est}
The clustering tool to use for assigning clusters. Must be one of usearch, vsearch or cd-hit-est. Note, for cd-hit-est the default maximum memory limit is set to 3GB.
- --mem <cluster_memory>
The maximum memory limit for cd-hit-est in MB. Ignored if using usearch or vsearch.
- --exec <cluster_exec>
The name or path of the usearch, vsearch or cd-hit-est executable.
- -f <barcode_field>
The annotation field containing barcode sequences.
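One way to drive the barcode subcommand from a script is to assemble its argument list in Python. The sketch below uses only flags that appear in the usage text above. The file name, field names and identity threshold are hypothetical values, and `barcode_command` is an illustrative helper rather than part of the tool.

```python
import shlex


def barcode_command(seq_file, barcode_field, ident, cluster_field="CLUSTER"):
    """Assemble an argument list for `ClusterSets.py barcode`.

    Only flags listed in the usage text are used; all values are
    supplied by the caller (the cluster_field default is an assumption).
    """
    return [
        "ClusterSets.py", "barcode",
        "-s", seq_file,
        "-f", barcode_field,
        "-k", cluster_field,
        "--ident", str(ident),
    ]


# Hypothetical file name, field name and threshold.
cmd = barcode_command("reads.fastq", "BARCODE", 0.9)
print(shlex.join(cmd))
# To execute: subprocess.run(cmd, check=True), with ClusterSets.py on PATH.
```

Building the command as a list avoids shell quoting problems when file names contain spaces.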
ClusterSets.py set
Cluster sequences within annotation sets.
usage: ClusterSets.py set [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--log LOG_FILE] [--failed]
[--fasta] [--delim DELIMITER DELIMITER DELIMITER]
[--nproc NPROC] [-k CLUSTER_FIELD] [--ident IDENT]
[--length LENGTH_RATIO] [--prefix CLUSTER_PREFIX]
[--cluster {usearch,vsearch,cd-hit-est}]
[--mem CLUSTER_MEMORY] [--exec CLUSTER_EXEC]
[-f SET_FIELD] [--start SEQ_START] [--end SEQ_END]
- --version
show program’s version number and exit
- -h, --help
show this help message and exit
- -s <seq_files>
A list of FASTA/FASTQ files containing sequences to process.
- -o <out_files>
Explicit output file name(s). Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).
- --outdir <out_dir>
Changes the output directory to the location specified. The input file directory is used if this is not specified.
- --outname <out_name>
Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
- --log <log_file>
Specify to write verbose logging to a file. May not be specified with multiple input files.
- --failed
If specified, create files containing records that fail processing.
- --fasta
Specify to force output as FASTA rather than FASTQ.
- --delim <delimiter>
A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
- --nproc <nproc>
The number of simultaneous computational processes to execute (CPU cores to utilize).
- -k <cluster_field>
The name of the output annotation field to add with the cluster information for each sequence.
- --ident <ident>
The sequence identity threshold to use for clustering. Note, how identity is calculated is specific to the clustering application used.
- --length <length_ratio>
The minimum shorter/longer sequence length ratio allowed within a cluster. Setting this value to 1.0 will require identical length matches within clusters. A value of 0.0 will allow clusters containing any length of substring.
- --prefix <cluster_prefix>
A string to use as the prefix for each cluster identifier. By default, cluster identifiers will be numeric values only.
- --cluster {usearch,vsearch,cd-hit-est}
The clustering tool to use for assigning clusters. Must be one of usearch, vsearch or cd-hit-est. Note, for cd-hit-est the default maximum memory limit is set to 3GB.
- --mem <cluster_memory>
The maximum memory limit for cd-hit-est in MB. Ignored if using usearch or vsearch.
- --exec <cluster_exec>
The name or path of the usearch, vsearch or cd-hit-est executable.
- -f <set_field>
The annotation field containing annotations, such as UMI barcode, for sequence grouping.
- --start <seq_start>
The start of the region to be used for clustering. Together with --end, this parameter can be used to specify a subsequence of each read to use in the clustering algorithm.
- --end <seq_end>
The end of the region to be used for clustering.
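Clustering within annotation sets implies that reads are first grouped by the value of the set field. The sketch below shows one possible grouping step, assuming headers of the form `ID|FIELD=value` with `|` and `=` as delimiters. The header layout, the sample headers and the helper `group_by_field` are assumptions for illustration and do not describe the tool's internals.

```python
from collections import defaultdict


def group_by_field(headers, set_field, delim=("|", "=", ",")):
    """Group header IDs by the value of one annotation field.

    Assumed layout: ID|FIELD=value|... (hypothetical, for illustration).
    Headers lacking the field are collected under None.
    """
    block_sep, field_sep, _ = delim
    groups = defaultdict(list)
    for header in headers:
        seq_id, *blocks = header.split(block_sep)
        fields = dict(b.split(field_sep, 1) for b in blocks if field_sep in b)
        groups[fields.get(set_field)].append(seq_id)
    return dict(groups)


# Hypothetical headers carrying a UMI barcode annotation.
headers = [
    "SEQ1|BARCODE=AAAA",
    "SEQ2|BARCODE=CCCC",
    "SEQ3|BARCODE=AAAA",
]
for value, ids in group_by_field(headers, "BARCODE").items():
    print(value, ids)
```

Each resulting group would then be clustered on its own, so a cluster identifier is only meaningful together with the value of the set field.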