ClusterSets

Cluster sequences by group

usage: ClusterSets [--version] [-h]  ...
--version

show program’s version number and exit

-h, --help

show this help message and exit

output files:
    cluster-pass
        clustered reads.
    cluster-fail
        raw reads failing clustering.

output annotation fields:
    CLUSTER
        a numeric cluster identifier defining the within-group cluster.

ClusterSets all

Cluster all sequences regardless of annotation.

usage: ClusterSets all [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                       [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                       [--outname OUT_NAME] [--fasta]
                       [--delim DELIMITER DELIMITER DELIMITER] [--nproc NPROC]
                       [-k CLUSTER_FIELD] [--ident IDENT]
                       [--length LENGTH_RATIO] [--prefix CLUSTER_PREFIX]
                       [--cluster {usearch,vsearch,cd-hit-est}]
                       [--exec CLUSTER_EXEC] [--start SEQ_START]
                       [--end SEQ_END]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). This argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, the output filename will be based on the input filename(s).

--outdir <out_dir>

Changes the output directory to the specified location. If unspecified, the input file directory is used.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
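The role of the three delimiters can be illustrated with a short Python sketch. It assumes a header of the form `ID|FIELD=value|FIELD=v1,v2` and the delimiter triple `|`, `=`, `,`; the function name `parse_header` and the example header are hypothetical and are not part of ClusterSets.

```python
def parse_header(header, delimiter=("|", "=", ",")):
    """Split an annotated header into its ID and a dict of field values.

    Assumes the first block is the sequence ID and every later block
    has the form FIELD<sep>VALUE[<sep>VALUE...].
    """
    block_sep, field_sep, value_sep = delimiter
    blocks = header.split(block_sep)
    seq_id, fields = blocks[0], {}
    for block in blocks[1:]:
        # Split the block into a field name and its raw value string.
        name, _, value = block.partition(field_sep)
        # A field may hold several values joined by the third delimiter.
        fields[name] = value.split(value_sep)
    return seq_id, fields


seq_id, fields = parse_header("SEQ1|BARCODE=ACGTACGT|PRIMER=C1,C2")
```

Passing a different triple as `delimiter` mirrors what supplying three values to --delim would mean for headers that use other separator characters.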

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilize).

-k <cluster_field>

The name of the output annotation field to add with the cluster information for each sequence.

--ident <ident>

The sequence identity threshold to use for clustering. Note, how identity is calculated is specific to the clustering application used.

--length <length_ratio>

The minimum shorter/longer sequence length ratio allowed within a cluster. Setting this value to 1.0 will require identical length matches within clusters. A value of 0.0 will allow clusters containing any length of substring.
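As a rough illustration of the shorter/longer length ratio, the following Python sketch computes it for a pair of sequences and compares it against a threshold. The helper names are hypothetical, and treating the threshold as inclusive is an assumption; the underlying clustering tool may apply the ratio differently.

```python
def length_ratio(seq_a, seq_b):
    """Return the shorter/longer length ratio of two sequences."""
    shorter, longer = sorted((len(seq_a), len(seq_b)))
    # Two empty sequences are treated as equal length.
    return shorter / longer if longer else 1.0


def passes_length(seq_a, seq_b, min_ratio):
    """True if the pair satisfies the minimum length ratio (assumed inclusive)."""
    return length_ratio(seq_a, seq_b) >= min_ratio


ratio = length_ratio("ACGTACGT", "ACGT")              # 4 / 8
same_length_only = passes_length("ACGTACGT", "ACGT", 1.0)
any_length = passes_length("ACGTACGT", "ACGT", 0.0)
```

With a threshold of 1.0 only equal-length pairs pass, and with 0.0 every pair passes, which matches the two extremes described above.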

--prefix <cluster_prefix>

A string to use as the prefix for each cluster identifier. By default, cluster identifiers will be numeric values only.
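To show how the output field name and the cluster prefix might combine, here is a minimal Python sketch that appends a cluster annotation to a header. The function `annotate_cluster`, the example header, and the `|` and `=` separators are assumptions made for illustration; only the field name `CLUSTER` comes from the documentation above.

```python
def annotate_cluster(header, cluster_number, field="CLUSTER", prefix="",
                     delimiter=("|", "=", ",")):
    """Append FIELD=<prefix><number> to an annotated header."""
    block_sep, field_sep, _ = delimiter
    # With an empty prefix the identifier is purely numeric.
    cluster_id = f"{prefix}{cluster_number}"
    return f"{header}{block_sep}{field}{field_sep}{cluster_id}"


numeric = annotate_cluster("SEQ1|BARCODE=ACGT", 3)
prefixed = annotate_cluster("SEQ1|BARCODE=ACGT", 3, field="GROUP", prefix="C")
```

Here `numeric` ends in `CLUSTER=3`, while `prefixed` ends in `GROUP=C3`, corresponding to a custom field name and a non-empty prefix.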

--cluster {usearch,vsearch,cd-hit-est}

The clustering tool to use for assigning clusters. Must be one of usearch, vsearch or cd-hit-est. Note, for cd-hit-est the maximum memory limit is set to 3GB.

--exec <cluster_exec>

The name or path of the usearch, vsearch or cd-hit-est executable.

--start <seq_start>

The start of the region to be used for clustering. Together with --end, this parameter can be used to specify a subsequence of each read to use in the clustering algorithm.

--end <seq_end>

The end of the region to be used for clustering.
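The effect of restricting clustering to a region can be sketched in Python as a simple slice. The documentation above does not state whether positions are 0-based or 1-based, or whether the end is inclusive, so this sketch assumes Python slice semantics (0-based start, exclusive end); `clustering_region` is a hypothetical helper.

```python
def clustering_region(sequence, start=0, end=None):
    """Return the part of a read that would be passed to the clusterer.

    Assumes Python slice semantics; the full read is returned when
    neither bound is given.
    """
    return sequence[start:end]


read = "AACCGGTT"
region = clustering_region(read, start=2, end=6)
whole = clustering_region(read)
```

Only the region is compared during clustering in this sketch; the read itself is left unchanged.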

ClusterSets barcode

Cluster reads by clustering barcode sequences.

usage: ClusterSets barcode [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                           [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                           [--outname OUT_NAME] [--fasta]
                           [--delim DELIMITER DELIMITER DELIMITER]
                           [--nproc NPROC] [-k CLUSTER_FIELD] [--ident IDENT]
                           [--length LENGTH_RATIO] [--prefix CLUSTER_PREFIX]
                           [--cluster {usearch,vsearch,cd-hit-est}]
                           [--exec CLUSTER_EXEC] [-f BARCODE_FIELD]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). This argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, the output filename will be based on the input filename(s).

--outdir <out_dir>

Changes the output directory to the specified location. If unspecified, the input file directory is used.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilize).

-k <cluster_field>

The name of the output annotation field to add with the cluster information for each sequence.

--ident <ident>

The sequence identity threshold to use for clustering. Note, how identity is calculated is specific to the clustering application used.

--length <length_ratio>

The minimum shorter/longer sequence length ratio allowed within a cluster. Setting this value to 1.0 will require identical length matches within clusters. A value of 0.0 will allow clusters containing any length of substring.

--prefix <cluster_prefix>

A string to use as the prefix for each cluster identifier. By default, cluster identifiers will be numeric values only.

--cluster {usearch,vsearch,cd-hit-est}

The clustering tool to use for assigning clusters. Must be one of usearch, vsearch or cd-hit-est. Note, for cd-hit-est the maximum memory limit is set to 3GB.

--exec <cluster_exec>

The name or path of the usearch, vsearch or cd-hit-est executable.

-f <barcode_field>

The annotation field containing barcode sequences.
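The idea of clustering reads through their barcodes can be sketched in Python as follows. This is not how ClusterSets is implemented: the real clustering is delegated to usearch, vsearch or cd-hit-est, whereas this sketch substitutes a simple greedy centroid scheme with a position-wise identity measure. All function names and the example reads are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions; 0.0 when lengths differ (a simplification)."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_barcodes(barcodes, ident=0.9):
    """Greedy centroid clustering: each barcode joins the first centroid it
    matches at >= ident, otherwise it founds a new cluster."""
    centroids = []   # one representative barcode per cluster
    assignment = {}  # barcode -> cluster number (1-based)
    for barcode in sorted(set(barcodes)):
        for number, centroid in enumerate(centroids, start=1):
            if identity(barcode, centroid) >= ident:
                assignment[barcode] = number
                break
        else:
            # No centroid was close enough, so start a new cluster.
            centroids.append(barcode)
            assignment[barcode] = len(centroids)
    return assignment


reads = [
    ("read1", "AAAAAAAAAA"),
    ("read2", "AAAAAAAAAT"),  # one mismatch from read1's barcode
    ("read3", "CCCCCCCCCC"),
]
clusters = cluster_barcodes([barcode for _, barcode in reads], ident=0.9)
# Each read inherits the cluster number of its barcode.
read_clusters = {name: clusters[barcode] for name, barcode in reads}
```

In this example the first two reads share a cluster because their barcodes differ at one of ten positions, while the third read falls into a cluster of its own.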

ClusterSets set

Cluster sequences within annotation sets.

usage: ClusterSets set [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                       [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                       [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                       [--fasta] [--delim DELIMITER DELIMITER DELIMITER]
                       [--nproc NPROC] [-k CLUSTER_FIELD] [--ident IDENT]
                       [--length LENGTH_RATIO] [--prefix CLUSTER_PREFIX]
                       [--cluster {usearch,vsearch,cd-hit-est}]
                       [--exec CLUSTER_EXEC] [-f SET_FIELD]
                       [--start SEQ_START] [--end SEQ_END]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). This argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, the output filename will be based on the input filename(s).

--outdir <out_dir>

Changes the output directory to the specified location. If unspecified, the input file directory is used.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified, create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilize).

-k <cluster_field>

The name of the output annotation field to add with the cluster information for each sequence.

--ident <ident>

The sequence identity threshold to use for clustering. Note, how identity is calculated is specific to the clustering application used.

--length <length_ratio>

The minimum shorter/longer sequence length ratio allowed within a cluster. Setting this value to 1.0 will require identical length matches within clusters. A value of 0.0 will allow clusters containing any length of substring.

--prefix <cluster_prefix>

A string to use as the prefix for each cluster identifier. By default, cluster identifiers will be numeric values only.

--cluster {usearch,vsearch,cd-hit-est}

The clustering tool to use for assigning clusters. Must be one of usearch, vsearch or cd-hit-est. Note, for cd-hit-est the maximum memory limit is set to 3GB.

--exec <cluster_exec>

The name or path of the usearch, vsearch or cd-hit-est executable.

-f <set_field>

The annotation field containing the values, such as UMI barcodes, used to group sequences.

--start <seq_start>

The start of the region to be used for clustering. Together with --end, this parameter can be used to specify a subsequence of each read to use in the clustering algorithm.

--end <seq_end>

The end of the region to be used for clustering.
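Clustering within annotation sets can be pictured as grouping reads by the value of the set field and then clustering each group independently, so that cluster numbers are only meaningful together with the set value. The Python sketch below illustrates that idea under stated assumptions: `cluster_within_sets` and `exact_match_clusters` are hypothetical names, and the exact-match clusterer is a placeholder for the external tool.

```python
from collections import defaultdict


def cluster_within_sets(records, cluster_func):
    """Cluster sequences separately within each annotation set.

    records: iterable of (read_id, set_value, sequence).
    cluster_func: maps a list of sequences to a list of cluster numbers.
    Returns {read_id: (set_value, cluster_number)}.
    """
    groups = defaultdict(list)
    for read_id, set_value, sequence in records:
        groups[set_value].append((read_id, sequence))

    result = {}
    for set_value, members in groups.items():
        # Cluster numbering restarts for every set.
        numbers = cluster_func([seq for _, seq in members])
        for (read_id, _), number in zip(members, numbers):
            result[read_id] = (set_value, number)
    return result


def exact_match_clusters(sequences):
    """Placeholder clusterer: identical sequences share a number, starting at 1."""
    seen = {}
    return [seen.setdefault(seq, len(seen) + 1) for seq in sequences]


records = [
    ("r1", "UMI1", "ACGT"),
    ("r2", "UMI1", "ACGT"),
    ("r3", "UMI1", "TTTT"),
    ("r4", "UMI2", "GGGG"),
]
assigned = cluster_within_sets(records, exact_match_clusters)
```

Reads `r1` and `r4` both receive cluster number 1 here, but they belong to different sets, which is why the identifier is described as defining the within-group cluster.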