ClusterSets

Cluster sequences by group

usage: ClusterSets [--version] [-h]  ...

--version: show program’s version number and exit

-h, --help: show this help message and exit

output files:

cluster-pass: clustered reads.
cluster-fail: raw reads failing clustering.

output annotation fields:

CLUSTER: a numeric cluster identifier defining the within-group cluster.

ClusterSets all

Cluster all sequences regardless of annotation.

usage: ClusterSets all [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                       [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                       [--outname OUT_NAME] [--fasta]
                       [--delim DELIMITER DELIMITER DELIMITER] [--nproc NPROC]
                       [-k CLUSTER_FIELD] [--ident IDENT]
                       [--length LENGTH_RATIO] [--prefix CLUSTER_PREFIX]
                       [--cluster {usearch,vsearch,cd-hit-est}]
                       [--exec CLUSTER_EXEC] [--start SEQ_START]
                       [--end SEQ_END]

--version: show program’s version number and exit

-h, --help: show this help message and exit

-s <seq_files>: A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>: Explicit output file name(s). Note that this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, the output filename will be based on the input filename(s).

--outdir <out_dir>: Changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>: Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--fasta: Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>: A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
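
As an illustration of how the three delimiters partition a sequence header, the following minimal sketch splits a header into an identifier and a dictionary of fields. The delimiter values "|", "=", and ",", the sample header, and the helper name parse_header are assumptions made for this example and are not part of ClusterSets itself.

```python
def parse_header(header, delim=("|", "=", ",")):
    """Split a header into (sequence_id, {field: [values]}).

    delim holds three separators: between annotation blocks, between a
    field name and its value, and between values within a field.
    The default used here is an assumption for this sketch.
    """
    block_sep, field_sep, value_sep = delim
    blocks = header.split(block_sep)
    seq_id, fields = blocks[0], {}
    for block in blocks[1:]:
        # partition() splits on the first field separator only
        name, _, value = block.partition(field_sep)
        fields[name] = value.split(value_sep)
    return seq_id, fields


# Hypothetical header carrying a CLUSTER annotation
seq_id, fields = parse_header("READ1|BARCODE=ACGT|PRIMER=V1,V2|CLUSTER=7")
print(seq_id, fields["CLUSTER"], fields["PRIMER"])
```

Passing a different three-element tuple as delim mirrors supplying three values to --delim.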

--nproc <nproc>: The number of simultaneous computational processes to execute (CPU cores to utilize).

-k <cluster_field>: The name of the output annotation field to add with the cluster information for each sequence.

--ident <ident>: The sequence identity threshold to use for clustering. Note that how identity is calculated is specific to the clustering application used.

--length <length_ratio>: The minimum shorter/longer sequence length ratio allowed within a cluster. Setting this value to 1.0 will require identical length matches within clusters. A value of 0.0 will allow clusters containing any length of substring.
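
The shorter/longer ratio described for --length can be sketched as a simple pairwise check. This is only an illustration of the arithmetic; the helper name length_ratio_ok is hypothetical, and the actual filtering is performed by the selected clustering tool.

```python
def length_ratio_ok(seq_a, seq_b, min_ratio):
    """Return True if the shorter/longer length ratio is at least min_ratio."""
    shorter, longer = sorted((len(seq_a), len(seq_b)))
    if longer == 0:
        return True  # two empty sequences trivially have equal length
    return shorter / longer >= min_ratio


print(length_ratio_ok("ACGT", "ACGTAC", 0.5))  # ratio is 4/6
print(length_ratio_ok("ACGT", "ACGTAC", 1.0))  # lengths differ
print(length_ratio_ok("A", "ACGTACGT", 0.0))   # any length accepted
```

With a ratio of 1.0 only equal-length pairs pass, and with 0.0 every pair passes, matching the two extremes described above.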

--prefix <cluster_prefix>: A string to use as the prefix for each cluster identifier. By default, cluster identifiers will be numeric values only.

--cluster {usearch,vsearch,cd-hit-est}: The clustering tool to use for assigning clusters. Must be one of usearch, vsearch or cd-hit-est. Note, for cd-hit-est the maximum memory limit is set to 3GB.

--exec <cluster_exec>: The name or path of the usearch, vsearch or cd-hit-est executable.
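
Before choosing values for --cluster and --exec, it can help to check which of the supported tools is actually installed. The sketch below uses the standard library's shutil.which to search PATH; the helper name find_cluster_exec and the search order are assumptions for illustration, not behavior of ClusterSets.

```python
import shutil


def find_cluster_exec(candidates=("usearch", "vsearch", "cd-hit-est")):
    """Return (tool, path) for the first candidate found on PATH, else None."""
    for tool in candidates:
        path = shutil.which(tool)
        if path is not None:
            return tool, path
    return None


found = find_cluster_exec()
if found is None:
    print("no supported clustering tool found on PATH")
else:
    tool, path = found
    print(f"--cluster {tool} --exec {path}")
```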

--start <seq_start>: The start of the region to be used for clustering. Together with --end, this parameter can be used to specify a subsequence of each read to use in the clustering algorithm.

--end <seq_end>: The end of the region to be used for clustering.
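
The effect of restricting clustering to a region can be pictured as slicing each read before comparison. The coordinate convention is not stated here, so this sketch assumes a zero-based start and an exclusive end, as in Python slicing; verify the convention against the tool before relying on specific positions. The helper name clustering_region is hypothetical.

```python
def clustering_region(sequence, start=0, end=None):
    """Return the part of the read that would be used for clustering.

    Assumes zero-based start and exclusive end (Python slice semantics),
    which is an assumption of this sketch.
    """
    return sequence[start:end]


read = "ACGTACGTAC"
print(clustering_region(read, 2, 6))
print(clustering_region(read))  # no bounds given: the whole read
```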

ClusterSets barcode

Cluster reads by clustering barcode sequences.

usage: ClusterSets barcode [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                           [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                           [--outname OUT_NAME] [--fasta]
                           [--delim DELIMITER DELIMITER DELIMITER]
                           [--nproc NPROC] [-k CLUSTER_FIELD] [--ident IDENT]
                           [--length LENGTH_RATIO] [--prefix CLUSTER_PREFIX]
                           [--cluster {usearch,vsearch,cd-hit-est}]
                           [--exec CLUSTER_EXEC] [-f BARCODE_FIELD]

--version: show program’s version number and exit

-h, --help: show this help message and exit

-s <seq_files>: A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>: Explicit output file name(s). Note that this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, the output filename will be based on the input filename(s).

--outdir <out_dir>: Changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>: Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--fasta: Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>: A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>: The number of simultaneous computational processes to execute (CPU cores to utilize).

-k <cluster_field>: The name of the output annotation field to add with the cluster information for each sequence.

--ident <ident>: The sequence identity threshold to use for clustering. Note that how identity is calculated is specific to the clustering application used.

--length <length_ratio>: The minimum shorter/longer sequence length ratio allowed within a cluster. Setting this value to 1.0 will require identical length matches within clusters. A value of 0.0 will allow clusters containing any length of substring.

--prefix <cluster_prefix>: A string to use as the prefix for each cluster identifier. By default, cluster identifiers will be numeric values only.

--cluster {usearch,vsearch,cd-hit-est}: The clustering tool to use for assigning clusters. Must be one of usearch, vsearch or cd-hit-est. Note, for cd-hit-est the maximum memory limit is set to 3GB.

--exec <cluster_exec>: The name or path of the usearch, vsearch or cd-hit-est executable.

-f <barcode_field>: The annotation field containing barcode sequences.
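
When driving the barcode subcommand from a script, the arguments can be assembled as a list and inspected before anything is run. The sketch below only builds and prints the command; the file name reads.fastq, the field names BARCODE and CLUSTER, the threshold 0.9, and the helper name barcode_command are placeholders chosen for this example. Only options listed in the usage above are used.

```python
import shlex


def barcode_command(seq_files, barcode_field, cluster_field=None,
                    ident=None, tool=None):
    """Build an argument list for 'ClusterSets barcode' without running it."""
    args = ["ClusterSets", "barcode", "-s", *seq_files, "-f", barcode_field]
    if cluster_field is not None:
        args += ["-k", cluster_field]
    if ident is not None:
        args += ["--ident", str(ident)]
    if tool is not None:
        args += ["--cluster", tool]
    return args


cmd = barcode_command(["reads.fastq"], "BARCODE",
                      cluster_field="CLUSTER", ident=0.9, tool="vsearch")
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run once the inputs and the clustering tool are in place.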

ClusterSets set

Cluster sequences within annotation sets.

usage: ClusterSets set [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                       [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                       [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                       [--fasta] [--delim DELIMITER DELIMITER DELIMITER]
                       [--nproc NPROC] [-k CLUSTER_FIELD] [--ident IDENT]
                       [--length LENGTH_RATIO] [--prefix CLUSTER_PREFIX]
                       [--cluster {usearch,vsearch,cd-hit-est}]
                       [--exec CLUSTER_EXEC] [-f SET_FIELD]
                       [--start SEQ_START] [--end SEQ_END]

--version: show program’s version number and exit

-h, --help: show this help message and exit

-s <seq_files>: A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>: Explicit output file name(s). Note that this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, the output filename will be based on the input filename(s).

--outdir <out_dir>: Changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>: Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>: Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed: If specified, creates files containing records that fail processing.

--fasta: Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>: A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>: The number of simultaneous computational processes to execute (CPU cores to utilize).

-k <cluster_field>: The name of the output annotation field to add with the cluster information for each sequence.

--ident <ident>: The sequence identity threshold to use for clustering. Note that how identity is calculated is specific to the clustering application used.

--length <length_ratio>: The minimum shorter/longer sequence length ratio allowed within a cluster. Setting this value to 1.0 will require identical length matches within clusters. A value of 0.0 will allow clusters containing any length of substring.

--prefix <cluster_prefix>: A string to use as the prefix for each cluster identifier. By default, cluster identifiers will be numeric values only.

--cluster {usearch,vsearch,cd-hit-est}: The clustering tool to use for assigning clusters. Must be one of usearch, vsearch or cd-hit-est. Note, for cd-hit-est the maximum memory limit is set to 3GB.

--exec <cluster_exec>: The name or path of the usearch, vsearch or cd-hit-est executable.

-f <set_field>: The annotation field containing annotations, such as UMI barcode, for sequence grouping.

--start <seq_start>: The start of the region to be used for clustering. Together with --end, this parameter can be used to specify a subsequence of each read to use in the clustering algorithm.

--end <seq_end>: The end of the region to be used for clustering.
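
Clustering "within annotation sets" can be pictured as grouping records by the set field and numbering clusters independently inside each group. The sketch below is a toy model of that idea only: it treats identical sequences as one cluster, which is a stand-in for the identity-based clustering performed by usearch, vsearch, or cd-hit-est, and the helper name cluster_within_sets and the sample records are hypothetical.

```python
from collections import defaultdict


def cluster_within_sets(records):
    """Assign within-set cluster ids to (set_value, sequence) records.

    Cluster ids restart at 1 in every set. Identical sequences share a
    cluster, a toy stand-in for a real clustering tool.
    """
    ids_by_set = defaultdict(dict)
    result = []
    for set_value, sequence in records:
        ids = ids_by_set[set_value]
        # Reuse the id of a previously seen sequence, else take the next one
        cluster_id = ids.setdefault(sequence, len(ids) + 1)
        result.append((set_value, sequence, cluster_id))
    return result


records = [("UMI1", "ACGT"), ("UMI1", "ACGT"),
           ("UMI1", "TTTT"), ("UMI2", "TTTT")]
for row in cluster_within_sets(records):
    print(row)
```

The same sequence can receive different cluster numbers in different sets, which is why the CLUSTER value is meaningful only in combination with the set field.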