ClusterSets.py

Cluster sequences by group

usage: ClusterSets.py [--version] [-h]  ...
--version

show program’s version number and exit

-h, --help

show this help message and exit

output files:
cluster-pass

clustered reads.

cluster-fail

raw reads failing clustering.

output annotation fields:
CLUSTER

a numeric cluster identifier defining the within-group cluster.

ClusterSets.py all

Cluster all sequences regardless of annotation.

usage: ClusterSets.py all [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                          [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                          [--outname OUT_NAME] [--fasta]
                          [--delim DELIMITER DELIMITER DELIMITER]
                          [--nproc NPROC] [-k CLUSTER_FIELD] [--ident IDENT]
                          [--length LENGTH_RATIO] [--prefix CLUSTER_PREFIX]
                          [--cluster {usearch,vsearch,cd-hit-est}]
                          [--mem CLUSTER_MEMORY] [--exec CLUSTER_EXEC]
                          [--start SEQ_START] [--end SEQ_END]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note that this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, the output filename will be based on the input filename(s).

--outdir <out_dir>

Changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilize).

-k <cluster_field>

The name of the output annotation field to add with the cluster information for each sequence.

--ident <ident>

The sequence identity threshold to use for clustering. Note that how identity is calculated depends on the clustering application used.

--length <length_ratio>

The minimum shorter/longer sequence length ratio allowed within a cluster. Setting this value to 1.0 will require identical length matches within clusters. A value of 0.0 will allow clusters containing any length of substring.

--prefix <cluster_prefix>

A string to use as the prefix for each cluster identifier. By default, cluster identifiers will be numeric values only.

--cluster {usearch,vsearch,cd-hit-est}

The clustering tool to use for assigning clusters. Must be one of usearch, vsearch or cd-hit-est. Note that for cd-hit-est the default maximum memory limit is set to 3 GB.

--mem <cluster_memory>

The maximum memory limit for cd-hit-est in MB. Ignored if using usearch or vsearch.

--exec <cluster_exec>

The name or path of the usearch, vsearch or cd-hit-est executable.

--start <seq_start>

The start of the region to be used for clustering. Together with --end, this parameter can be used to specify a subsequence of each read to use in the clustering algorithm.

--end <seq_end>

The end of the region to be used for clustering.
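
The --length, --start and --end options described above can be illustrated with a minimal Python sketch. It is not taken from the ClusterSets.py source; the function names are hypothetical, and the 0-based, half-open treatment of start and end is an assumption made only for illustration.

```python
from typing import Optional


def length_ratio(seq_a: str, seq_b: str) -> float:
    """Return the shorter/longer length ratio of two sequences."""
    shorter, longer = sorted((len(seq_a), len(seq_b)))
    # Two empty sequences are treated as identical in length.
    return shorter / longer if longer else 1.0


def passes_length_filter(seq_a: str, seq_b: str, min_ratio: float) -> bool:
    """Check a pair of sequences against a minimum length ratio."""
    return length_ratio(seq_a, seq_b) >= min_ratio


def clustering_region(seq: str, start: int = 0, end: Optional[int] = None) -> str:
    """Return the subsequence used for clustering.

    Assumption: start and end behave like a Python slice (0-based,
    end exclusive). The actual tool may index differently.
    """
    return seq[start:end]
```

With a minimum ratio of 1.0 only equal-length pairs pass, while 0.0 lets any pair through, which matches the behavior described for --length.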

ClusterSets.py barcode

Cluster reads by clustering barcode sequences.

usage: ClusterSets.py barcode [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                              [-o OUT_FILES [OUT_FILES ...]]
                              [--outdir OUT_DIR] [--outname OUT_NAME]
                              [--fasta]
                              [--delim DELIMITER DELIMITER DELIMITER]
                              [--nproc NPROC] [-k CLUSTER_FIELD]
                              [--ident IDENT] [--length LENGTH_RATIO]
                              [--prefix CLUSTER_PREFIX]
                              [--cluster {usearch,vsearch,cd-hit-est}]
                              [--mem CLUSTER_MEMORY] [--exec CLUSTER_EXEC]
                              [-f BARCODE_FIELD]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note that this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, the output filename will be based on the input filename(s).

--outdir <out_dir>

Changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilize).

-k <cluster_field>

The name of the output annotation field to add with the cluster information for each sequence.

--ident <ident>

The sequence identity threshold to use for clustering. Note that how identity is calculated depends on the clustering application used.

--length <length_ratio>

The minimum shorter/longer sequence length ratio allowed within a cluster. Setting this value to 1.0 will require identical length matches within clusters. A value of 0.0 will allow clusters containing any length of substring.

--prefix <cluster_prefix>

A string to use as the prefix for each cluster identifier. By default, cluster identifiers will be numeric values only.

--cluster {usearch,vsearch,cd-hit-est}

The clustering tool to use for assigning clusters. Must be one of usearch, vsearch or cd-hit-est. Note that for cd-hit-est the default maximum memory limit is set to 3 GB.

--mem <cluster_memory>

The maximum memory limit for cd-hit-est in MB. Ignored if using usearch or vsearch.

--exec <cluster_exec>

The name or path of the usearch, vsearch or cd-hit-est executable.

-f <barcode_field>

The annotation field containing barcode sequences.
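
The interplay between --delim and -f can be sketched in Python as follows. This is a rough illustration, not the tool's own parser: the default delimiters ("|", "=", ","), the example header, and the field names BARCODE and PRIMER are all assumptions chosen for the example.

```python
def parse_header(header, delimiters=("|", "=", ",")):
    """Split a sequence header into an ID and annotation fields.

    The three delimiters separate annotation blocks, field names from
    values, and values within a field, respectively. The defaults used
    here are an assumption.
    """
    block_delim, field_delim, value_delim = delimiters
    blocks = header.split(block_delim)
    # The first block is taken to be the sequence identifier.
    annotations = {"ID": blocks[0]}
    for block in blocks[1:]:
        name, _, value = block.partition(field_delim)
        annotations[name] = value.split(value_delim)
    return annotations


# Hypothetical header carrying a barcode annotation.
header = "SEQ1|BARCODE=ACGTACGT|PRIMER=P1,P2"
fields = parse_header(header)
barcode = fields["BARCODE"][0]
```

Here the value of the field passed to -f would correspond to `fields["BARCODE"]`, the sequence that the barcode subcommand clusters on.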

ClusterSets.py set

Cluster sequences within annotation sets.

usage: ClusterSets.py set [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                          [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                          [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                          [--fasta] [--delim DELIMITER DELIMITER DELIMITER]
                          [--nproc NPROC] [-k CLUSTER_FIELD] [--ident IDENT]
                          [--length LENGTH_RATIO] [--prefix CLUSTER_PREFIX]
                          [--cluster {usearch,vsearch,cd-hit-est}]
                          [--mem CLUSTER_MEMORY] [--exec CLUSTER_EXEC]
                          [-f SET_FIELD] [--start SEQ_START] [--end SEQ_END]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-s <seq_files>

A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>

Explicit output file name(s). Note that this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, the output filename will be based on the input filename(s).

--outdir <out_dir>

Changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified, create files containing records that fail processing.

--fasta

Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>

A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilize).

-k <cluster_field>

The name of the output annotation field to add with the cluster information for each sequence.

--ident <ident>

The sequence identity threshold to use for clustering. Note that how identity is calculated depends on the clustering application used.

--length <length_ratio>

The minimum shorter/longer sequence length ratio allowed within a cluster. Setting this value to 1.0 will require identical length matches within clusters. A value of 0.0 will allow clusters containing any length of substring.

--prefix <cluster_prefix>

A string to use as the prefix for each cluster identifier. By default, cluster identifiers will be numeric values only.

--cluster {usearch,vsearch,cd-hit-est}

The clustering tool to use for assigning clusters. Must be one of usearch, vsearch or cd-hit-est. Note that for cd-hit-est the default maximum memory limit is set to 3 GB.

--mem <cluster_memory>

The maximum memory limit for cd-hit-est in MB. Ignored if using usearch or vsearch.

--exec <cluster_exec>

The name or path of the usearch, vsearch or cd-hit-est executable.

-f <set_field>

The annotation field containing annotations, such as UMI barcode, for sequence grouping.

--start <seq_start>

The start of the region to be used for clustering. Together with --end, this parameter can be used to specify a subsequence of each read to use in the clustering algorithm.

--end <seq_end>

The end of the region to be used for clustering.
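
When ClusterSets.py set is called from a script, the argument list can be assembled programmatically. The sketch below only builds and prints the command; it does not run it. The helper name build_set_command, the input file reads.fastq, the field name BARCODE and the parameter values are hypothetical, while the flags themselves are those listed in the usage above.

```python
import shlex

# Clustering tools accepted by --cluster, per the usage text.
CLUSTER_TOOLS = ("usearch", "vsearch", "cd-hit-est")


def build_set_command(seq_file, set_field, cluster_field, ident,
                      tool="vsearch", nproc=1, failed=False):
    """Assemble an argument list for 'ClusterSets.py set'."""
    if tool not in CLUSTER_TOOLS:
        raise ValueError(f"unsupported clustering tool: {tool}")
    argv = [
        "ClusterSets.py", "set",
        "-s", seq_file,
        "-f", set_field,
        "-k", cluster_field,
        "--ident", str(ident),
        "--cluster", tool,
        "--nproc", str(nproc),
    ]
    if failed:
        # Also request output of records that fail processing.
        argv.append("--failed")
    return argv


argv = build_set_command("reads.fastq", "BARCODE", "CLUSTER", 0.8, failed=True)
command_line = shlex.join(argv)
print(command_line)
```

Keeping the command as a list rather than a single string avoids quoting problems if it is later passed to `subprocess.run`.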