SplitSeq.py

Sorts, samples and splits FASTA/FASTQ sequence files

usage: SplitSeq.py [--version] [-h]  ...

--version: show program’s version number and exit

-h, --help: show this help message and exit

output files:

part<part>: reads partitioned by count, where <part> is the partition number.
<field>-<value>: reads partitioned by annotation <field> and <value>.
under-<number>: reads partitioned by numeric threshold where the annotation value is strictly less than the threshold <number>.
atleast-<number>: reads partitioned by numeric threshold where the annotation value is greater than or equal to the threshold <number>.
sorted: reads sorted by annotation value.
sorted-part<part>: reads sorted by annotation value and partitioned by count, where <part> is the partition number.
sample<i>-n<count>: randomly sampled reads where <i> is a number specifying the sampling instance and <count> is the number of sampled reads.
selected: reads passing selection criteria.

output annotation fields:

None

SplitSeq.py count

Splits sequence files by number of records.

usage: SplitSeq.py count [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                         [--outdir OUT_DIR] [--outname OUT_NAME] [--fasta] -n
                         MAX_COUNT

--version: show program’s version number and exit

-h, --help: show this help message and exit

-s <seq_files>: A list of FASTA/FASTQ files containing sequences to process.

--outdir <out_dir>: Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>: Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--fasta: Specify to force output as FASTA rather than FASTQ.

-n <max_count>: Maximum number of sequences in each new file

SplitSeq.py group

Splits sequence files by annotation.

usage: SplitSeq.py group [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                         [--outdir OUT_DIR] [--outname OUT_NAME] [--fasta]
                         [--delim DELIMITER DELIMITER DELIMITER] -f FIELD
                         [--num THRESHOLD]

--version: show program’s version number and exit

-h, --help: show this help message and exit

-s <seq_files>: A list of FASTA/FASTQ files containing sequences to process.

--outdir <out_dir>: Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>: Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--fasta: Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>: A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-f <field>: Annotation field to split sequence files by

--num <threshold>: Specify to define the split field as numeric and group sequences by value.

SplitSeq.py sample

Randomly samples from unpaired sequence files.

usage: SplitSeq.py sample [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                          [--outdir OUT_DIR] [--outname OUT_NAME] [--fasta]
                          [--delim DELIMITER DELIMITER DELIMITER] -n MAX_COUNT
                          [MAX_COUNT ...] [-f FIELD] [-u VALUES [VALUES ...]]

--version: show program’s version number and exit

-h, --help: show this help message and exit

-s <seq_files>: A list of FASTA/FASTQ files containing sequences to process.

--outdir <out_dir>: Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>: Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--fasta: Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>: A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-n <max_count>: Maximum number of sequences to sample from each file, field or annotation set. The default behavior, without the -f argument, is to sample from the complete set of sequences in the input file.

-f <field>: The annotation field for sampling criteria. If the -u argument is not also specified, then sampling will be performed for each unique annotation value in the declared field separately.

-u <values>: If specified, sampling will be restricted to sequences that contain one of the declared annotation values in the specified field. Requires the -f argument.

SplitSeq.py samplepair

Randomly samples from paired-end sequence files.

usage: SplitSeq.py samplepair [--version] [-h] -1 SEQ_FILES_1
                              [SEQ_FILES_1 ...] -2 SEQ_FILES_2
                              [SEQ_FILES_2 ...] [--outdir OUT_DIR]
                              [--outname OUT_NAME] [--fasta]
                              [--delim DELIMITER DELIMITER DELIMITER] -n
                              MAX_COUNT [MAX_COUNT ...] [-f FIELD]
                              [-u VALUES [VALUES ...]]
                              [--coord {illumina,solexa,sra,454,presto}]

--version: show program’s version number and exit

-h, --help: show this help message and exit

-1 <seq_files_1>: An ordered list of FASTA/FASTQ files containing head/primary sequences.

-2 <seq_files_2>: An ordered list of FASTA/FASTQ files containing tail/secondary sequences.

--outdir <out_dir>: Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>: Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--fasta: Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>: A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-n <max_count>: Maximum number of paired sequences to sample from each set of files, fields or annotations. The default behavior, without the -f argument, is to sample from the complete set of paired sequences in the input files.

-f <field>: The annotation field for sampling criteria. If the -u argument is not also specified, then sampling will be performed for each unique annotation value in the declared field separately.

-u <values>: If specified, sampling will be restricted to sequences that contain one of the declared annotation values in the specified field. Requires the -f argument.

--coord {illumina,solexa,sra,454,presto}: The format of the sequence identifier which defines shared coordinate information across paired read files.
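
Paired sampling depends on matching reads across the two files by shared coordinate information. The sketch below illustrates that idea under a loud assumption: the coordinate is taken to be the header text before the first space, which fits common Illumina-style identifiers but is not taken from the SplitSeq.py source. The names coord_key and sample_pairs are hypothetical.

```python
import random


def coord_key(header):
    """Assumed coordinate: the header text before the first space."""
    return header.split(" ")[0]


def sample_pairs(headers_1, headers_2, max_count, seed=None):
    """Sample up to max_count (head, tail) pairs that share a coordinate."""
    index_1 = {coord_key(h): h for h in headers_1}
    index_2 = {coord_key(h): h for h in headers_2}
    # Only coordinates present in both files can form a pair.
    shared = sorted(index_1.keys() & index_2.keys())
    rng = random.Random(seed)
    chosen = rng.sample(shared, min(max_count, len(shared)))
    return [(index_1[key], index_2[key]) for key in chosen]


# Placeholder headers; read3 has no mate in the second list.
heads = ["read1 1:N:0:1", "read2 1:N:0:1", "read3 1:N:0:1", "read4 1:N:0:1"]
tails = ["read4 2:N:0:1", "read2 2:N:0:1", "read1 2:N:0:1"]
pairs = sample_pairs(heads, tails, 2, seed=7)
for head, tail in pairs:
    print(head, "<->", tail)
```

Indexing by coordinate means the two input files do not need to list reads in the same order, and reads without a mate are never sampled.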

SplitSeq.py select

Selects sequences from sequence files by annotation.

usage: SplitSeq.py select [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                          [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                          [--outname OUT_NAME] [--fasta]
                          [--delim DELIMITER DELIMITER DELIMITER] -f FIELD
                          [-u VALUE_LIST [VALUE_LIST ...] | -t VALUE_FILE]
                          [--not]

--version: show program’s version number and exit

-h, --help: show this help message and exit

-s <seq_files>: A list of FASTA/FASTQ files containing sequences to process.

-o <out_files>: Explicit output file name(s). Note, this argument cannot be used with the –failed, –outdir, or –outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>: Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>: Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--fasta: Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>: A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-f <field>: The annotation field for selection criteria.

-u <value_list>: A list of values to select for in the specified field. Mutually exclusive with -t.

-t <value_file>: A tab delimited file specifying values to select for in the specified field. The file must be formatted with the given field name in the header row. Values will be taken from that column. Mutually exclusive with -u.

--not: If specified, will perform negative matching. Meaning, sequences will be selected if they fail to match for all specified values.

SplitSeq.py sort

Sorts sequence files by annotation.

usage: SplitSeq.py sort [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
                        [--outdir OUT_DIR] [--outname OUT_NAME] [--fasta]
                        [--delim DELIMITER DELIMITER DELIMITER] -f FIELD
                        [-n MAX_COUNT] [--num]

--version: show program’s version number and exit

-h, --help: show this help message and exit

-s <seq_files>: A list of FASTA/FASTQ files containing sequences to process.

--outdir <out_dir>: Specify to changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>: Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--fasta: Specify to force output as FASTA rather than FASTQ.

--delim <delimiter>: A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.

-f <field>: The annotation field to sort sequences by.

-n <max_count>: Maximum number of sequences in each new file.

--num: Specify to define the sort field as numeric rather than textual.