SplitSeq.py
Sorts, samples and splits FASTA/FASTQ sequence files
usage: SplitSeq.py [--version] [-h] ...
- --version
show program’s version number and exit
- -h, --help
show this help message and exit
- output files:
- part<part>
reads partitioned by count, where <part> is the partition number.
- <field>-<value>
reads partitioned by annotation <field> and <value>.
- under-<number>
reads partitioned by numeric threshold where the annotation value is strictly less than the threshold <number>.
- atleast-<number>
reads partitioned by numeric threshold where the annotation value is greater than or equal to the threshold <number>.
- sorted
reads sorted by annotation value.
- sorted-part<part>
reads sorted by annotation value and partitioned by count, where <part> is the partition number.
- sample<i>-n<count>
randomly sampled reads where <i> is a number specifying the sampling instance and <count> is the number of sampled reads.
- selected
reads passing selection criteria.
- output annotation fields:
None
SplitSeq.py count
Splits sequence files by number of records.
usage: SplitSeq.py count [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[--outdir OUT_DIR] [--outname OUT_NAME] [--fasta] -n
MAX_COUNT
- --version
show program’s version number and exit
- -h, --help
show this help message and exit
- -s <seq_files>
A list of FASTA/FASTQ files containing sequences to process.
- --outdir <out_dir>
Changes the output directory to the location specified. The input file directory is used if this is not specified.
- --outname <out_name>
Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
- --fasta
Specify to force output as FASTA rather than FASTQ.
- -n <max_count>
Maximum number of sequences in each new file.
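The partitioning that `count` describes can be pictured with a minimal sketch. This is an illustration only, not the tool's code: `partition_by_count` is a hypothetical helper, the reads are placeholder strings, and numbering partitions from 1 is an assumption.

```python
from itertools import islice


def partition_by_count(records, max_count):
    """Yield (part_number, chunk) pairs holding at most max_count records each.

    Hypothetical helper; part numbers starting at 1 is an assumption.
    """
    if max_count < 1:
        raise ValueError("max_count must be a positive integer")
    iterator = iter(records)
    part = 1
    while True:
        chunk = list(islice(iterator, max_count))
        if not chunk:
            break
        yield part, chunk
        part += 1


# Seven placeholder reads split with a maximum of three per partition.
reads = ["read%d" % i for i in range(1, 8)]
parts = {"part%d" % n: chunk for n, chunk in partition_by_count(reads, 3)}
```

The last partition simply holds whatever remains, so it may be smaller than `max_count`.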
SplitSeq.py group
Splits sequence files by annotation.
usage: SplitSeq.py group [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[--outdir OUT_DIR] [--outname OUT_NAME] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER] -f FIELD
[--num THRESHOLD]
- --version
show program’s version number and exit
- -h, --help
show this help message and exit
- -s <seq_files>
A list of FASTA/FASTQ files containing sequences to process.
- --outdir <out_dir>
Changes the output directory to the location specified. The input file directory is used if this is not specified.
- --outname <out_name>
Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
- --fasta
Specify to force output as FASTA rather than FASTQ.
- --delim <delimiter>
A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
- -f <field>
Annotation field to split sequence files by.
- --num <threshold>
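Defines the split field as numeric and groups sequences by whether the annotation value is strictly less than the threshold (under-<number>) or greater than or equal to it (atleast-<number>).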
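As a rough sketch of how `group` assigns a read to an output label, the following parses an annotated header and applies either the per-value or the threshold rule. Several things here are assumptions rather than documented behavior: the header layout `ID|FIELD=value`, the default delimiters `("|", "=", ",")`, the helper names `parse_header` and `group_label`, and the example field names.

```python
from collections import defaultdict


def parse_header(header, delimiters=("|", "=", ",")):
    """Split 'ID|FIELD=value|...' into a dict (assumed header layout)."""
    block_sep, field_sep, _value_sep = delimiters
    blocks = header.split(block_sep)
    annotations = {"ID": blocks[0]}
    for block in blocks[1:]:
        name, _, value = block.partition(field_sep)
        annotations[name] = value
    return annotations


def group_label(header, field, threshold=None):
    """Return the output label for one read (hypothetical helper)."""
    value = parse_header(header)[field]
    if threshold is None:
        return "%s-%s" % (field, value)      # <field>-<value>
    if float(value) < threshold:
        return "under-%s" % threshold        # strictly less than
    return "atleast-%s" % threshold          # greater than or equal to


headers = ["seq1|DUPCOUNT=1", "seq2|DUPCOUNT=2", "seq3|DUPCOUNT=5"]
groups = defaultdict(list)
for header in headers:
    groups[group_label(header, "DUPCOUNT", threshold=2)].append(header)
```

A value equal to the threshold lands in the `atleast` group, matching the strict inequality described for `under-<number>`.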
SplitSeq.py sample
Randomly samples from unpaired sequence files.
usage: SplitSeq.py sample [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[--outdir OUT_DIR] [--outname OUT_NAME] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER] -n MAX_COUNT
[MAX_COUNT ...] [-f FIELD] [-u VALUES [VALUES ...]]
- --version
show program’s version number and exit
- -h, --help
show this help message and exit
- -s <seq_files>
A list of FASTA/FASTQ files containing sequences to process.
- --outdir <out_dir>
Changes the output directory to the location specified. The input file directory is used if this is not specified.
- --outname <out_name>
Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
- --fasta
Specify to force output as FASTA rather than FASTQ.
- --delim <delimiter>
A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
- -n <max_count>
Maximum number of sequences to sample from each file, field or annotation set. The default behavior, without the -f argument, is to sample from the complete set of sequences in the input file.
- -f <field>
The annotation field for sampling criteria. If the -u argument is not also specified, then sampling will be performed for each unique annotation value in the declared field separately.
- -u <values>
If specified, sampling will be restricted to sequences that contain one of the declared annotation values in the specified field. Requires the -f argument.
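The interaction between -n, -f and -u can be sketched as three cases. This is a hedged illustration, not the tool's implementation: `sample_records` is a hypothetical function, records are plain dicts of annotations, and pooling all reads that match any -u value into one sampling set is an assumption about the described behavior.

```python
import random
from collections import defaultdict


def sample_records(records, max_count, field=None, values=None, seed=None):
    """Sample up to max_count records per sampling set (hypothetical sketch)."""
    rng = random.Random(seed)

    def draw(pool):
        # Never ask for more records than the pool holds.
        return rng.sample(pool, min(max_count, len(pool)))

    if field is None:
        # No -f: sample from the complete set of records.
        return draw(list(records))
    if values is not None:
        # -f with -u: restrict to records carrying one of the values
        # (pooling them into a single set is an assumption).
        return draw([r for r in records if r.get(field) in values])
    # -f without -u: sample each unique value of the field separately.
    groups = defaultdict(list)
    for record in records:
        if field in record:
            groups[record[field]].append(record)
    sampled = []
    for value in sorted(groups):
        sampled.extend(draw(groups[value]))
    return sampled


records = [
    {"ID": "s1", "SAMPLE": "A"}, {"ID": "s2", "SAMPLE": "A"},
    {"ID": "s3", "SAMPLE": "A"}, {"ID": "s4", "SAMPLE": "B"},
    {"ID": "s5", "SAMPLE": "B"},
]
per_value = sample_records(records, 2, field="SAMPLE", seed=0)
only_a = sample_records(records, 2, field="SAMPLE", values={"A"}, seed=0)
whole_file = sample_records(records, 10, seed=0)
```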
SplitSeq.py samplepair
Randomly samples from paired-end sequence files.
usage: SplitSeq.py samplepair [--version] [-h] -1 SEQ_FILES_1
[SEQ_FILES_1 ...] -2 SEQ_FILES_2
[SEQ_FILES_2 ...] [--outdir OUT_DIR]
[--outname OUT_NAME] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER] -n
MAX_COUNT [MAX_COUNT ...] [-f FIELD]
[-u VALUES [VALUES ...]]
[--coord {illumina,solexa,sra,454,presto}]
- --version
show program’s version number and exit
- -h, --help
show this help message and exit
- -1 <seq_files_1>
An ordered list of FASTA/FASTQ files containing head/primary sequences.
- -2 <seq_files_2>
An ordered list of FASTA/FASTQ files containing tail/secondary sequences.
- --outdir <out_dir>
Changes the output directory to the location specified. The input file directory is used if this is not specified.
- --outname <out_name>
Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
- --fasta
Specify to force output as FASTA rather than FASTQ.
- --delim <delimiter>
A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
- -n <max_count>
Maximum number of paired sequences to sample from each set of files, fields or annotations. The default behavior, without the -f argument, is to sample from the complete set of paired sequences in the input files.
- -f <field>
The annotation field for sampling criteria. If the -u argument is not also specified, then sampling will be performed for each unique annotation value in the declared field separately.
- -u <values>
If specified, sampling will be restricted to sequences that contain one of the declared annotation values in the specified field. Requires the -f argument.
- --coord {illumina,solexa,sra,454,presto}
The format of the sequence identifier which defines shared coordinate information across paired read files.
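One way to picture paired sampling is to match reads across the two files by a shared coordinate key and then sample from the keys present in both. The sketch below assumes exactly that; `sample_pairs` and `first_token` are hypothetical names, the headers are invented, and the key function merely stands in for whatever rule each --coord format actually applies.

```python
import random


def sample_pairs(reads_1, reads_2, max_count, coord_key, seed=None):
    """Sample up to max_count mate pairs that share a coordinate key.

    Hypothetical sketch: coord_key is a caller-supplied function that
    extracts the shared coordinate from a header.
    """
    index_1 = {coord_key(header): header for header in reads_1}
    index_2 = {coord_key(header): header for header in reads_2}
    # Only coordinates present in both files can form a pair.
    shared = sorted(index_1.keys() & index_2.keys())
    rng = random.Random(seed)
    chosen = rng.sample(shared, min(max_count, len(shared)))
    return [(index_1[key], index_2[key]) for key in chosen]


def first_token(header):
    """Assumed coordinate rule: the text before the first whitespace."""
    return header.split()[0]


heads = ["r1 1:N:0", "r2 1:N:0", "r3 1:N:0"]
tails = ["r3 2:N:0", "r1 2:N:0", "r4 2:N:0"]
pairs = sample_pairs(heads, tails, 2, first_token, seed=1)
```

Because the same sampled keys are used to pull from both indexes, each head read is emitted alongside its own mate regardless of the order of the input files.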
SplitSeq.py select
Selects sequences from sequence files by annotation.
usage: SplitSeq.py select [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER] -f FIELD
[-u VALUE_LIST [VALUE_LIST ...] | -t VALUE_FILE]
[--not]
- --version
show program’s version number and exit
- -h, --help
show this help message and exit
- -s <seq_files>
A list of FASTA/FASTQ files containing sequences to process.
- -o <out_files>
Explicit output file name(s). Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).
- --outdir <out_dir>
Changes the output directory to the location specified. The input file directory is used if this is not specified.
- --outname <out_name>
Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
- --fasta
Specify to force output as FASTA rather than FASTQ.
- --delim <delimiter>
A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
- -f <field>
The annotation field for selection criteria.
- -u <value_list>
A list of values to select for in the specified field. Mutually exclusive with -t.
- -t <value_file>
A tab delimited file specifying values to select for in the specified field. The file must be formatted with the given field name in the header row. Values will be taken from that column. Mutually exclusive with -u.
- --not
If specified, performs negative matching: sequences are selected only if they match none of the specified values.
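The selection rules can be sketched as follows, including reading values from a tab-delimited file whose header row names the field. This is an illustration under assumptions: `values_from_table` and `select_records` are hypothetical helpers, records are plain dicts, and treating a read that lacks the field as a non-match (so it passes under negative matching) is an assumption.

```python
import csv
import io


def values_from_table(handle, field):
    """Read the column named `field` from a tab-delimited table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row[field] for row in reader]


def select_records(records, field, values, negate=False):
    """Keep records whose field matches one of the values.

    With negate=True, keep records that match none of them instead.
    """
    wanted = set(values)
    selected = []
    for record in records:
        matched = record.get(field) in wanted
        if matched != negate:
            selected.append(record)
    return selected


# An in-memory stand-in for a value file with a SAMPLE column.
table = io.StringIO("SAMPLE\tNOTE\nA\tfirst\nC\tsecond\n")
wanted_values = values_from_table(table, "SAMPLE")

records = [
    {"ID": "s1", "SAMPLE": "A"},
    {"ID": "s2", "SAMPLE": "B"},
    {"ID": "s3", "SAMPLE": "C"},
]
kept = select_records(records, "SAMPLE", wanted_values)
rejected = select_records(records, "SAMPLE", wanted_values, negate=True)
```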
SplitSeq.py sort
Sorts sequence files by annotation.
usage: SplitSeq.py sort [--version] [-h] -s SEQ_FILES [SEQ_FILES ...]
[--outdir OUT_DIR] [--outname OUT_NAME] [--fasta]
[--delim DELIMITER DELIMITER DELIMITER] -f FIELD
[-n MAX_COUNT] [--num]
- --version
show program’s version number and exit
- -h, --help
show this help message and exit
- -s <seq_files>
A list of FASTA/FASTQ files containing sequences to process.
- --outdir <out_dir>
Changes the output directory to the location specified. The input file directory is used if this is not specified.
- --outname <out_name>
Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
- --fasta
Specify to force output as FASTA rather than FASTQ.
- --delim <delimiter>
A list of the three delimiters that separate annotation blocks, field names and values, and values within a field, respectively.
- -f <field>
The annotation field to sort sequences by.
- -n <max_count>
Maximum number of sequences in each new file.
- --num
Specify to define the sort field as numeric rather than textual.
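The difference between textual and numeric sorting matters whenever annotation values are numbers stored as text. A minimal sketch, assuming records are plain dicts and using a hypothetical `sort_records` helper and an invented field name:

```python
def sort_records(records, field, numeric=False):
    """Sort records by an annotation field, as text or as numbers."""
    if numeric:
        return sorted(records, key=lambda record: float(record[field]))
    return sorted(records, key=lambda record: record[field])


records = [
    {"ID": "s1", "DUPCOUNT": "10"},
    {"ID": "s2", "DUPCOUNT": "9"},
    {"ID": "s3", "DUPCOUNT": "100"},
]
textual = [r["DUPCOUNT"] for r in sort_records(records, "DUPCOUNT")]
numeric = [r["DUPCOUNT"] for r in sort_records(records, "DUPCOUNT", numeric=True)]
```

Textual ordering compares character by character, so "100" sorts before "9"; numeric ordering compares the parsed values.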