Filtering, Subsetting and Converting

Cleaning or removing poor quality sequences

Data sets can be cleaned using one or more invocations of, which provides multiple sequence quality control operations. Four subcommands remove sequences from the data that fail to meet some threshold: including length, (length), number of N or gap characters (missing), homopolymeric tract length (repeats), or mean Phred quality score (quality). Two subcommands modify sequences without removing them: trimqual truncates the sequences when the mean Phred quality scores decays under a threshold, and maskqual replaces positions with low Phred quality scores with N characters. provides the following quality control subcommands:




Removes short sequences


Removes sequences with too many Ns or gaps


Removes sequences with long homopolymeric tracts


Removes sequences with low mean quality scores


Truncates sequences where quality scores decay


Masks low quality positions

Subsetting sequence files by annotation

The group subcommand of allows you to split one file into multiple files based on the values in a sequence annotation. For example, splitting one file with multiple SAMPLE annotations into separate files (one for each sample) would be accomplished by: group -s reads.fastq -f SAMPLE

which will create a set of files labelled SAMPLE-M1 and SAMPLE-M2 if samples are named M1 and M2.

If you wanted to split based on a numeric value, rather than a set of categorical values, then you would add the --num argument. would then create two files: one containing sequences with values less than the threshold specified by the --num argument and one file containing sequences with values greater than or equal to the threshold: group -s reads.fastq -f DUPCOUNT --num 2

which will create two files with the labels atleast-2 and under-2.

Random sampling from sequence files

The sample subcommand of may be used to generate a random sample from a sequence file or set of pair-end files. The example below will select a random sample of 1,000 sequences (-n 1000) which all contain the annotation SAMPLE=M1 (-f SAMPLE and -u M1): sample -s reads.fastq -f SAMPLE -u M1 -n 1000

Performing an analogous sampling of Illumina paired-end reads would be accomplished using the samplepair subcommand: samplepair -s reads.fastq -f SAMPLE -u M1 -n 1000 --coord illumina


Both the -f and -n arguments will accept a list of values (eg, -n 1000 100 10), allowing you to sample multiple times from multiple files in one command.

Converting to FASTA for IMGT/HighV-QUEST or IgBLAST

IMGT/HighV-QUEST and IgBLAST both require sequences in FASTA format. You can use the Immcantation script to convert .fastq to .fasta. The script is available from the repository and is pre-installed in the Docker container. reads.fastq

Alternatively, you can request to output FASTA files by using the flag --fasta. In the example workflows, a common last step in the data processing pipelines is filtering sequences with at least two representative reads. By adding --fasta to the command, the output file will be a .fasta file. group -s reads.fastq -f DUPCOUNT --num 2 --fasta

Reducing file size for submission to IMGT/HighV-QUEST

IMGT/HighV-QUEST currently limits the size of uploaded files to 500,000 sequences. To accomodate this limit, you can use the count subcommand of to divide your files into small pieces. count -s reads.fastq -n 500000 --fasta

The -n 500000 argument sets the maximum number of sequences in each file and the --fasta tells the tool to output a FASTA, rather than FASTQ, formatted file.


You can usually avoid the necessity of reducing file sizes by removing duplicate sequences first using the tool.