.. _Filter:

Filtering, Subsetting and Converting
================================================================================

Cleaning or removing poor quality sequences
--------------------------------------------------------------------------------

Data sets can be cleaned using one or more invocations of :ref:`FilterSeq`,
which provides multiple sequence quality control operations. Four subcommands
remove sequences that fail to meet a threshold on length (:program:`length`),
number of N or gap characters (:program:`missing`), homopolymeric tract length
(:program:`repeats`), or mean Phred quality score (:program:`quality`).
Two subcommands modify sequences without removing them: :program:`trimqual`
truncates sequences where the mean Phred quality score decays below a
threshold, and :program:`maskqual` replaces positions with low Phred quality
scores with N characters.

:ref:`FilterSeq` provides the following quality control subcommands:

============  =================================================
Subcommand    Operation
============  =================================================
length        Removes short sequences
missing       Removes sequences with too many Ns or gaps
repeats       Removes sequences with long homopolymeric tracts
quality       Removes sequences with low mean quality scores
trimqual      Truncates sequences where quality scores decay
maskqual      Masks low quality positions
============  =================================================

Subsetting sequence files by annotation
--------------------------------------------------------------------------------

The :program:`group` subcommand of :ref:`SplitSeq` allows you to split one file
into multiple files based on the values in a sequence annotation. For example,
splitting one file with multiple ``SAMPLE`` annotations into separate files
(one for each sample) would be accomplished by::

    SplitSeq.py group -s reads.fastq -f SAMPLE

which will create a set of files labelled ``SAMPLE-M1`` and ``SAMPLE-M2`` if
samples are named ``M1`` and ``M2``.

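
As a rough illustration of grouping by annotation value, the sketch below
parses sequence headers and buckets them by one field. It assumes pRESTO-style
headers of the form ``ID|FIELD=VALUE|...``. The helper names
``parse_annotations`` and ``group_by_field`` are hypothetical and are not part
of the pRESTO API, and the real tool writes one file per group rather than
returning a dictionary.

.. code-block:: python

    from collections import defaultdict


    def parse_annotations(header):
        """Parse an 'ID|KEY=VALUE|...' header into a dict (assumed format)."""
        fields = header.lstrip(">@").split("|")
        annotations = {"ID": fields[0]}
        for field in fields[1:]:
            key, _, value = field.partition("=")
            annotations[key] = value
        return annotations


    def group_by_field(headers, field):
        """Bucket headers under labels such as 'SAMPLE-M1' (hypothetical helper)."""
        groups = defaultdict(list)
        for header in headers:
            value = parse_annotations(header).get(field)
            if value is not None:
                # Headers lacking the field are skipped in this sketch
                groups[f"{field}-{value}"].append(header)
        return dict(groups)


    headers = [
        "@read1|SAMPLE=M1|DUPCOUNT=3",
        "@read2|SAMPLE=M2|DUPCOUNT=1",
        "@read3|SAMPLE=M1|DUPCOUNT=2",
    ]
    groups = group_by_field(headers, "SAMPLE")
    for label, members in sorted(groups.items()):
        print(label, len(members))

Each distinct value of the chosen field yields one group, which mirrors the
one-file-per-sample behavior described above.
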
If you want to split on a numeric value, rather than on a set of categorical
values, add the :option:`--num` argument. :ref:`SplitSeq` will then create two
files: one containing sequences with values less than the threshold specified
by :option:`--num`, and one containing sequences with values greater than or
equal to the threshold::

    SplitSeq.py group -s reads.fastq -f DUPCOUNT --num 2

which will create two files with the labels ``atleast-2`` and ``under-2``.

.. _Filter-RandomSampling:

Random sampling from sequence files
--------------------------------------------------------------------------------

The :program:`sample` subcommand of :ref:`SplitSeq` may be used to generate a
random sample from a sequence file or set of paired-end files. The example
below will select a random sample of 1,000 sequences (:option:`-n 1000`) which
all contain the annotation ``SAMPLE=M1`` (:option:`-f SAMPLE` and
:option:`-u M1`)::

    SplitSeq.py sample -s reads.fastq -f SAMPLE -u M1 -n 1000

Performing an analogous sampling of Illumina paired-end reads would be
accomplished using the :program:`samplepair` subcommand::

    SplitSeq.py samplepair -s reads.fastq -f SAMPLE -u M1 -n 1000 --coord illumina

.. note::

    Both the :option:`-f` and :option:`-n` arguments will accept a list of
    values (e.g., ``-n 1000 100 10``), allowing you to sample multiple times
    from multiple files in one command.

Converting to FASTA for IMGT/HighV-QUEST or IgBLAST
--------------------------------------------------------------------------------

IMGT/HighV-QUEST and IgBLAST both require sequences in FASTA format. You can
use the Immcantation script ``fastq2fasta.py`` to convert ``.fastq`` to
``.fasta``. The script is available from the Immcantation repository and is
pre-installed in the Docker container.

.. code-block:: none

    fastq2fasta.py reads.fastq

Alternatively, you can have :ref:`SplitSeq.py` output FASTA files by using
the flag :option:`--fasta`.
In the example workflows, a common last step in the data processing pipelines
is keeping only sequences with at least two representative reads. Adding
:option:`--fasta` to that command writes the output as a ``.fasta`` file.

.. code-block:: none

    SplitSeq.py group -s reads.fastq -f DUPCOUNT --num 2 --fasta

Reducing file size for submission to IMGT/HighV-QUEST
--------------------------------------------------------------------------------

IMGT/HighV-QUEST currently limits the size of uploaded files to 500,000
sequences. To accommodate this limit, you can use the :program:`count`
subcommand of :ref:`SplitSeq` to divide your files into smaller pieces.

.. code-block:: none

    SplitSeq.py count -s reads.fastq -n 500000 --fasta

The :option:`-n 500000` argument sets the maximum number of sequences in each
file and the :option:`--fasta` flag tells the tool to output a FASTA, rather
than FASTQ, formatted file.

.. note::

    You can usually avoid the need to reduce file sizes by removing duplicate
    sequences first using the :ref:`CollapseSeq` tool.
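
The effect of splitting by count can be pictured as cutting a stream of records
into consecutive chunks that each hold at most a fixed number of sequences.
The sketch below is only an illustration of that idea, not how
:ref:`SplitSeq` is implemented. The function name ``chunk_records`` is
hypothetical, and a limit of 5 stands in for the 500,000 sequence limit so the
example stays small.

.. code-block:: python

    from itertools import islice


    def chunk_records(records, max_count):
        """Yield consecutive lists holding at most max_count records each."""
        iterator = iter(records)
        while True:
            chunk = list(islice(iterator, max_count))
            if not chunk:
                break
            yield chunk


    # Twelve placeholder records split with a limit of five per chunk
    records = [f"seq{i}" for i in range(12)]
    chunks = list(chunk_records(records, 5))
    sizes = [len(chunk) for chunk in chunks]
    print(sizes)

Every chunk except possibly the last is full, so the number of output pieces is
the record count divided by the limit, rounded up.
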