Miscellaneous Tasks

Importing data from SRA, ENA or GenBank into pRESTO

If you have download a data set from GenBank, SRA or ENA the format of the sequences headers are different from the raw Roche 454 and Illumina header format. As such, they may or may not be compatible with pRESTO, depending on how the headers have been modified by the sequence archive. The ConvertHeaders allow you to change incompatible header formats into the pRESTO format. For example, to convert from SRA or ENA headers the sra subcommand would be used:

ConvertHeaders.py sra -s reads.fastq

ConvertHeaders provides the following conversion subcommands:

Subcommand Formats Converted
generic Headers with an unknown annotation system
454 Roche 454
genbank NCBI GenBank and RefSeq
illumina Illumina HiSeq or MiSeq
imgt IMGT/GENE-DB
sra NCBI SRA or EMBL-EBI ENA

Reducing file size for submission to IMGT/HighV-QUEST

IMGT/HighV-QUEST currently limits the size of uploaded files to 500,000 sequences. To accomodate this limit, you can use the count subcommand of SplitSeq to divide your files into small pieces.

SplitSeq.py count -s reads.fastq -n 500000 --fasta

The -n 500000 argument sets the maximum number of sequences in each file and the --fasta tells the tool to output a FASTA, rather than FASTQ, formatted file.

Note

You can usually avoid the necessity of reducing file sizes by removing duplicate sequences first using the CollapseSeq tool.

Subsetting sequence files by annotation

The group subcommand of SplitSeq allows you to split one file into multiple files based on the values in a sequence annotation. For example, splitting one file with multiple SAMPLE annotations into separate files (one for each sample) would be accomplished by:

SplitSeq.py group -s reads.fastq -f SAMPLE

Which will create a set of files labelled SAMPLE-M1 and SAMPLE-M2, if samples are named M1 and M2.

If you wanted to split based on a numeric value, rather than a set of categorical values, then you would add the --num argument. SplitSeq would then create two files: one containing sequences with values less than the threshold specified by the --num argument and one file containing sequences with values greater than or equal to the threshold:

SplitSeq.py group -s reads.fastq -f DUPCOUNT --num 2

Which will create two files with the labels atleast-2 and under-2.

Random sampling from sequence files

The sample subcommand of SplitSeq may be used to generate a random sample from a sequence file or set of pair-end files. The example below will select a random sample of 1,000 sequences (-n 1000) which all contain the annotation SAMPLE=M1 (-f SAMPLE and -u M1):

SplitSeq.py sample -s reads.fastq -f SAMPLE -u M1 -n 1000

Performing an analogous sampling of Illumina paired-end reads would be accomplished using the samplepair subcommand:

SplitSeq.py samplepair -s reads.fastq -f SAMPLE -u M1 -n 1000 --coord illumina

Note

Both the -f and -n arguments will accept a list of values (eg, -n 1000 100 10), allowing you to sample multiple times from multiple files in one command.

Cleaning or removing poor quality sequences

Data sets can be cleaned using one or more invocations of FilterSeq, which provides multiple sequence quality control operations. Four subcommands remove sequences from the data that fail to meet some threshold: including length, (length), number of N or gap characters (missing), homopolymeric tract length (repeats), or mean Phred quality score (quality). Two subcommands modify sequences without removing them: trimqual truncates the sequences when the mean Phred quality scores decays under a threshold, and maskqual replaces positions with low Phred quality scores with N characters.

FilterSeq provides the following quality control subcommands:

Subcommand Operation
length Removes short sequences
missing Removes sequences with too many Ns or gaps
repeats Removes sequences with long homopolymeric tracts
quality Removes sequences with low mean quality scores
trimqual Truncates sequences where quality scores decay
maskqual Masks low quality positions

Assembling paired-end reads that do not overlap

The typical way to assemble paired-end reads is via de novo assembly using the align subcommand of AssemblePairs. However, some sequences with long CDR3 regions may fail to assemble due to insufficient, or completely absent, overlap between the mate-pairs. The reference or sequential subcommands can be used to assemble mate-pairs that do not overlap using the ungapped V-segment references sequences as a guide.

To handle such sequence in two separate steps, a normal align command would be performed first. The --failed argument is added so that the reads failing de novo alignment are output to separate files:

AssemblePairs.py align -1 reads-1.fastq -2 reads-2.fastq --rc tail \
    --coord illumina --failed -outname align

Then, the files labeled assemble-fail, along with the ungapped V-segment reference sequences (-r vref.fasta), would be input into the reference subcommand of AssemblePairs:

AssemblePairs.py reference -1 align-1_assemble-fail.fastq -2 align-2_assemble-fail.fastq \
     --rc tail -r vref.fasta --coord illumina --outname ref

This will result in two separate assemble-pass files - one from each step. You may process them separately or concatenate them together into a single file:

cat align_assemble-pass.fastq ref_assemble-pass.fastq > merged_assemble-pass.fastq

However, if you intend to processes them together, you may simplify this by perform both steps using the sequential subcommand, which will attempt de novo assembly followed by reference guided assembly if de novo assembly fails:

AssemblePairs.py sequential -1 reads-1.fastq -2 reads-2.fastq --rc tail \
    --coord illumina -r vref.fasta

Note

The sequences output by the reference or sequential subcommands may contain an appropriate length spacer of Ns between any mate-pairs that do not overlap. The :option:–fill <AssemblePairs reference –fill>`` argument may be specified to force AssemblePairs to insert the germline sequence into the missing positions, but this should be used with caution as the inserted sequence may not be biologically correct.

Assigning isotype annotations from the constant region sequence

MaskPrimers is usually used to remove primer regions and annotate sequences with primer identifiers. However, it can be used for any other case where you need to align a set of short sequences against the reads. One example alternate use is where you either do not know the C-region primer sequences or do not trust the primer region to provide an accurate isotype assignment.

If you build a FASTA file containing the reverse-complement of short sequences from the front of CH-1, then you can annotate the reads with these sequence in the same way you would C-region specific primers:

MaskPrimers.py align -s reads.fastq -p IGHC.fasta --maxlen 100 --maxerror 0.3 \
    --mode cut --revpr --pf C_CALL

Where --revpr tells MaskPrimers to reverse-complement the “primer” sequences and look for them at the end of the reads, --maxlen 100 restricts the search to the last 100 bp, --maxerror 0.3 allows for up to 30% mismatches, and -p IGHC.fasta specifies the file containing the CH-1 sequences. The name of the C-region will be added to the sequence headers as the C_CALL annotation, where the field name is specified by the --pf argument. An example CH-1 sequence file would look like:

>IGHD
CTGATATGATGGGGAACACATCCGGAGCCTTGGTGGGTGC
>IGHM
AGGAGACGAGGGGGAAAAGGGTTGGGGCGGATGCACTCCC
>IGHG
AGGGYGCCAGGGGGAAGACSGATGGGCCCTTGGTGGAAGC
>IGHA
MGAGGCTCAGCGGGAAGACCTTGGGGCTGGTCGGGGATGC
>IGHE
AGCGGGTCAAGGGGAAGACGGATGGGCTCTGTGTGGAGGC

Download IGHC.fasta

See also

Constant region reference sequences may be downloaded from IMGT and the sequence headers can be reformated using the imgt subcommand of ConvertHeaders. Note, you may need to clean-up the reference sequences a bit before running ConvertHeaders if you receive an error about duplicate sequence names (eg, remove duplicate allele with different artificial splicing). To cut and reverse-complement the constant region sequences use something like seqmagick.

Estimating sequencing and PCR error rates with UMI data

The EstimateError tool provides methods for estimating the combined PCR and sequencing error rates from large UMI read groups. The assumptions being, that consensus sequences generated from sufficiently large UMI read groups should be accurate representations of the true sequences, and that the rate of mismatches from consensus should therefore be an accurate estimate of the error rate in the data. However, this is not guaranteed to be true, hence this approach can only be considered an estimate of a data set’s error profile. The following command generates an error profile from UMI read groups with 50 or more sequences (-n 50), using a majority rule consensus sequence (--mode freq), and excluding UMI read groups with high nucleotide diversity (--maxdiv 0.1):

EstimateError.py -s reads.fastq -n 50 --mode freq --maxdiv 0.1

This generates the following tab-delimited files containing error rates broken down by various criteria:

File Error profile
reads_error-position.tab Error rates by read position
reads_error-quality.tab Error rates by quality score
reads_error-nucleotide.tab Error rates by nucleotide identity
reads_error-set.tab Error rates by UMI read group size