Fixing Assembly Problems

Assembling paired-end reads that do not overlap

The typical way to assemble paired-end reads is via de novo assembly using the align subcommand of AssemblePairs.py. However, some sequences with long CDR3 regions may fail to assemble due to insufficient or completely absent overlaps between the mate-pairs. The reference or sequential subcommands can be used to assemble mate-pairs that do not overlap using the ungapped V-segment references sequences as a guide.

To handle such sequences in two separate steps, a normal align command would be performed first. The --failed argument is added so that the reads failing de novo alignment are output to separate files:

AssemblePairs.py align -1 reads-1.fastq -2 reads-2.fastq --rc tail \
    --coord illumina --failed -outname align

Then the files labeled assemble-fail, along with the ungapped V-segment reference sequences (-r vref.fasta), would be input into the reference subcommand of AssemblePairs.py:

AssemblePairs.py reference -1 align-1_assemble-fail.fastq -2 align-2_assemble-fail.fastq \
     --rc tail -r vref.fasta --coord illumina --outname ref

This will result in two separate assemble-pass files - one from each step. You may process them separately or concatenate them together into a single file:

cat align_assemble-pass.fastq ref_assemble-pass.fastq > merged_assemble-pass.fastq

However, if you intend to processes them together, you may simplify this by perform both steps using the sequential subcommand, which will attempt de novo assembly followed by reference guided assembly if de novo assembly fails:

AssemblePairs.py sequential -1 reads-1.fastq -2 reads-2.fastq --rc tail \
    --coord illumina -r vref.fasta

Note

The sequences output by the reference or sequential subcommands may contain an appropriate length spacer of Ns between any mate-pairs that do not overlap. The --fill argument can be specified to force AssemblePairs.py to insert the germline sequence into the missing positions, but this should be used with caution as the inserted sequence may not be biologically correct.