Manipulating Annotations

The ParseHeaders.py tool provides a collection of methods for performing simple manipulations of sequence headers that are formatted in the pRESTO annotation scheme.

For converting sequence headers into the pRESTO format, see the Importing Data documentation.

Adding a sample annotation

Annotation values are added using the add subcommand of ParseHeaders.py:

ParseHeaders.py add -s reads.fastq -f SAMPLE -u A1

which will add the annotation SAMPLE=A1 to each sequence of the input file.
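
The effect on a single header can be sketched in a few lines of Python. This is only an illustration of the string transformation, not pRESTO's implementation: the function name add_annotation, the sequence ID SEQ1, and the assumption that the field is not already present in the header are all hypothetical.

```python
def add_annotation(header, field, value, delimiter="|"):
    """Append FIELD=VALUE to a header string.

    Illustrative only: assumes the header uses '|' between annotations
    and that `field` is not already present.
    """
    return f"{header}{delimiter}{field}={value}"


# Hypothetical header: a sequence ID followed by one existing annotation.
header = "SEQ1|PRIMER=VH3"
print(add_annotation(header, "SAMPLE", "A1"))
# prints: SEQ1|PRIMER=VH3|SAMPLE=A1
```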

Expanding and renaming annotations

By default, pRESTO will not delete annotations. If a sequence header already contains an annotation that a tool is trying to add, the tool will not overwrite that annotation. Instead, it will append the new value to the values already present, in comma-delimited form. For example, after two iterations of MaskPrimers.py with the default primer field name PRIMER, you will have an annotation in the following form (reflecting a match against primer VH3 in the first iteration and primer IGHM in the second):

PRIMER=VH3,IGHM

Separating these values into two annotations is accomplished via the expand subcommand of ParseHeaders.py:

ParseHeaders.py expand -s reads.fastq -f PRIMER

Resulting in the annotations:

PRIMER1=VH3|PRIMER2=IGHM

which may then be renamed via the rename subcommand of ParseHeaders.py:

ParseHeaders.py rename -s reads_reheader.fastq -f PRIMER1 PRIMER2 \
    -k VPRIMER CPRIMER
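
The append, expand and rename steps described in this section can be modeled on a bare annotation string as follows. This is a minimal sketch under stated assumptions, not pRESTO's code: the helper names (parse, flatten, append_value, expand, rename) are hypothetical, the sequence ID is omitted, and rename assumes the new names are not already present in the header.

```python
def parse(annotations):
    """Split 'A=1|B=2' into a dict of field -> value string (order kept)."""
    return dict(item.split("=", 1) for item in annotations.split("|"))


def flatten(fields):
    """Join a dict of fields back into 'A=1|B=2' form."""
    return "|".join(f"{key}={value}" for key, value in fields.items())


def append_value(fields, field, value):
    """Add a value without overwriting: existing values are kept, comma-delimited."""
    out = dict(fields)
    out[field] = f"{out[field]},{value}" if field in out else value
    return out


def expand(fields, field):
    """Split a comma-delimited field into numbered fields (FIELD1, FIELD2, ...)."""
    out = {}
    for key, value in fields.items():
        if key == field:
            for i, part in enumerate(value.split(","), start=1):
                out[f"{field}{i}"] = part
        else:
            out[key] = value
    return out


def rename(fields, old_names, new_names):
    """Rename fields pairwise; assumes the new names are not already in use."""
    mapping = dict(zip(old_names, new_names))
    return {mapping.get(key, key): value for key, value in fields.items()}


# Two primer matches written to the same field are appended, not overwritten.
fields = append_value(parse("PRIMER=VH3"), "PRIMER", "IGHM")
print(flatten(fields))    # prints: PRIMER=VH3,IGHM

expanded = expand(fields, "PRIMER")
print(flatten(expanded))  # prints: PRIMER1=VH3|PRIMER2=IGHM

renamed = rename(expanded, ["PRIMER1", "PRIMER2"], ["VPRIMER", "CPRIMER"])
print(flatten(renamed))   # prints: VPRIMER=VH3|CPRIMER=IGHM
```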

Copying, merging and collapsing annotations

Nested annotations can be generated using the copy or merge subcommands of ParseHeaders.py. The examples that follow will use the starting annotation:

UMI=ATGC|CELL=GGCC|COUNT=10,2

The UMI and CELL annotations can be combined into a single INDEX annotation using the following command:

ParseHeaders.py merge -s reads.fasta -f UMI CELL -k INDEX --delete
# result> COUNT=10,2|INDEX=ATGC,GGCC

Without the --delete argument, the original UMI and CELL annotations would be kept in the header.

The nested annotation values can then be combined using the collapse subcommand to create various effects:

ParseHeaders.py collapse -s reads_reheader.fasta -f INDEX --act cat
# result> INDEX=ATGCGGCC

ParseHeaders.py collapse -s reads_reheader.fasta -f INDEX --act first
# result> INDEX=ATGC

ParseHeaders.py collapse -s reads_reheader.fasta -f COUNT --act sum
# result> COUNT=12

ParseHeaders.py collapse -s reads_reheader.fasta -f COUNT --act min
# result> COUNT=2

where the --act argument specifies the type of collapse action to perform.
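
The four collapse actions shown above amount to simple reductions over the comma-delimited values. The sketch below models them in Python for illustration; the function name collapse is hypothetical, only the actions used in the examples are covered, and the numeric actions assume integer values.

```python
def collapse(value, action):
    """Reduce a comma-delimited annotation value to a single value.

    Illustrative only: covers the four actions used in the examples
    and treats numeric values as integers.
    """
    parts = value.split(",")
    if action == "cat":
        return "".join(parts)                  # concatenate all values
    if action == "first":
        return parts[0]                        # keep only the first value
    if action == "sum":
        return str(sum(int(p) for p in parts))
    if action == "min":
        return str(min(int(p) for p in parts))
    raise ValueError(f"unsupported action: {action}")


print(collapse("ATGC,GGCC", "cat"))    # prints: ATGCGGCC
print(collapse("ATGC,GGCC", "first"))  # prints: ATGC
print(collapse("10,2", "sum"))         # prints: 12
print(collapse("10,2", "min"))         # prints: 2
```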

The copy subcommand is normally used to create duplicate annotations with different names, but will have a similar effect to the merge subcommand when the target is an existing field:

ParseHeaders.py copy -s reads.fasta -f UMI -k CELL
# result> UMI=ATGC|CELL=GGCC,ATGC|COUNT=10,2

Both the copy and merge subcommands have an --act argument that allows you to perform an action from the collapse subcommand in the same step as the copy or merge:

ParseHeaders.py merge -s reads.fasta -f UMI CELL -k INDEX --delete --act cat
# result> COUNT=10,2|INDEX=ATGCGGCC

ParseHeaders.py copy -s reads.fasta -f UMI -k CELL --act cat
# result> UMI=ATGC|CELL=GGCCATGC|COUNT=10,2
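
The relationship between merge and copy, and the way a collapse action can be folded into either step, can be sketched as below. This is a hedged model that reproduces the results shown in this section, not pRESTO's implementation: the helper names (parse, flatten, cat, merge, copy) and the act callable are assumptions made for illustration.

```python
def parse(annotations):
    """Split 'A=1|B=2' into a dict of field -> value string (order kept)."""
    return dict(item.split("=", 1) for item in annotations.split("|"))


def flatten(fields):
    """Join a dict of fields back into 'A=1|B=2' form."""
    return "|".join(f"{key}={value}" for key, value in fields.items())


def cat(value):
    """Collapse action: concatenate comma-delimited values."""
    return value.replace(",", "")


def merge(fields, sources, target, delete=False, act=None):
    """Combine several source fields into one target field."""
    merged = ",".join(fields[name] for name in sources)
    # Optionally drop the source fields from the output header.
    out = {k: v for k, v in fields.items() if not (delete and k in sources)}
    if target in out:
        merged = f"{out[target]},{merged}"
    out[target] = act(merged) if act else merged
    return out


def copy(fields, source, target, act=None):
    """Copy one field into another; an existing target is appended to."""
    out = dict(fields)
    value = f"{out[target]},{fields[source]}" if target in out else fields[source]
    out[target] = act(value) if act else value
    return out


start = parse("UMI=ATGC|CELL=GGCC|COUNT=10,2")

print(flatten(merge(start, ["UMI", "CELL"], "INDEX", delete=True)))
# prints: COUNT=10,2|INDEX=ATGC,GGCC
print(flatten(copy(start, "UMI", "CELL")))
# prints: UMI=ATGC|CELL=GGCC,ATGC|COUNT=10,2
print(flatten(merge(start, ["UMI", "CELL"], "INDEX", delete=True, act=cat)))
# prints: COUNT=10,2|INDEX=ATGCGGCC
print(flatten(copy(start, "UMI", "CELL", act=cat)))
# prints: UMI=ATGC|CELL=GGCCATGC|COUNT=10,2
```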

Deleting annotations

Unwanted annotations can be deleted using the delete subcommand of ParseHeaders.py:

ParseHeaders.py delete -s reads.fastq -f PRIMER

which will remove the PRIMER field from each sequence header.