BAM format additions for PacBio-subreads

PacBio-subread BAM flavors

Data generated by the PacBio basecaller is stored in subreads and scraps BAM files, which CCS consumes to generate HiFi reads. For PacBio in-house analysis, the same files can be used to measure and characterize basecalling performance or to develop new methods for HiFi generation. These use cases require extra information to be carried in our BAM files.

The subreads and scraps files are fully compliant with the PacBio BAM spec (with the spec version noted in the @HD::pb tag) but include additional per-read tags that carry this extra information.

QNAME convention

By convention the QNAME (“query template name”) for unrolled reads and subreads is in the following format:

{movieName}/{holeNumber}/{qStart}_{qEnd}

where [qStart, qEnd) is the 0-based, half-open coordinate interval representing the span of the query in the ZMW read.
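The QNAME layout can be unpacked with plain string splitting. The following is a minimal sketch, not part of the spec: the names `SubreadName` and `parse_qname` are hypothetical, and the movie name in the example is made up for illustration.

```python
from typing import NamedTuple


class SubreadName(NamedTuple):
    movie_name: str
    hole_number: int
    q_start: int  # inclusive, 0-based
    q_end: int    # exclusive


def parse_qname(qname: str) -> SubreadName:
    """Split '{movieName}/{holeNumber}/{qStart}_{qEnd}' into typed fields."""
    movie_name, hole_number, span = qname.split("/")
    q_start, q_end = span.split("_")
    name = SubreadName(movie_name, int(hole_number), int(q_start), int(q_end))
    if not 0 <= name.q_start <= name.q_end:
        raise ValueError(f"invalid query interval in QNAME: {qname!r}")
    return name


# Hypothetical movie name, for illustration only.
example = parse_qname("m00001_000000_000000/42/100_350")
# The interval is half-open, so the length needs no "+ 1".
query_length = example.q_end - example.q_start
```

Because the interval is half-open, the query length is simply qEnd minus qStart.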

BAM filename conventions

Since we will be using BAM format for different kinds of data, we will use a suffix.bam filename convention:

Data type                                          Filename template
-------------------------------------------------  --------------------------
ZMW reads from movie                               movieName.zmws.bam
Analysis-ready subreads [1] from movie             movieName.subreads.bam
Excised adapters, barcodes, and rejected subreads  movieName.scraps.bam
Aligned subreads in a job                          jobID.aligned_subreads.bam

[1] Data in a subreads.bam file should be analysis ready, meaning that all of the data present is expected to be useful for downstream analyses. Any subreads that we have strong evidence will not be useful (e.g., double-adapter inserts, single-molecule artifacts) should be excluded from this file and placed in scraps.bam as Filtered, with an sc tag of F.
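As an illustration of the suffix convention, a tool could recover the data type and the movie name or job ID from a filename with a simple lookup. This is a sketch only: `SUFFIX_TO_DATA_TYPE` and `describe_bam` are hypothetical names, and the filenames in the example are made up.

```python
# Suffix -> data type, copied from the filename convention table.
SUFFIX_TO_DATA_TYPE = {
    ".zmws.bam": "ZMW reads from movie",
    ".subreads.bam": "Analysis-ready subreads from movie",
    ".scraps.bam": "Excised adapters, barcodes, and rejected subreads",
    ".aligned_subreads.bam": "Aligned subreads in a job",
}


def describe_bam(filename: str):
    """Return (prefix, data type), where prefix is the movie name or job ID."""
    for suffix, data_type in SUFFIX_TO_DATA_TYPE.items():
        if filename.endswith(suffix):
            return filename[: -len(suffix)], data_type
    raise ValueError(f"unrecognized BAM filename suffix: {filename!r}")


# Hypothetical filenames, for illustration only.
movie, movie_type = describe_bam("movie1.scraps.bam")
job, job_type = describe_bam("job7.aligned_subreads.bam")
```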

Use of headers for file-level information

Beyond the usual information encoded in headers as called for by the SAM/BAM spec, and what is added for customer-facing PacBio BAM files, we encode special information as follows.

@RG (read group) header entries:

DS tag (“description”):

contains some semantic information about the reads in the group, encoded as a semicolon-delimited list of “Key=Value” strings, as follows:

Base feature manifest—absent item means feature absent from reads:

Key              Value spec                            Value example
---------------  ------------------------------------  -------------
DeletionQV       Name of tag used for DeletionQV       dq
DeletionTag      Name of tag used for DeletionTag      dt
InsertionQV      Name of tag used for InsertionQV      iq
MergeQV          Name of tag used for MergeQV          mq
SubstitutionQV   Name of tag used for SubstitutionQV   sq
SubstitutionTag  Name of tag used for SubstitutionTag  st
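Reading the base feature manifest out of a DS string amounts to splitting on semicolons and then on the first equals sign. The sketch below assumes that keys and values contain no semicolons. The function names and the example DS value are hypothetical, and a real DS string may carry further keys not shown here.

```python
BASE_FEATURES = (
    "DeletionQV", "DeletionTag", "InsertionQV",
    "MergeQV", "SubstitutionQV", "SubstitutionTag",
)


def parse_ds(ds: str) -> dict:
    """Parse a semicolon-delimited list of Key=Value strings into a dict."""
    fields = {}
    for item in ds.split(";"):
        if not item:
            continue  # tolerate a trailing or doubled semicolon
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def base_feature_manifest(ds: str) -> dict:
    """Map each base feature present in the read group to its tag name."""
    fields = parse_ds(ds)
    # An absent key means the feature is absent from the reads.
    return {key: fields[key] for key in BASE_FEATURES if key in fields}


# Made-up DS value, for illustration only.
manifest = base_feature_manifest(
    "READTYPE=SUBREAD;DeletionQV=dq;DeletionTag=dt;InsertionQV=iq"
)
has_merge_qv = "MergeQV" in manifest
```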

Use of read tags for per-read information

Tag  Type  Description
---  ----  -----------
ws   i     Start of first base of the query (‘qs’) in approximate raw frame count since start of movie.
we   i     Start of last base of the query (‘qe - 1’) in approximate raw frame count since start of movie.

Use of read tags for per-read-base information

The following read tags encode features measured/calculated per-basecall. Unlike SEQ and QUAL, aligners will not orient these tags. They will be maintained in native orientation (in the same order and sense as collected from the instrument) even if the read record has been aligned to the reverse strand.

Tag  Type        Description
---  ----------  -----------------------------------
dq   Z           DeletionQV
dt   Z           DeletionTag
ip   B,C or B,S  IPD (raw frames or codec V1)
iq   Z           InsertionQV
mq   Z           MergeQV
pw   B,C or B,S  PulseWidth (raw frames or codec V1)
sq   Z           SubstitutionQV
st   Z           SubstitutionTag

Notes:

  • QV metrics are ASCII+33 encoded as strings

  • DeletionTag and SubstitutionTag represent alternate basecalls, or “N” when there is no alternate basecall available. In other words, they are strings over the alphabet “ACGTN”.
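The two notes above can be sketched in a few lines: subtract 33 from each character code to recover integer QVs, and check alternate-basecall strings against the alphabet “ACGTN”. The function names and the sample tag values are hypothetical and serve only as illustration.

```python
from typing import List

BASE_TAG_ALPHABET = set("ACGTN")


def decode_qvs(encoded: str) -> List[int]:
    """Decode an ASCII+33 string (a dq, iq, mq, or sq value) into integer QVs."""
    return [ord(ch) - 33 for ch in encoded]


def is_valid_base_tag(value: str) -> bool:
    """Check that a dt or st value uses only the alphabet 'ACGTN'."""
    return set(value) <= BASE_TAG_ALPHABET


# Made-up tag values for a four-base read, for illustration only.
deletion_qvs = decode_qvs("!+5I")
deletion_tag_ok = is_valid_base_tag("NANG")
```

Since these tags stay in native orientation, a consumer working with a record aligned to the reverse strand has to account for the orientation difference itself when pairing these values with SEQ.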

How to annotate scrap reads

Reads that belong to a read group with READTYPE=SCRAP have to be annotated in a hierarchical fashion:

  1. Classification with tag sz occurs at the per-ZMW level, distinguishing between spike-in controls, sentinels of the basecaller, malformed ZMWs, and user-defined templates.

  2. Region-wise annotation with tag sc labels adapters, barcodes, low-quality regions, and filtered subreads.

Tag  Type  Description
---  ----  -----------
sz   A     ZMW classification annotation, one of N:=Normal, C:=Control, M:=Malformed, or S:=Sentinel [1]
sc   A     Scrap region-type annotation, one of A:=Adapter, B:=Barcode, L:=LQRegion, or F:=Filtered [2]

[1] Reads in the subreads/hqregions/zmws.bam files are implicitly marked as Normal, as they stem from user-defined templates.

[2] sc tags ‘A’, ‘B’, and ‘L’ denote specific classes of non-subread data, whereas the ‘F’ tag is reserved for subreads that are undesirable for downstream analysis, e.g., being artifactual or too short.
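The two-level scrap annotation maps onto two small lookup tables. This is a sketch under the assumption that each scrap record carries exactly one sz and one sc character. `describe_scrap` and `is_rejected_subread` are hypothetical helper names.

```python
ZMW_CLASSES = {"N": "Normal", "C": "Control", "M": "Malformed", "S": "Sentinel"}
SCRAP_REGIONS = {"A": "Adapter", "B": "Barcode", "L": "LQRegion", "F": "Filtered"}


def describe_scrap(sz: str, sc: str):
    """Translate the sz (per-ZMW) and sc (per-region) codes into names."""
    if sz not in ZMW_CLASSES:
        raise ValueError(f"unknown sz value: {sz!r}")
    if sc not in SCRAP_REGIONS:
        raise ValueError(f"unknown sc value: {sc!r}")
    return ZMW_CLASSES[sz], SCRAP_REGIONS[sc]


def is_rejected_subread(sc: str) -> bool:
    """'F' marks a subread filtered out; 'A', 'B', 'L' are non-subread data."""
    return sc == "F"


zmw_class, region_type = describe_scrap("N", "F")
```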

Subread local context

Some algorithms can make use of knowledge that a subread was flanked on both sides by adapter or barcode hits, or that the subread was in one orientation or the other (as can be deduced when asymmetric adapters or barcodes are used).

To facilitate such algorithms, we furnish the cx bitmask tag for subread records. The cx value is calculated by binary OR-ing together values from this flags enum:

enum LocalContextFlags
{
    ADAPTER_BEFORE     = 1,
    ADAPTER_AFTER      = 2,
    BARCODE_BEFORE     = 4,
    BARCODE_AFTER      = 8,
    FORWARD_PASS       = 16,
    REVERSE_PASS       = 32,
    ADAPTER_BEFORE_BAD = 64,
    ADAPTER_AFTER_BAD  = 128
};

Orientation of a subread (designated by one of the mutually exclusive FORWARD_PASS or REVERSE_PASS bits) can be reckoned only if either the adapter or the barcode design is asymmetric; otherwise these flags must be left unset. What counts as a “forward” or “reverse” pass is determined by a per-ZMW convention, which defines one element of the asymmetric barcode/adapter pair as the “front” and the other as the “back”. It is up to tools producing the BAM to determine whether to use adapters or barcodes to reckon the orientation, but if pass directions cannot be confidently and consistently assessed for the subreads from a ZMW, neither orientation flag should be set. Tools consuming the BAM should be aware that orientation information may be unavailable for subreads in a ZMW, but if it is available for any subread in the ZMW, it will be available for all subreads in the ZMW.

The ADAPTER_* and BARCODE_* flags reflect whether the subread is flanked by adapters or barcodes at the ends.

The ADAPTER_BEFORE_BAD and ADAPTER_AFTER_BAD flags indicate that one or both adapters flanking this subread do not align to the adapter reference sequence(s). The adapter on this flank could be missing from the pbell molecule, or obscured by a local decrease in accuracy. Likewise, some nearby barcode or insert bases may be missing or obscured. An ADAPTER_*_BAD flag cannot be set unless the corresponding ADAPTER_* flag is set.

This tag is mandatory for subread records, but will be absent from non-subread records (scraps, ZMW reads, CCS reads, etc.).

Tag  Type  Description
---  ----  ---------------------------
cx   i     Subread local context flags
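The constraints on cx described in this section (mutually exclusive orientation bits, and ADAPTER_*_BAD only alongside the matching ADAPTER_* bit) can be checked mechanically. The sketch below mirrors the enum values from the spec in a Python `IntFlag`. `validate_cx` and `orientation` are hypothetical helper names, not part of any PacBio library.

```python
from enum import IntFlag


class LocalContextFlags(IntFlag):
    ADAPTER_BEFORE = 1
    ADAPTER_AFTER = 2
    BARCODE_BEFORE = 4
    BARCODE_AFTER = 8
    FORWARD_PASS = 16
    REVERSE_PASS = 32
    ADAPTER_BEFORE_BAD = 64
    ADAPTER_AFTER_BAD = 128


F = LocalContextFlags


def validate_cx(cx: int) -> LocalContextFlags:
    """Return cx as flags, rejecting combinations the text rules out."""
    flags = F(cx)
    if (flags & F.FORWARD_PASS) and (flags & F.REVERSE_PASS):
        raise ValueError("FORWARD_PASS and REVERSE_PASS are mutually exclusive")
    if (flags & F.ADAPTER_BEFORE_BAD) and not (flags & F.ADAPTER_BEFORE):
        raise ValueError("ADAPTER_BEFORE_BAD requires ADAPTER_BEFORE")
    if (flags & F.ADAPTER_AFTER_BAD) and not (flags & F.ADAPTER_AFTER):
        raise ValueError("ADAPTER_AFTER_BAD requires ADAPTER_AFTER")
    return flags


def orientation(flags: LocalContextFlags):
    """Return 'forward', 'reverse', or None when orientation is unavailable."""
    if flags & F.FORWARD_PASS:
        return "forward"
    if flags & F.REVERSE_PASS:
        return "reverse"
    return None


# A subread flanked by adapters on both sides, read in the forward pass.
cx = F.ADAPTER_BEFORE | F.ADAPTER_AFTER | F.FORWARD_PASS
flags = validate_cx(int(cx))
```

A consumer that receives None from `orientation` for one subread of a ZMW can expect the same for every other subread of that ZMW, per the convention above.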