BAM format additions for PacBio-internal analysis¶
PacBio-internal BAM flavors¶
Several PacBio-internal use cases require extra information to be carried in our BAM files. There is currently a single “internal” flavor of the BAM spec documented here.
The internal analysis files will be fully compliant with the PacBio BAM spec
(with spec version noted in the @HD::pb
tag), are an extension of the
subreads BAM spec, but will include additional per-read tags containing
additional information.
Kinetic information¶
The internal BAM format requires that IPD and Pulsewidth are stored losslessy (for both base and pulse level tracks).
Additionally, in order to identify the precise start frame corresponding to a pulse/base event, records in the internal BAM format includes a vector tag recording the start frame of each pulse.
Feature
Tag
Type
Example
Comment
Start frame
sf
B,I
- 10, 15,
23, 28
Array containing the first frame number in the pulse, for every pulse.
Note that the precise start frame of a pulse or base event must be identified using this tag; it cannot safely be identified by doing a cumulative sum of (pulse-width-in-frames + pre-pulse-frames), because occasionally the pre-pulse-frames cannot be represented exactly due to the truncation to max(uint16) before storage in the BAM. (The same held true for RS bas/pls .h5 files).
Pulse features¶
Background: A spike in the trace can be identified as a pulse and a pulse can be called as a base. Thus, the number of pulse calls is at least equal the number of base calls, as they are also pulses. As some pulses do not qualify as bases, the number of pulses is possibly greater than the number of bases.
The pulse BAM extends the vanilla PacBio BAM format with additional per-read tags. These new pulse tags are of equal length, one entry per pulse:
Feature
Tag name
Type
Example
Comment
Pulse call
pc
Z
GaAT
Lowercase used to indicate a pulsecall that was “squashed” by P2B, uppercases are the respective base calls from the SEQ field
LabelQV
pq
Z
20,20,12,20
TODO
AltLabel
pt
Z
—C
“second best” label; ‘-’ if no alternative applicable
MergeQV
pg
Z
TODO
Pulse mean signal (pkmean)
pa
B,S
2,3,2,4
Only includes signal measure for the “called” channel Units: photoelectron
Pulse mid signal (pkmid)
pm
B,S
3,3,4,3
Mean, omitting edge frames Units: photoelectron
Pre-pulse frames
pd
B,S
8,5,5,8
Pre-pulse frames, truncated to max(uint16).
Pulse width (frames)
px
B,S
2,2,4,5
Pulse width in frames, truncated to max(uint16).
Pulse exclusion reason
pe
B,C
0,2,2,0
0=Base, 1=Short Pulse, 2=Burst, 3=Pause (uint8).
Note that we encode the entire pulse stream and its attendant features, even though some of these are at least partially redundant with base-level features.
Baseline sigma¶
Additionally we need to encode the baseline sigma for each channel for a read. The baseline sigma is a piecewise constant function of time, changing at an interval on the order of 10 to 100 seconds (i.e., slowly!). We tally the number of pulses in each interval (“block”) and the baseline sigma for each channel during that block, as follows:
“bs” tag = BaselineSigma = { A_0, C_0, G_0, T_0, A_1, C_1, G_1, T_1, … } (as float32[] / B,f), where subscript denotes block number.
“pb” tag = PulseBlockSize = number of pulses in each block (uint32[], B,I)
Thus, for example, the first pb[0] pulses have baseline sigma bs[0] for the A channel.
Note that for RS data, baseline sigma is only calculated once per ZMW; for Sequel, it is calculated per-time-block per-ZMW, hence the need for an array.
Unresolved questions¶
Where will baseline information be stored? Current plan is to store it in
sts.h5
file (which needs a spec of its own).The pkmid/pkmean values are stored in photoelections, which means we need the gain in order to compute the values in counts. This has internal value when back-converting internal BAMs to pls.h5 files, which uses counts to represent pk values.