==========================================
BAM format additions for *PacBio-subreads*
==========================================

PacBio-subread BAM flavors
==========================

Data generated by the PacBio basecaller is stored in **subreads** and
**scraps** BAM files, which CCS consumes to generate HiFi reads. For
PacBio in-house analysis, these files can also be used to measure and
characterize basecalling performance or to develop new methods for HiFi
generation. Those use cases require extra information to be carried in
our BAM files.

The subreads and scraps files are fully compliant with the PacBio BAM
spec (with the spec version noted in the ``@HD::pb`` tag) but include
extra per-read tags carrying that additional information.

QNAME convention
================

By convention the ``QNAME`` ("query template name") for unrolled reads
and subreads is in the following format::

    {movieName}/{holeNumber}/{qStart}_{qEnd}

where ``[qStart, qEnd)`` is the 0-based coordinate interval representing
the span of the *query* in the ZMW read.

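
The naming convention above can be unpacked with a few lines of Python.
This is a minimal sketch, not part of any PacBio tool: the names
``SubreadName`` and ``parse_qname`` and the example ``QNAME`` are
hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class SubreadName(NamedTuple):
    """Fields of a '{movieName}/{holeNumber}/{qStart}_{qEnd}' QNAME."""
    movie_name: str
    hole_number: int
    q_start: int  # 0-based, inclusive
    q_end: int    # 0-based, exclusive


def parse_qname(qname: str) -> SubreadName:
    """Split a subread-style QNAME into its four fields (hypothetical helper)."""
    try:
        # Split from the right so only the last two slashes are significant.
        movie_name, hole_number, interval = qname.rsplit("/", 2)
        q_start, q_end = interval.split("_")
        parsed = SubreadName(movie_name, int(hole_number), int(q_start), int(q_end))
    except ValueError as exc:
        raise ValueError(f"not a subread-style QNAME: {qname!r}") from exc
    if not 0 <= parsed.q_start <= parsed.q_end:
        raise ValueError(f"invalid query interval in QNAME: {qname!r}")
    return parsed


# Hypothetical QNAME, for illustration only.
name = parse_qname("m64012_200101_000000/4391137/120_5230")
print(name.hole_number)            # 4391137
print(name.q_end - name.q_start)   # 5110 (length of the half-open interval)
```

Because the interval is half-open, the query length is simply
``qEnd - qStart``.
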
BAM filename conventions
========================

Since we will be using BAM format for different kinds of data, we will
use a ``suffix.bam`` filename convention:

+------------------------------------+------------------------------+
| Data type                          | Filename template            |
+====================================+==============================+
| ZMW reads from movie               | *movieName*.zmws.bam         |
+------------------------------------+------------------------------+
| Analysis-ready subreads :sup:`1`   | *movieName*.subreads.bam     |
| from movie                         |                              |
+------------------------------------+------------------------------+
| Excised adapters, barcodes, and    | *movieName*.scraps.bam       |
| rejected subreads                  |                              |
+------------------------------------+------------------------------+
| Aligned subreads in a job          | *jobID*.aligned_subreads.bam |
+------------------------------------+------------------------------+

:sup:`1` Data in a ``subreads.bam`` file should be "analysis ready",
meaning that all of the data present is expected to be useful for
downstream analyses. Any subread that we have strong evidence will not
be useful (e.g. double-adapter inserts, single-molecule artifacts)
should be excluded from this file and placed in ``scraps.bam`` as a
``Filtered`` region with an ``sc`` tag of ``F``.

Use of headers for file-level information
=========================================

Beyond the usual information encoded in headers that is called for by
the SAM/BAM spec and what is added for customer-facing PacBio BAM
files, we encode special information as follows.

``@RG`` (read group) header entries:

``DS`` tag ("description"): contains some semantic information about
the reads in the group, encoded as a semicolon-delimited list of
"Key=Value" strings, as follows:

**Base feature manifest (an absent item means the feature is absent from the reads):**

+---------------------+-----------------------------------------+----------------+
| Key                 | Value spec                              | Value example  |
+=====================+=========================================+================+
| DeletionQV          | Name of tag used for DeletionQV         | dq             |
+---------------------+-----------------------------------------+----------------+
| DeletionTag         | Name of tag used for DeletionTag        | dt             |
+---------------------+-----------------------------------------+----------------+
| InsertionQV         | Name of tag used for InsertionQV        | iq             |
+---------------------+-----------------------------------------+----------------+
| MergeQV             | Name of tag used for MergeQV            | mq             |
+---------------------+-----------------------------------------+----------------+
| SubstitutionQV      | Name of tag used for SubstitutionQV     | sq             |
+---------------------+-----------------------------------------+----------------+
| SubstitutionTag     | Name of tag used for SubstitutionTag    | st             |
+---------------------+-----------------------------------------+----------------+

Use of read tags for per-read information
=========================================

+-----------+------------+-------------------------------------------------------------------------+
| **Tag**   | **Type**   | **Description**                                                         |
+===========+============+=========================================================================+
| ws        | i          | Start of first base of the query ('qs') in approximate raw frame count  |
|           |            | since start of movie.                                                   |
+-----------+------------+-------------------------------------------------------------------------+
| we        | i          | Start of last base of the query ('qe - 1') in approximate raw frame     |
|           |            | count since start of movie.                                             |
+-----------+------------+-------------------------------------------------------------------------+

Use of read tags for per-read-base information
==============================================

The following read tags encode features measured/calculated
per-basecall. Unlike ``SEQ`` and ``QUAL``, aligners will not orient
these tags. They will be maintained in *native* orientation (in the
same order and sense as collected from the instrument) even if the read
record has been aligned to the reverse strand.

+-----------+---------------+----------------------------------------------------+
| **Tag**   | **Type**      | **Description**                                    |
+===========+===============+====================================================+
| dq        | Z             | DeletionQV                                         |
+-----------+---------------+----------------------------------------------------+
| dt        | Z             | DeletionTag                                        |
+-----------+---------------+----------------------------------------------------+
| ip        | B,C *or* B,S  | IPD (raw frames or codec V1)                       |
+-----------+---------------+----------------------------------------------------+
| iq        | Z             | InsertionQV                                        |
+-----------+---------------+----------------------------------------------------+
| mq        | Z             | MergeQV                                            |
+-----------+---------------+----------------------------------------------------+
| pw        | B,C *or* B,S  | PulseWidth (raw frames or codec V1)                |
+-----------+---------------+----------------------------------------------------+
| sq        | Z             | SubstitutionQV                                     |
+-----------+---------------+----------------------------------------------------+
| st        | Z             | SubstitutionTag                                    |
+-----------+---------------+----------------------------------------------------+

Notes:

- QV metrics are ASCII+33 encoded as strings
- *DeletionTag* and *SubstitutionTag* represent alternate basecalls, or
  "N" when there is no alternate basecall available. In other words,
  they are strings over the alphabet "ACGTN".

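
As a rough illustration of the manifest and encoding rules above, the
following Python sketch parses a ``DS`` value into a feature-to-tag
mapping and decodes an ASCII+33 QV string. The helper names and the
example ``DS`` string are hypothetical, and the reversal applied for
reverse-strand records is an assumption about how a consumer might line
native-orientation values up with the aligner-oriented ``SEQ``.

```python
from typing import Dict, List


def parse_ds_manifest(ds: str) -> Dict[str, str]:
    """Parse a semicolon-delimited list of Key=Value strings from an @RG DS tag."""
    manifest = {}
    for item in ds.split(";"):
        if not item:
            continue  # tolerate empty items, e.g. a trailing semicolon
        key, _, value = item.partition("=")
        manifest[key] = value
    return manifest


def decode_qvs(encoded: str) -> List[int]:
    """Decode an ASCII+33 string (as in dq, iq, mq, sq) into integer QVs."""
    return [ord(ch) - 33 for ch in encoded]


def qvs_in_seq_order(encoded: str, is_reverse_strand: bool) -> List[int]:
    """Decode QVs and, for reverse-strand alignments, reverse them so that
    index i matches base i of the aligner-oriented SEQ (assumed usage)."""
    qvs = decode_qvs(encoded)
    return qvs[::-1] if is_reverse_strand else qvs


# Hypothetical DS value and tag content, for illustration only.
manifest = parse_ds_manifest("READTYPE=SUBREAD;DeletionQV=dq;InsertionQV=iq")
print(manifest.get("DeletionQV"))   # dq
print("MergeQV" in manifest)        # False: feature absent from these reads
print(decode_qvs("+5I"))            # [10, 20, 40]
```

Note that *DeletionTag* and *SubstitutionTag* hold bases rather than
QVs, so a consumer reorienting them would presumably need to
reverse-complement them instead of only reversing.
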
How to annotate scrap reads
===========================

Reads that belong to a read group with READTYPE=SCRAP must be annotated
hierarchically:

1) Classification with tag *sz* occurs at the per-ZMW level,
   distinguishing between spike-in controls, sentinels of the
   basecaller, malformed ZMWs, and user-defined templates.
2) Region-wise annotation with tag *sc* labels adapters, barcodes,
   low-quality regions, and filtered subreads.

+-----------+---------------+-----------------------------------------+
| **Tag**   | **Type**      | **Description**                         |
+===========+===============+=========================================+
| sz        | A             | ZMW classification annotation, one of   |
|           |               | N:=Normal, C:=Control, M:=Malformed,    |
|           |               | or S:=Sentinel :sup:`1`                 |
+-----------+---------------+-----------------------------------------+
| sc        | A             | Scrap region-type annotation, one of    |
|           |               | A:=Adapter, B:=Barcode, L:=LQRegion,    |
|           |               | or F:=Filtered :sup:`2`                 |
+-----------+---------------+-----------------------------------------+

:sup:`1` Reads in the subreads/hqregions/zmws.bam file are implicitly
marked as Normal, as they stem from user-defined templates.

:sup:`2` sc tags 'A', 'B', and 'L' denote specific classes of
non-subread data, whereas the 'F' tag is reserved for subreads that are
undesirable for downstream analysis, e.g., being artifactual or too
short.

Subread local context
=====================

Some algorithms can make use of knowledge that a subread was flanked on
both sides by adapter or barcode hits, or that the subread was in one
orientation or the other (as can be deduced when asymmetric adapters or
barcodes are used). To facilitate such algorithms, we furnish the
``cx`` bitmask tag for subread records.

The ``cx`` value is calculated by bitwise OR-ing together values from
this flags enum::

    enum LocalContextFlags
    {
        ADAPTER_BEFORE     = 1,
        ADAPTER_AFTER      = 2,
        BARCODE_BEFORE     = 4,
        BARCODE_AFTER      = 8,
        FORWARD_PASS       = 16,
        REVERSE_PASS       = 32,
        ADAPTER_BEFORE_BAD = 64,
        ADAPTER_AFTER_BAD  = 128
    };

Orientation of a subread (designated by one of the mutually exclusive
``FORWARD_PASS`` or ``REVERSE_PASS`` bits) can be reckoned only if
either the adapter or the barcode design is asymmetric; otherwise these
flags must be left unset. The convention for what is considered a
"forward" or "reverse" pass is determined by a per-ZMW convention,
defining one element of the asymmetric barcode/adapter pair as the
"front" and the other as the "back". It is up to tools producing the
BAM to determine whether to use adapters or barcodes to reckon the
orientation, but if pass directions cannot be confidently and
consistently assessed for the subreads from a ZMW, neither orientation
flag should be set. Tools consuming the BAM should be aware that
orientation information may be unavailable for subreads in a ZMW, but
if it is available for any subread in the ZMW, it will be available for
all subreads in the ZMW.

The ``ADAPTER_*`` and ``BARCODE_*`` flags reflect whether the subread
is flanked by adapters or barcodes at the ends.

The ``ADAPTER_BEFORE_BAD`` and ``ADAPTER_AFTER_BAD`` flags indicate
that one or both adapters flanking this subread do not align to the
adapter reference sequence(s). The adapter on this flank could be
missing from the pbell molecule, or obscured by a local decrease in
accuracy. Likewise, some nearby barcode or insert bases may be missing
or obscured. ``ADAPTER_*_BAD`` flags cannot be set unless the
corresponding ``ADAPTER_*`` flag is set.

This tag is mandatory for subread records, but will be absent from
non-subread records (scraps, ZMW read, CCS read, etc.).

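
The flag rules above can be sketched in Python with ``enum.IntFlag``.
This is an illustrative sketch only: the enum values are taken from the
listing above, while the helpers ``check_cx`` and ``flanked_by_adapters``
are hypothetical names that are not part of any PacBio API.

```python
from enum import IntFlag


class LocalContextFlags(IntFlag):
    ADAPTER_BEFORE = 1
    ADAPTER_AFTER = 2
    BARCODE_BEFORE = 4
    BARCODE_AFTER = 8
    FORWARD_PASS = 16
    REVERSE_PASS = 32
    ADAPTER_BEFORE_BAD = 64
    ADAPTER_AFTER_BAD = 128


def check_cx(cx: int) -> LocalContextFlags:
    """Decode a cx value and enforce the constraints stated in the text."""
    F = LocalContextFlags
    flags = F(cx)
    # FORWARD_PASS and REVERSE_PASS are mutually exclusive.
    if F.FORWARD_PASS in flags and F.REVERSE_PASS in flags:
        raise ValueError("FORWARD_PASS and REVERSE_PASS are both set")
    # An ADAPTER_*_BAD flag requires the corresponding ADAPTER_* flag.
    if F.ADAPTER_BEFORE_BAD in flags and F.ADAPTER_BEFORE not in flags:
        raise ValueError("ADAPTER_BEFORE_BAD set without ADAPTER_BEFORE")
    if F.ADAPTER_AFTER_BAD in flags and F.ADAPTER_AFTER not in flags:
        raise ValueError("ADAPTER_AFTER_BAD set without ADAPTER_AFTER")
    return flags


def flanked_by_adapters(flags: LocalContextFlags) -> bool:
    """True if the subread has an adapter hit on both sides."""
    both = LocalContextFlags.ADAPTER_BEFORE | LocalContextFlags.ADAPTER_AFTER
    return (flags & both) == both


# cx = 19 is ADAPTER_BEFORE | ADAPTER_AFTER | FORWARD_PASS (1 + 2 + 16).
flags = check_cx(19)
print(flanked_by_adapters(flags))                 # True
print(LocalContextFlags.FORWARD_PASS in flags)    # True
print(flanked_by_adapters(check_cx(2)))           # False: adapter after only
```

A consumer should treat the absence of both orientation bits as
"orientation unknown" rather than as an error, since producers leave
them unset when pass direction cannot be assessed.
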
+-----------+---------------+----------------------------------------------------+
| **Tag**   | **Type**      | **Description**                                    |
+===========+===============+====================================================+
| cx        | i             | Subread local context flags                        |
+-----------+---------------+----------------------------------------------------+

.. _specifications for BAM/SAM: http://samtools.github.io/hts-specs/SAMv1.pdf
.. _SAM tags specifications: http://samtools.github.io/hts-specs/SAMtags.pdf