=================================================== PacBio Alignment File Format (cmp.h5) Specification =================================================== .. moduleauthor:: Jason Chin, Dale Webster, Susan Tang, Jim Bullard, Mark Chaisson, David Alexander, Dimitris Iliopoulos Revision History ================ .. tabularcolumns:: |r|r|L|J| +---------+------------+--------------------+-------------------------------+ | Version | Date | Authors |Comments | +=========+============+====================+===============================+ | 0.1 | 07/24/2009 | Jason Chin |First draft | +---------+------------+--------------------+-------------------------------+ | 0.2 | 11/04/2009 | Jason Chin |2nd draft, incorporated changes| | | | |from prototype | +---------+------------+--------------------+-------------------------------+ | 0.3 | 11/17/2009 | Susan Tang |Added consensus record | +---------+------------+--------------------+-------------------------------+ | 0.4 | 03/11/2009 | Jason Chin, |Added SF related spec and | | | | James Bullard |indexing proposal | +---------+------------+--------------------+-------------------------------+ | 0.5 | 07/06/2010 | Dale Webster |Added PB Internal Format Spec | +---------+------------+--------------------+-------------------------------+ | | | |Major Revision before v. | | | | |1.2. Remove all reference to | | | | |earlier Astro type cmp.h5. | | | | Jason Chin, |Meta-data group hierarchy | | | | James Bullard, |changed. New attributes | | | | Dale Webster, |added. Define a few file | | 1.2rc | 10/25/2010 | Dimitris |operation behaviors. | | | | Iliopoulos, Ali | | | | | Bashir |We call this document version | | | | |1.2rc to match the software | | | | |release version for FCR. | | | | |Preliminary support for strobe | | | | |read timing information. | +---------+------------+--------------------+-------------------------------+ | | | |Finalize 1.2 spec , updated | | 1.2 | 12/22/2010 | Jason Chin |examples, revise the FileLog | | | | |info group, remove TODO, remove| | | | |"rc" in the version string. | +---------+------------+--------------------+-------------------------------+ |1.3.1 | 03/6/2012 | David Alexander, |QV record types | | | | Mark Chaisson |changed. lastRow datasets | | | | |removed. Converted to | | | | |reStructuredText. Some material| | | | |moved to Appendices | +---------+------------+--------------------+-------------------------------+ |2.0.0 |02/12/2013 | David Alexander, |Addition of chemistry tag | | | | James Bullard |information per-movie. Removal| | | | |of master dataset constructs. | | | | |Sortedness of a file now | | | | |indicated by presence of | | | | |OffsetTable. | +---------+------------+--------------------+-------------------------------+ |2.1.0 |08/01/2013 | James Bullard |Addition of Barcode data | +---------+------------+--------------------+-------------------------------+ |2.3.0 |5/21/2014 | David Alexander |Document revised chemistry | | | | |encoding | +---------+------------+--------------------+-------------------------------+ File Format Versioning ====================== The ``cmp.h5`` file format version is stored in the root group attribute ``Version``. The version may take one of the following values: - "1.2.0" - "1.2.0.SF" - "1.2.0.PB" - "1.3.1.SF" - "1.3.1.PB" - "2.0.0" - "2.1.0" - "2.3.0" File formats with versions ending in ".SF" (for Springfield) represent the production file formats that are produced by instruments at customer sites. File formats with versions ending in ".PB" (for PacBio) may contain additional information. *Version "X.PB" files are always usable wherever an "X.SF" file is usable; i.e. PacBio internal files contain a superset of the features required in a Springfield file, and the same formatting conventions are observed.* Hierarchy Layout ================ In this section, we specify the general layout. At the top-level, or root group of the ``cmp.h5`` HDF5 file, there exist six HDF5 groups which must exist: ``AlnInfo``, ``RefInfo``, ``MovieInfo``, ``AlnGroup``, ``RefGroup``, ``FileLog``. There are basically three different categories of data groups: 1. The *Info* groups contain information about particular aspects of the data contained in the file to some external references, e.g., reference sequences used for alignments, movies information for the reads, and ZMW hole numbers, etc. These groups will be referred to as info groups. (The only exception of such convention is the ``FileLog`` group. It should be considered as an Info group even though the group does have a "Info" suffix. ) 2. The *Group* HDF5 groups contain information about how the data is stored in the file and function as key-value-pair mappings from integer IDs to character paths. Each "Group" HDF5 group will contain at least two datasets one of which will be called ID and the other will be called Path. The ID is the key used to refer to the HDF5 path stored in parallel in the Path dataset. To avoid ambiguity these groups will be referred to as mapping groups. 3. Additionally, at the top-level of the file, zero or more alignment data groups will exist---these groups contain the actual alignment data for each reference sequence and alignment group. These groups will be called data groups. All datasets stored under the same HDF5 group irrespective of type shall always have the same number of rows or, in the case of dimensionless vectors, length. Here we specify the minimal set of datasets in each of aforementioned groups: 1. An info group named ``AlnInfo`` containing information about each alignment stored in the file. The ``AlnInfo`` group should contain the following datasets: a. ``AlnIndex``: Dataset whose rows represent unique alignments and whose columns store relevant information about each alignment. The ``AlnIndex`` dataset has a string list attribute, ``ColumnNames``, containing the names of the columns of this dataset. b. (CCS only): A vector dataset ``NumPasses``, of the same length as ``AlnIndex``, indicating the number of CCS subreads that were used to generate the consensus read in the corresponding row of ``AlnIndex``. b. (Optional) Vector datasets, of the same length as ``AlnIndex``, the same storing information about each alignment (e.g., ``ZScore``, ``SNR``, and ``Edna``). 2. An info group named ``RefInfo`` containing information about the reference sequences used during alignment. The ``RefInfo`` group should contain the following datasets: a. ``ID``: Identifier of the record. b. ``FullName``: Name of the sequence as given by the FASTA file used during alignment. c. ``MD5``: md5 hashes of the DNA sequence used during alignment. .. note:: The MD5 convention used in cmp.h5 files differs from the standard convention in SAM files. SAM files store the "MD5 checksum of the sequence in the uppercase, with gaps and spaces removed." *cmp.h5 files contain the MD5 checksums of the reference contig sequences as present in the refernece FASTA file---case preserved, spaces and gaps intact (but newlines removed).* d. ``Length``: The length of the DNA sequence used during alignment. 3. An info group named ``MovieInfo`` containing information about the movies which produced the alignments. This ``MovieInfo`` group should contain the following datasets: a. ``ID``: Identifier of the record. b. ``Name``: Movie name. c. ``FrameRate``: The camera speed in frames per second used to record the movie. d. Datasets encoding information about the sequencing chemistry that was used. This is encoded in one of two manners: 1. Datasets ``SequencingKit``, ``BindingKit``, and ``SoftwareVersion`` represent the partnumbers read by the instrument barcode reader for each movie run, as well as the basecaller version. Decoding of this identifying "triple" for each movie is deferred to the tools that actually need to know the chemistry details---specifically, the Quiver variant/consensus calling tool and the base-modification identification tools. 2. *(Versions 2.2.0 and earlier, and manual override in 2.3.0 and after)* Dataset ``SequencingChemistry``, representing a canonical string representation (for example, "P4-C2") of the chemistry. Note that this places the burden for decoding of the barcode information on the software that constructs the ``cmp.h5`` rather than client software. Software that parses the ``cmp.h5`` format shall rely on the datasets in (1) as the canonical chemistry information, only falling back to the information in (2) if the datasets in (1) are absent. 4. An info group named ``FileLog`` containing information about the history of the file itself. a. ``ID``: Identifier of the record b. ``Program``: The name of the program that touches the file c. ``Version``: The version of the program that touches the file d. ``Timestamp``: A `W3C compatible timestamp`_ string of the date-time when the file is touched. e. ``CommandLine``: Detail command line string that details how the program is used f. ``Log``: The field to store any extra details 5. A mapping group named ``RefGroup`` that records the reference sequence information used in the alignments: The ``RefGroup`` group should contain the following datasets: a. ``ID`` b. ``Path`` c. ``RefInfoID``: ``RefInfoID`` refers to elements of the ``/RefInfo/ID`` dataset. 6. A mapping group named ``AlnGroup`` that records the different partitions of alignments. This data group should contains: a. ``ID`` b. ``Path`` 7. Zero or more data groups containing the actual alignments. The names of the groups are defined by the dataset ``/RefGroup/Path``. Each reference group contains one or more alignment groups (representing alignments from some predefined grouping, such as: SMRTcell, acquisition, or movie, etc). The full HDF5 paths to the alignment groups including the group names are defined in the dataset ``/AlnGroup/Path``. An alignment group should contain: a. A single alignment array dataset named ``AlnArray`` b. (Optional) Datasets for quality values and pulse features that can be aligned to the read bases. Detailed information about necessary datasets is defined in sections 10 and 11. 8. (Optional) User-defined datasets conforming to the conventions of simple HDF5 types and having the same length as each sibling in its containing group. It may be helpful to inspect the output of *h5ls* applied to a 1.3.1.SF cmp.h5 file:: mp-f052:~ $ h5ls -r ~/Data/new_cmph5/alignments.cmp.h5 / Group /AlnGroup Group /AlnGroup/ID Dataset {1/Inf} /AlnGroup/Path Dataset {1/Inf} /AlnInfo Group /AlnInfo/AlnIndex Dataset {16866/Inf, 22/Inf} /FileLog Group /FileLog/CommandLine Dataset {3/Inf} /FileLog/ID Dataset {3/Inf} /FileLog/Log Dataset {3/Inf} /FileLog/Program Dataset {3/Inf} /FileLog/Timestamp Dataset {3/Inf} /FileLog/Version Dataset {3/Inf} /MovieInfo Group /MovieInfo/FrameRate Dataset {1/Inf} /MovieInfo/SequencingChemistry Dataset {1/Inf} /MovieInfo/ID Dataset {1/Inf} /MovieInfo/Name Dataset {1/Inf} /RefGroup Group /RefGroup/ID Dataset {1/Inf} /RefGroup/OffsetTable Dataset {1/Inf, 3/Inf} /RefGroup/Path Dataset {1/Inf} /RefGroup/RefInfoID Dataset {1/Inf} /RefInfo Group /RefInfo/FullName Dataset {1/Inf} /RefInfo/ID Dataset {1/Inf} /RefInfo/Length Dataset {1/Inf} /RefInfo/MD5 Dataset {1/Inf} /ref000001 Group /ref000001/m120225_045819_richard_c100304312550000001523012308061200_s1_p0 Group /ref000001/m120225_045819_richard_c100304312550000001523012308061200_s1_p0/AlnArray Dataset {39434696/Inf} /ref000001/m120225_045819_richard_c100304312550000001523012308061200_s1_p0/DeletionQV Dataset {39434696/Inf} /ref000001/m120225_045819_richard_c100304312550000001523012308061200_s1_p0/DeletionTag Dataset {39434696/Inf} /ref000001/m120225_045819_richard_c100304312550000001523012308061200_s1_p0/IPD Dataset {39434696/Inf} /ref000001/m120225_045819_richard_c100304312550000001523012308061200_s1_p0/InsertionQV Dataset {39434696/Inf} /ref000001/m120225_045819_richard_c100304312550000001523012308061200_s1_p0/MergeQV Dataset {39434696/Inf} /ref000001/m120225_045819_richard_c100304312550000001523012308061200_s1_p0/PulseWidth Dataset {39434696/Inf} /ref000001/m120225_045819_richard_c100304312550000001523012308061200_s1_p0/QualityValue Dataset {39434696/Inf} /ref000001/m120225_045819_richard_c100304312550000001523012308061200_s1_p0/SubstitutionQV Dataset {39434696/Inf} /ref000001/m120225_045819_richard_c100304312550000001523012308061200_s1_p0/SubstitutionTag Dataset {39434696/Inf} Root Group Attributes ===================== The following mandatory string attributes should be set in the root group: +-------------+----------------+------------------------------------+ | Name | Allowed Values | Comment | +=============+================+====================================+ | | "1.2.0" | | | | "1.2.0.SF" |The suffix is used to indicate | | | "1.2.0.PB" |whether the file includes (".SF") or| | | "1.3.1.SF" |does not include (".PB") several | | | "1.3.1.PB" |datasets useful for in-house | | Version | "2.0.0" |analyses. | +-------------+----------------+------------------------------------+ | | |Set to "standard" by default. If the| | | "RCCS", "CCS", |cmp.h5 is used for "RCCS" and "CCS",| | ReadType | "strobe", |there will be no pulse | | | "standard", or |features. Each read type will allows| | | "cDNA" |different sets of optional tables. | +-------------+----------------+------------------------------------+ | | The command |This attribute is reserved for the | | CommandLine | line used for |initial generation. All | | | generating |post-initial alignment information | | | this file. |should be stored in FileLog | +-------------+----------------+------------------------------------+ Mapping Groups: ``ID``, and ``Path`` datasets ============================================= Each mapping group contains at least an ``ID`` and ``Path`` dataset. The ID dataset contains unique positive integer values. The ``Path`` dataset contains proper HDF5 paths to HDF5 groups within the file. Elements of the path dataset should conform to the following regular expression (leading forward slash not included): "[a-zA-Z\-+_0-9]+" (all lower and upper case ASCII characters, numbers, "-", and "+"). The ID, Path datasets function as key-value pair mappings. The individual IDs are used in datasets to reference the relevant information stored in this particular mapping group. The following `HDF5 DDL`_ defines the hdf5 data types for these data sets:: DATASET "ID" { DATATYPE H5T_STD_U32LE DATASPACE SIMPLE { ( * ) / ( H5S_UNLIMITED ) } } DATASET "Path" { DATATYPE H5T_STRING { STRSIZE H5T_VARIABLE; STRPAD H5T_STR_NULLTERM; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; } DATASPACE SIMPLE { ( * ) / ( H5S_UNLIMITED ) } } Two datasets are used to avoid compound types in an HDF5 file. This avoids the complication in reader/writer code implementations. If there is a mature compound type code base within the PBI development environment, compound type datasets are recommended for storing such key-value pairs. ``RefGroup`` data group and ``/RefGroup/*`` datasets ==================================================== The ``RefGroup`` mapping group provides a mapping between reference sequence identifiers (``ID``) to HDF5 paths in the file (``Path``). An example HDF5 schema can be seen above. A ``RefInfoID`` data set is used for pointing to the ``ID`` dataset in the RefInfo group and can be viewed as a foreign key. The following DDL code block defines the data types for the datasets and attributes associated with ``/RefGroup``:: GROUP "RefGroup" { DATASET "ID" { DATATYPE H5T_STD_U32LE DATASPACE SIMPLE { ( * ) / ( H5S_UNLIMITED ) } } DATASET "Path" { DATATYPE H5T_STRING { STRSIZE H5T_VARIABLE; STRPAD H5T_STR_NULLTERM; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; } DATASPACE SIMPLE { ( * ) / ( H5S_UNLIMITED ) } } DATASET "RefInfoID" { DATATYPE H5T_STD_U32LE DATASPACE SIMPLE { ( * ) / ( H5S_UNLIMITED ) } } } ``AlnGroup`` data group: ``/AlnGroup/*`` datasets ================================================= The ``AlnGroup`` mapping group provides a mapping between alignment group identifiers (``ID``) to alignment group paths. The following DDL code block defines the data types for the datasets and attributes associated with ``/AlnGroup``:: GROUP "AlnGroup" { DATASET "ID" { DATATYPE H5T_STD_U32LE DATASPACE SIMPLE { ( * ) / ( H5S_UNLIMITED ) } } DATASET "Path" { DATATYPE H5T_STRING { STRSIZE H5T_VARIABLE; STRPAD H5T_STR_NULLTERM; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; } DATASPACE SIMPLE { ( * ) / ( H5S_UNLIMITED ) } } } ``RefInfo`` info group and ``/RefInfo/*`` datasets ================================================== The ``RefInfo`` info group provides information about the reference sequences used during alignment. The ``RefInfo`` group contains at least 4 datasets including the ``ID`` dataset. The ``RefInfo/FullName`` provides the name of the sequence aligned to and is the full FASTA name. The ``RefInfo/MD5`` is an ``MD5`` hash of the reference sequence aligned to. The ``RefInfo/Length`` provides the length of the sequence aligned to. Other sequence specific annotations can be stored as parallel datasets at this level. The following DDL code block defines the data types for the datasets and attributes associated ``/RefInfo``:: GROUP "RefInfo" { DATASET "FullName" { DATATYPE H5T_STRING { STRSIZE H5T_VARIABLE; STRPAD H5T_STR_NULLTERM; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; } DATASPACE SIMPLE { ( * ) / ( H5S_UNLIMITED ) } } DATASET "ID" { DATATYPE H5T_STD_U32LE DATASPACE SIMPLE { ( * ) / ( H5S_UNLIMITED ) } } DATASET "Length" { DATATYPE H5T_STD_U32LE DATASPACE SIMPLE { ( * ) / ( H5S_UNLIMITED ) } } DATASET "MD5" { DATATYPE H5T_STRING { STRSIZE H5T_VARIABLE; STRPAD H5T_STR_NULLTERM; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; } DATASPACE SIMPLE { ( * ) / ( H5S_UNLIMITED ) } } } ``MovieInfo`` data group: ``MovieInfo/*`` datasets ================================================== The paired arrays ``MovieInfo/ID`` and ``MovieInfo/Name`` in the ``MovieInfo`` group are defined to indicate the source of the movies for the reads in the ``AlnInfo/AlnIndex`` dataset. This pair of arrays functions as a key-value-pair map between IDs and movie names. The following DDL code block defines the data types for the datasets and attributes associated ``/MovieInfo``:: GROUP "MovieInfo" { DATASET "ID" { DATATYPE H5T_STD_U32LE DATASPACE SIMPLE { ( * ) / ( H5S_UNLIMITED ) } } DATASET "Name" { DATATYPE H5T_STRING { STRSIZE H5T_VARIABLE; STRPAD H5T_STR_NULLTERM; CSET H5T_CSET_ASCII; CTYPE H5T_C_S1; } DATASPACE SIMPLE { ( * ) / ( H5S_UNLIMITED ) } } } ``AlnInfo`` data group and the ``AlnArray`` data sets ===================================================== ``AlnInfo`` data group ---------------------- The first column of the AlnIndex can be treated as the equivalent "ID" dataset in the mapping or the info groups. The data types of the dataset ``AlnIndex`` are defined as:: DATASET "AlnIndex" { DATATYPE H5T_STD_U32LE DATASPACE SIMPLE { ( *, 22 ) / ( H5S_UNLIMITED, 22 ) } } ``AlnIndex`` dataset -------------------- The purpose of the ``AlnIndex`` dataset is to: 1. Store the information necessary to retrieve alignments from the file. This includes: path, beginning offset, and ending offset within the dataset containing the alignment. (This kind of reference to alignment is similar to that proposed by HDF5 group in the bioHDF5 specification.) 2. Store the information, e.g., the orientation (i.e., strand) of the alignment, for processing the alignment properly for downstream bioinformatics analysis and visualization. 3. Store information that can be used to indentify the original reads. 4. Store the unique unsigned 32 bit integer ID as single unique key for each individual alignment. 5. Store summary information about the alignment. For example, one can store the number of matches, mismatches, insertions, deletions, mapping quality, read level quality values, etc. ``AlnIndex`` Dataset Columns ---------------------------- The 22 columns in the `AlnIndex` dataset are described in the table below. .. tabularcolumns:: |p{1in}|L|L| +--------------+--------------------------+-----------------------------+ | Column Name |Meaning | Comment | +==============+==========================+=============================+ | | | Each alignment should | | | | have a unique AlnID. No | | AlnID |Non-zero unique 32 bit | other assumption about | | |integer key for the | the order of the AlnID | | |alignment record | should be used for data | | | | processing. | +--------------+--------------------------+-----------------------------+ | |A foreign key referring to| | | AlnGroupID |AlnGroup/ID | | +--------------+--------------------------+-----------------------------+ | |A foreign key referring to| | | MovieID |MovieInfo/ID | | +--------------+--------------------------+-----------------------------+ | |A foreign key referring to| | | RefGroupID |RefGroup/ID. | | +--------------+--------------------------+-----------------------------+ | |The start position | tStart should always be | | |(0-based, inclusive) of | less than tEnd, even when | | tStart |the alignment target (the | the hit is against the | | |reference sequence) | opposite strand. | +--------------+--------------------------+-----------------------------+ | |The end position (0-based,| | | |not-inclusive) of the | tEnd should always be | | |alignment target (the | greater than tStart, even | | tEnd |reference sequence) | when the hit is against | | | | the opposite strand. | +--------------+--------------------------+-----------------------------+ | | | The read base should | | |The relative strand in the| never be | | |alignment. 1 for reversed | reverse-complimented in | | |reference strand; 0 for | the alignment array, so | | RCRefStrand |forward-forward alignment | we only need to record if | | | | the reference bases are | | | | presented in reverse | | | | complemented strand in | | | | the file. "1" means | | | | "Yes/True" here. | +--------------+--------------------------+-----------------------------+ | HoleNumber |The HoleNumber from the | | | |bas.h5 | | +--------------+--------------------------+-----------------------------+ | SetNumber | | | +--------------+--------------------------+-----------------------------+ | StrobeNumber | Context dependent value. | | | ExonNumber | When the read type is | | | | Strobe, this field is the| | | | strobe number. When the | | | | read type is cDNA it will| | | | be the exon number. | | +--------------+--------------------------+-----------------------------+ | | | If multiple subreads are | | | | from the same physical | | | | origin, they should have the| | MoleculeID |An integer which is unique| same MoleculeID and | | |to all subreads from the | different physical origins | | |same ZMW. | should have different | | | | MoleculeID. | +--------------+--------------------------+-----------------------------+ | |The start position | Regardless weather the | | |(0-based, inclusive) of | alignment is a subread or | | rStart |the read in the alignment | not, the position is | | | | always relative to the | | | | original raw full read | | | | sequence. | +--------------+--------------------------+-----------------------------+ | |The end position (0-based,| | | |not-inclusive) of the read| rEnd should always be | | rEnd |in the alignment | greater than rStart. | +--------------+--------------------------+-----------------------------+ | MapQV |TBD | | +--------------+--------------------------+-----------------------------+ | |Number of matched base in | | | nM |the alignment | | +--------------+--------------------------+-----------------------------+ | |Number of mis-matched base| | | nMM |in the alignment | | +--------------+--------------------------+-----------------------------+ | |Number of insertions in | | | |the read relative to the | | | nIns |reference sequence | | +--------------+--------------------------+-----------------------------+ | |Number of deletions | | | |(missing bases) in the | | | nDel |read relative to the | | | |reference sequence | | +--------------+--------------------------+-----------------------------+ | |The beginning position | | | |(0-based, inclusive) of | | | Offset_begin |the alignment in the | | | |AlignmentArray | | +--------------+--------------------------+-----------------------------+ | |The ending position | | | |(0-based, exclusive) the | Not including the padded | | Offset_end |alignment in the | zero of the alignment | | |AlignmentArray | array. | +--------------+--------------------------+-----------------------------+ | |Used for faster access to | See the sorting and | | nBackRead |blocks of sorted reads | indexing section | +--------------+--------------------------+-----------------------------+ | |Used for faster access to | See the sorting and | | nReadOverlap |blocks of sorted reads | indexing section | +--------------+--------------------------+-----------------------------+ The column names should be stored as an attribute ``ColumnNames`` that contains all names listed in "Column Name" in the table above. Sequence Alignments =================== Binary Encoding for Alignment Pair ---------------------------------- The *alignment array* is a one dimensional 8 bit unsigned integer array where the individual array elements represent a "read base - reference base" pair packed into one byte. The higher four bits are set by the read base and the lower four bits are set by the reference base as the following:: 0 0 0 0 0 0 0 0 T G C A T G C A For example, "T" and "T" matched alignment will be presented as 0b10001000=136. "T" vs. "G" mismatch will be represented as 0b10000100=132. Insertion of "T" in read will be 0b10000000=128. "No-call" ("N") bases are encoded as 0b1111=15 for both read and reference. In the ``AlnArray`` dataset, the encoded read base should be always the same as what has been observed by the sequencing machine without any complementation. If a read is aligned to the reverse complement strand of the reference sequence, the lower four bits represent the complemented base (i.e., the reference has been complemented). Alignment Array --------------- The example below shows the conversion of an alignment pair to the binary array represented as an integer:: Alignment: Read Bases: ATCTT--ATC-GTTAATTA--A Ref. Bases: A-CTCAGA-CAGTCAATTAGCA Encoded Alignment Pairs: AA -> 17 T- -> 128 CC -> 34 TT -> 136 TC -> 130 -A -> 1 -G -> 4 ... -C -> 2 AA -> 17 The final encoded array for this alignment is [17, 128, 34, 136, 130, 1, 4, 17, 128, 34, 1, 68, 136, 130, 17, 17, 136, 136, 17, 4, 2, 17, 0]. Note that zero is padded at the end of each alignment as a separator between different alignments. This will enable some analysis by simply streaming the alignment array without extra index look-ups to separate different alignments. The alignment array is a concatenation of all encoded alignment arrays of each read and the AlignmentIndex dataset is used to indentify the origin of each alignment. Below is an example of the HDF5 type definition for an AlnArray:: DATASET "AlnArray" { DATATYPE H5T_STD_U8LE DATASPACE SIMPLE { ( * ) / ( H5S_UNLIMITED ) } } Pulse Metrics and QVs ===================== In addition to the basic and required ``AlnArray`` dataset present in each alignment group, pulse metrics and quality values (QVs) may be optionally provided; however if one of these features is provided for one alignment group they must be provided for all alignment groups. These optional datasets are: - ``DeletionQV``, - ``DeletionTag``, - ``InsertionQV``, - ``MergeQV``, - ``SubstitutionQV``, - ``SubstitutionTag``, - ``QualityValue``, - ``IPD``, - ``PulseWidth``, - ``StartFrame``, - ``pkmid`` Each such dataset is of the same shape as the ``AlnArray`` dataset in the same alignment group. Missing values (corresponding to read gaps in the alignment array) are encoded based on the type of the dataset: +------------+---------------+ |Data type |Missing value | | |encoding | +============+===============+ |float32 |NaN | +------------+---------------+ |int8 (char) |'-' (ASCII 42) | +------------+---------------+ |uint8 |255 | +------------+---------------+ |uint16 |65535 | +------------+---------------+ A missing value is present at a dataset offset if and only if that offset corresponds to a read gap in the `AlnArray`. For the types of the pulse metric and QV datasets, see `Summary of Attributes and Datasets`_. Any offset into a pulse metric or QV dataset corresponds to the same offset in the ``AlnArray``. Specification for the ``cmp.h5`` used for automatic data analysis from the instrument ===================================================================================== This section defines the constraints that a cmp.h5 file should satisfy for automatic data analysis for an SpringField instrument. Such files are labeled with a root group attribute ``Version`` of "1.2.0.SF" or "1.3.1.SF". The ``RefGroup/Path`` for 1.2.0.SF and 1.2.0.PB ``cmp.h5`` files has the form of "ref%06d" (C string formatting convention). The original FASTA sequence header should be stored in the ``RefInfo/FullName`` dataset. Additionally, two other datasets are obligatory: ``RefInfo/Length`` and ``RefInfo/MD5``. The default of ``AlnGroup`` partition is to group alignments from the same movie that aligned to the same reference together and we use the movie filename without suffix as the default alignment group name. Specification for the ``cmp.h5`` used for PacBio internal data analysis ======================================================================= In addition to all datasets specified for the standard ``cmp.h5`` the following additional datasets are required in internal files ("1.2.1.PB"): 1. Within the info group named "MovieInfo" containing information about the movies which produced the alignments: - ``Exp``: A uint32 dataset specifying the PacBio LIMS Experiment code associated with each movie in the corresponding ``/MovieInfo/Name`` dataset. - ``Run``: A uint32 dataset specifying the PacBio LIMS Run code associated with each movie in the corresponding ``/MovieInfo/Name`` dataset. Data type and data space definition:: DATASET "/MovieInfo/Exp" { DATATYPE H5T_STD_U32LE DATASPACE SIMPLE { ( 1 ) / ( H5S_UNLIMITED ) } } DATASET "/MovieInfo/Run" { DATATYPE H5T_STD_U32LE DATASPACE SIMPLE { ( 1 ) / ( H5S_UNLIMITED ) } } 2. Within the info group named ``AlnInfo`` containing information about each alignment stored in the file: - ``ZScore``: a float32 dataset containing the alignment significance score ("Z Score") computed from the corresponding row of the ``/AlnInfo/Index`` table. Data type and data space definition:: DATASET "/AlnInfo/ZScore" { DATATYPE H5T_IEEE_F32LE DATASPACE SIMPLE { ( 310 ) / ( H5S_UNLIMITED ) } } 3. In addition to all attributes specified for the standard ``cmp.h5`` the following additional root level attributes are required: .. tabularcolumns:: |p{1in}|L|J|p{3in}| +---------------+-----------------+------------------+-------------------------+ |Attribute name | Type | Sample values |Comment | +===============+=================+==================+=========================+ | | | |Contains the directory | |ReportsFolder | string |"Analysis_Reports"|name of the Primary | | | | |Analysis Reports used for| | | | |this alignment | +---------------+-----------------+------------------+-------------------------+ | | | |Contains the Perforce | |PrimaryPipeline| string | "61453" |changelist number of the | | | | |Primary Analysis Pipeline| | | | |used for this alignment | +---------------+-----------------+------------------+-------------------------+ Sorting, Flattening, Merging, Splitting and Filtering Behaviors =============================================================== Sorting ------- In order to provide fast access to cmp.h5 files, we provide sorted cmp.h5 files. These files have some additional information to quickly retrieve contiguous regions according to an indexing scheme. The most typical use case is to obtain a set of reads overlapping a particular genomic region, where the region can be a single genomic coordinate or ranges of genomic coordinates. Note that by default, *sorting* only entails the sorting of the ``AlnIndex`` dataset, and not the sorting of the alignment data itself. A sorted cmp.h5 file has the following additional items as compared to an unsorted cmp.h5: 1. A dataset ``OffsetTable`` stored within the ``RefGroup`` mapping group giving the offsets of the reads mapped to a reference sequences in the global alignment index. The dataset is a 3 by N unsigned 32 bit unsigned integer array, where N is the total number of reference sequences in the ``RegGroup/ID`` table. The three elements of each row in the array indicate the ``RefID``, ``targetStartOffset``, and ``targetEndOffset``. The ``targetStartOffest`` and ``targetEndOffset`` give the range of the reads in the global ``/AlnInfo/AlnIndex`` that maps to the specific reference sequence in the first column of the dataset. /The presence or absence of the ``OffsetTable`` dataset should be used to determine whether the file is sorted or unsorted./ 2. The alignment index will have two additional columns of unsigned 32-bit integers (these could be shorter) ``nBackRead`` and ``nReadOverlap`` which gives the maximum number of reads one needs to examine to determine overlap and the actual number of reads which overlap a position, respectively. A value of -1 indicates that the field has not been filled in, whereas a value of 0 means that no further reads possibly overlap the position of interest. Here, nBackRead > nReadOverlap is always true. 3. In addition to sorting the ``AlnIndex``, sorting and indexing can perform a "flattening" operation whereby all AlnGroups under each RefGroup are merged into a single AlnGroup. The name of the single AlnGroup can be anything, however, convention is to use the name: "rg-0001" to indicate that the sub-datasets have been merged and re-ordered. Additionally, an attribute on this group: repacked will be set to 1 to indicate, irrespective of the name, that the datasets have been sorted. If the length of any of the child datasets of of a "repacked" alignment group would be greater than 2^32, then additional alignment groups are added serially, e.g., "rg-0002", etc. An alignment will never span more than one alignment group. .. note:: The time complexity of sorting a cmp.h5 file will be on the order of O(n log(n)). Additionally, the columns ``nBackRead`` and ``nReadOverlap`` need to be computed. This will be on the order of O(max(read length) * n). Access to a given start position in cmp.h5 will be O(log(n)), however, this will only produce reads having that start position. In order to obtain all reads overlapping a position, one needs to inspect the ``nBackRead`` to obtain the size of the slice that they should grab from the cmp.h5 file. Retrieval, therefore, is bound by O(nBackRead log(n)). The additional column, ``nReadOverlap``, should allow one to obtain significantly better performance, as the search can stop once to obtained number of reads is equal to ``nReadOverlap``. Merging ------- Merging is performed on a list of cmp.h5 files by selecting the first file to act as the seed and sequentially merging the rest onto the seed. If the first file in the list of files to be merged is empty then the next non-empty file is selected to act as the seed. An exact copy of the seed is made where all ID-type datasets have their entries serialized to consecutive 32 bit integers starting from 1. Merging results modifies a copy of the seed file in place. For each cmp.h5 file in the merging list, the following steps are performed: 1. If the file is empty, its root group Version does not match the seed's Version or does not have the same type of loaded PulseMetrics as the seed, it is removed from the merging list and the next file is considered. 2. Root group attributes are not merged since they are set to the seed's Root group attributes. 3. Datasets under the seed's AlnInfo Data group are extended with their counterparts from the file to be merged. The AlnID column of the newly added rows in the AlnInfo/Index is updated by resetting the old values from the merged file. The new values are set equal to a list of integers starting from the maximum AlnID of the seed + 1, adding 1 for each new AlnID from the merged file. 4. Datasets under the seed's RefInfo, MovieInfo, AlnGroup and RefGroup data groups are extended only with new entries from their counterparts in the file to be merged. If new RefInfo/ID, RefGroup/ID or MovieInfo/ID entries are created, they are mapped back to their respective columns in the AlnInfo/Index. After going through the entire list of files to be merged, the FileLog attribute from the Root group attributes is modified (TBD). Splitting --------- The current splitting behavior is implementation specific and associated with a single use case, i.e., processing of .cmp.h5 files involved in Edna analysis- type workflows. It is our aim to generalize the splitting behavior to accommodate more use cases when those become available. A master cmp.h5 file is split into an N number of cmp.h5 files where N is equal to the number of RefInfo/ID entries in the master file. Consequently, each new cmp.h5 file contains all data associated with a single reference sequence. This is done by: 1. Creating N copies of the master cmp.h5 file and sequentially selecting a RefInfo/ID entry to become the only entry for each copied file, unique amongst the group. 2. Resizing all datasets belonging to AlnInfo, RefInfo, MovieInfo, AlnGroup and RefGroup by deleting all entries that are not associated with the chosen reference sequence. Splitting maintains the values of all ID-type fields and data fields in the AlnInfo/Index rows. 3. Maintaining the size and content of the AlnArray and PulseMetric-type datasets in the new files as the ones in the master. Barcode Information =================== In addition to the afforementioned core alignment information, the cmp.h5 file can be used to store optional datasets containing ``barcode`` annotation on alignments. The pattern leveraged to store this annotation demonstrates a general mechanism to extend the information stored in the cmp.h5 file for downstream applications. In the case of barcoding, we wish to label alignments according to their barcode so that other applications can leverage this information when computing statistics over sets of alignments, e.g., consensus calling within sample. To this end, a parallel dataset to ``/AlnInfo/AlnIndex`` is created. The ``Barcode`` dataset is 32-bit integer matrix with the same number of rows as the ``AlnIndex`` dataset and 5 columns storing scoring and labeling information. The ``Barcode`` dataset contains the total number of barcodes scored for this molecule (``count``), the index of the top-scoring barcode (``index1``), the score of the top-scoring barcode (``score1``), the index of the 2nd-highest scoring barcode (``index2``) and its score (``score2``). These columns are named in the attribute ``ColumnNames`` of the ``Barcode`` dataset. The ``index1`` and ``index2`` are foreign-keys into the ``BarcodeInfo/ID`` dataset. Analagous to the other *Info datasets, the ``BarcodeInfo/ID`` and ``BarcodeInfo/Name`` are used to retrieve the human-readable name of the barcode. Summary of Attributes and Datasets ================================== Versions prior to 2.0.0 are described in the Appendices. **File Version 2.0.0 contents:** +------------+------+--------------------+----------+-------+-----------+ |Parent Group| HDF5 |Resource Name |Data type | Shape| | | | data | | | | | +============+======+====================+==========+=======+===========+ |/ |ATTR |CommandLine |VLEN_STR | None| required | +------------+------+--------------------+----------+-------+-----------+ |/ |ATTR |Index |VLEN_STR | (3,)| optional | +------------+------+--------------------+----------+-------+-----------+ |/ |ATTR |ReadType |VLEN_STR | None| required | +------------+------+--------------------+----------+-------+-----------+ |/ |ATTR |Version |VLEN_STR | None| required | +------------+------+--------------------+----------+-------+-----------+ |/AlnGroup |DS |ID |uint32 | 1| required | +------------+------+--------------------+----------+-------+-----------+ |/AlnGroup |DS |Path |VLEN_STR | 1| required | +------------+------+--------------------+----------+-------+-----------+ |/AlnInfo |DS |AlnIndex |uint32 | 22| required | +------------+------+--------------------+----------+-------+-----------+ |/AlnInfo |ATTR |ColumnNames |VLEN_STR | 22| required | +------------+------+--------------------+----------+-------+-----------+ |/FileLog |DS |CommandLine |VLEN_STR | 1| required | +------------+------+--------------------+----------+-------+-----------+ |/FileLog |DS |ID |uint32 | 1| required | +------------+------+--------------------+----------+-------+-----------+ |/FileLog |DS |Log |VLEN_STR | 1| required | +------------+------+--------------------+----------+-------+-----------+ |/FileLog |DS |Program |VLEN_STR | 1| required | +------------+------+--------------------+----------+-------+-----------+ |/FileLog |DS |Timestamp |VLEN_STR | 1| required | +------------+------+--------------------+----------+-------+-----------+ |/FileLog |DS |Version |VLEN_STR | 1| required | +------------+------+--------------------+----------+-------+-----------+ |/MovieInfo |DS |ID |uint32 | 1| required | +------------+------+--------------------+----------+-------+-----------+ |/MovieInfo |DS |Name |VLEN_STR | 1| required | +------------+------+--------------------+----------+-------+-----------+ |/MovieInfo |DS |FrameRate |float32 | 1| required | +------------+------+--------------------+----------+-------+-----------+ |/MovieInfo |DS |SequencingChemistry |VLEN_STR | 1| required | +------------+------+--------------------+----------+-------+-----------+ |/ref*/* |DS |AlnArray |uint8 | 1| required | +------------+------+--------------------+----------+-------+-----------+ |/ref*/* |DS |QualityValue |uint8 | 1| optional | +------------+------+--------------------+----------+-------+-----------+ |/ref*/* |DS |DeletionQV |uint8 | 1| optional | +------------+------+--------------------+----------+-------+-----------+ |/ref*/* |DS |InsertionQV |uint8 | 1| optional | +------------+------+--------------------+----------+-------+-----------+ |/ref*/* |DS |MergeQV |uint8 | 1| optional | +------------+------+--------------------+----------+-------+-----------+ |/ref*/* |DS |SubstitutionQV |uint8 | 1| optional | +------------+------+--------------------+----------+-------+-----------+ |/ref*/* |DS |SubstitutionTag |char | 1| optional | +------------+------+--------------------+----------+-------+-----------+ |/ref*/* |DS |DeletionTag |char | 1| optional | +------------+------+--------------------+----------+-------+-----------+ |/ref*/* |DS |IPD |uint16 | 1| optional | +------------+------+--------------------+----------+-------+-----------+ |/ref*/* |DS |PulseWidth |uint16 | 1| optional | +------------+------+--------------------+----------+-------+-----------+ |/ref*/* |DS |PulseIndex |uint32 | 1| optional | +------------+------+--------------------+----------+-------+-----------+ |/RefGroup |DS |ID |uint32 | 1| required | +------------+------+--------------------+----------+-------+-----------+ |/RefGroup |DS |OffsetTable |uint32 | 3| optional | +------------+------+--------------------+----------+-------+-----------+ |/RefGroup |DS |Path |VLEN_STR | 1| required | +------------+------+--------------------+----------+-------+-----------+ |/RefGroup |DS |RefInfoID |uint32 | 1| required | +------------+------+--------------------+----------+-------+-----------+ |/RefInfo |DS |FullName |VLEN_STR | 1| required | +------------+------+--------------------+----------+-------+-----------+ |/RefInfo |DS |ID |uint32 | 1| required | +------------+------+--------------------+----------+-------+-----------+ |/RefInfo |DS |Length |uint32 | 1| required | +------------+------+--------------------+----------+-------+-----------+ |/RefInfo |DS |MD5 |VLEN_STR | 1| required | +------------+------+--------------------+----------+-------+-----------+ |/BarcodeInfo|DS |ID |uint32 | 1| optional | +------------+------+--------------------+----------+-------+-----------+ |/BarcodeInfo|DS |ID |uint32 | 1| required | +------------+------+--------------------+----------+-------+-----------+ |/BarcodeInfo|DS |Name |VLEN_STR | 1| required | +------------+------+--------------------+----------+-------+-----------+ .. _HDF5 DDL: http://www.hdfgroup.org/HDF5/doc/ddl.html .. _W3C compatible timestamp: http://www.w3.org/TR/NOTE-datetime