Overview:

Pacific Biosciences builds a platform that sequences long reads in the range of 1,000-60,000 bp significantly faster than short-read sequencing. It uses Single Molecule Real-Time (SMRT) sequencing in a SMRT cell: each Zero Mode Waveguide (ZMW) is a well smaller than the wavelength of light, which captures the emission of a fluorophore as it is cleaved off a dNTP while a polymerase incorporates the nucleotide into a growing DNA strand. The real-time dynamics allow monitoring of the time between two base incorporations, which is informative about base modifications such as m6A and m4C. The technology, however, is hindered by a high error rate, low throughput and a higher per-base cost. Let's dive into the chemistry so we can explain some parameters [1].

pacbio_cartoon

The DNA template and sequence reading process:

The library, or template, looks like a bell and is called a SMRTbell. Hairpin adapters are ligated to both ends of the template DNA, and a polymerase can bind to either hairpin. The polymerase is immobilized at the bottom of the well. Each well produces a movie (0.5-4 h) of light emissions, with each fluorophore emitting light at a different wavelength. The movie is interpreted into a Continuous Long Read (CLR): once the polymerase finishes the strand, it keeps going around the SMRTbell until it falls off, so the template is read several times. The passes over the same template are separated by the adapter sequences. Each individual pass is called a subread, and the consensus of all subreads is called a Circular Consensus Sequence (CCS). Some templates are too long to yield a CCS and only have one or two subreads. The real-time nature of the process is used to determine the kinetics of the polymerase on the template; different rates can reveal modifications of the template such as methylation [2].

The circular nature of the SMRTbell DNA template allows the polymerase to sequence the same DNA molecule in multiple passes. This produces high intra-molecular consensus accuracy [2].
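To make the relationship between a continuous read and its subreads concrete, here is a toy sketch in Python. The adapter string is a made-up placeholder (not the real SMRTbell adapter sequence), and exact string matching is a simplification: real reads contain errors, so adapters are found by alignment, and consecutive passes alternate between the two strands of the insert.

```python
# Hypothetical placeholder adapter, chosen only for illustration.
ADAPTER = "CCCGGGTTT"

def split_subreads(polymerase_read, adapter=ADAPTER):
    """Split a read on exact adapter matches and drop empty pieces."""
    return [piece for piece in polymerase_read.split(adapter) if piece]

# Toy read: forward pass, adapter, reverse-complement pass, adapter, partial pass.
toy_read = "GATTACA" + ADAPTER + "TGTAATC" + ADAPTER + "GATT"
subreads = split_subreads(toy_read)
print(subreads)
```

In this toy read the last piece is a partial pass, which is what happens when the polymerase falls off mid-template.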

Read length:

Read lengths range from 1 to 60 kb and follow a Poisson-like distribution whose lambda depends on the chemistry. The tradeoff in SMRT cells is between long and short DNA fragments: shorter fragments enter ZMW wells more frequently than longer ones, which decreases lambda (and the N50). To increase lambda, a size-selection step can be introduced before loading DNA into the SMRT cell.
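For readers unfamiliar with N50, the following Python sketch computes it and shows how removing short fragments can raise it. The read lengths and the 5,000 bp cutoff are invented for illustration and do not come from any real run.

```python
def n50(lengths):
    """Return the length L such that reads of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50() needs at least one read with non-zero length")

# Made-up library: many short fragments plus a few long ones.
reads = [1000] * 20 + [5000, 10000, 20000]
print("N50 before size selection:", n50(reads))

# Size selection modeled as a simple length cutoff (hypothetical value).
selected = [length for length in reads if length >= 5000]
print("N50 after size selection:", n50(selected))
```

With these invented numbers the N50 moves from 10,000 to 20,000 bp once the short fragments are removed.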

reads/readlen

Based on data from a 20 kb size-selected human library using 4-hour movies. The average read length for the data set shown is 18 kb.

The longest Illumina reads come from the MiSeq with V3 chemistry, which yields 25 million 2 x 300 bp reads. The quality of these reads drops towards the 3' end of the read.

The factors that govern the yield of successful reactions are:

  1. The number of polymerases bound in a ZMW, which can be 0, 1, or 2. Only data from ZMWs with 1 polymerase should be used.
  2. The fraction of ZMWs that a single DNA template enters, around 23-46% [3].
  3. The throughput, which is 0.5-1 gigabases per SMRT cell [4].
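The occupancy states in the list above (empty, single, multiple) can be illustrated with a Poisson loading model. This model is an assumption made here for illustration; the cited sources do not state it, and observed loading may deviate from it. The `lam` parameter is a hypothetical mean number of loaded complexes per ZMW.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events under a Poisson distribution with mean lam."""
    return lam ** k * exp(-lam) / factorial(k)

def zmw_occupancy(lam):
    """Fractions of ZMWs that are empty, singly loaded, or multiply loaded."""
    empty = poisson_pmf(0, lam)
    single = poisson_pmf(1, lam)
    return {"empty": empty, "single": single, "multiple": 1.0 - empty - single}

for lam in (0.3, 0.5, 1.0, 1.5):
    fractions = zmw_occupancy(lam)
    print(lam, {state: round(value, 3) for state, value in fractions.items()})
```

Under this idealized assumption, loading more DNA eventually stops helping, because extra templates turn singly loaded wells into multiply loaded ones.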

An Illumina HiSeq X can generate up to 900 gigabases per flow cell, with 30% of reads passing quality filters. A HiSeq 2500 produces ~8 billion 125 bp paired-end reads over a 6-day run (16 lanes), a 165-330-fold difference relative to the RSII.
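The text does not show how the fold difference was derived. One reading that approximately reproduces the stated range, offered here as an assumption, compares HiSeq 2500 output per day of the run with the 0.5-1 gigabases of a single SMRT cell:

```python
def fold_difference(reads=8e9, read_length_bp=125, run_days=6, smrt_cell_gb=(0.5, 1.0)):
    """Return (low, high) fold difference of HiSeq 2500 Gb per day vs. Gb per SMRT cell."""
    hiseq_gb_per_day = reads * read_length_bp / 1e9 / run_days
    low = hiseq_gb_per_day / max(smrt_cell_gb)
    high = hiseq_gb_per_day / min(smrt_cell_gb)
    return low, high

low, high = fold_difference()
print(f"fold difference: {low:.0f} - {high:.0f} X")
```

Comparing whole runs rather than per-day output would give a different ratio, so the figure depends on which normalization was intended.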

Base quality measurement:

Base quality depends on the read type and on how many subreads support each base:

  1. CCS: shorter templates have a higher probability of being sequenced more than once. Longer templates have fewer subreads in their CCS and therefore less evidence for the base quality.
  2. CLR: the error rate of a CLR is typically 11-15% and depends on the number of subreads.
  3. Aligning multiple subreads helps validate the base calls in overlapping regions. Errors are distributed randomly, so the more subreads cover a base, the higher the confidence in its call.
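The effect of extra passes can be sketched with a simple majority-vote model in Python. It assumes every pass errs independently at the same rate and that a base is correct when more than half of the passes agree on it. This is a deliberate simplification: real consensus callers handle insertions and deletions and use quality values, and the 13% rate is just a value picked from the 11-15% range quoted above.

```python
from math import comb

def majority_vote_accuracy(n_passes, error_rate):
    """Probability that more than half of n independent passes call a base correctly."""
    p_correct = 1.0 - error_rate
    needed = n_passes // 2 + 1
    return sum(
        comb(n_passes, k) * p_correct ** k * error_rate ** (n_passes - k)
        for k in range(needed, n_passes + 1)
    )

# Odd pass counts avoid ties in the vote.
for n in (1, 3, 5, 9, 15):
    print(n, round(majority_vote_accuracy(n, 0.13), 5))
```

With one pass the accuracy equals the raw 87%, and it rises with every additional pair of passes, which is why short templates with many subreads yield the most confident CCS calls.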

Polymerase kinetics and methylation:

The PacBio platform takes advantage of the interpulse duration (IPD), the time between two successive base incorporations. Polymerization typically proceeds at about 3 bases per second, so delays reflect differences in base chemistry. Neither an annotated or complete reference genome nor a separate sequencing run is needed to identify methylation:

pacbio_methylation

DNA polymerization runs freely at ~3 bases/second. Alteration of this rate due to the incorporation of nucleotides across modified bases is detected and used to infer the presence of bases other than A, C, T or G. This information is automatically generated and processed during every run.
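As a rough illustration of how a slowed incorporation could be spotted, the Python sketch below flags positions whose IPD is much longer than the median IPD of the read. The threshold of 3x and the IPD values are invented for the example, and this is not how PacBio's software makes the call; production tools compare against a control or a kinetic model and combine evidence from many reads.

```python
from statistics import median

def flag_slow_positions(ipds, fold=3.0):
    """Return indices whose interpulse duration exceeds fold times the median IPD."""
    baseline = median(ipds)
    return [i for i, ipd in enumerate(ipds) if ipd > fold * baseline]

# Made-up IPDs in seconds; position 3 shows a pronounced delay.
ipds = [0.30, 0.35, 0.28, 1.50, 0.33, 0.31, 0.30]
print(flag_slow_positions(ipds))
```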

The Illumina platform can detect methylation using a number of techniques. The main difference between RSII chemistry and Illumina is that PacBio methylation detection is a byproduct of a sequencing run, whereas detecting methylation on the Illumina platform requires a dedicated method involving both DNA preparation and downstream bioinformatics analysis.

References:

  1. http://dx.doi.org/10.1016/j.gpb.2015.08.002
  2. K. Travers, C.S. Chin, D. Rank, J. Eid, S. Turner. A flexible and efficient template format for circular consensus sequencing and SNP detection. Nucleic Acids Res, 38 (2010), p. e159
  3. http://dnatech.genomecenter.ucdavis.edu/pacific-biosciences-rs
  4. http://www.pacb.com/smrt-science/smrt-sequencing/read-lengths/