Background:

Current Next-Generation Sequencing (NGS) platforms employ massively parallel, automatable sequencing approaches designed for maximum output and efficiency. With vastly improved speed and efficiency, modern NGS platforms have enabled an entirely new paradigm for genomic research, with new applications in human disease, small-genome species, metagenomics, consumer testing and more. Early NGS innovations (e.g., Roche/454, SOLiD) have been outpaced by the more recent platforms of Illumina, PacBio and Ion Torrent. Because of the different chemistries and detection strategies employed, these platforms exhibit varied performance parameters such as throughput, read length, error rates and cost per run. Larger NGS studies may benefit from the higher throughput and accuracy of Illumina, de novo sequencing may perform better with the longer read length allowed by PacBio, and smaller laboratories may benefit from the cost-efficiency of Ion Torrent [1]. In this post, we focus on the Illumina platform, describing how each step in the process affects quality of results.

Illumina Overview

While many NGS innovations have driven the rapid growth of the NGS industry, no platform has had a greater impact on the field than Illumina, whose instruments include the Genome Analyzer (acquired from Solexa), HiSeq and MiSeq. Illumina systems use an approach described as Sequencing-by-Synthesis (SBS). Sample libraries are created by digestion of the sample DNA, followed by ligation of Illumina-specific adapters that allow capture and amplification of localized clusters suitable for imaging during each reaction cycle [2]. The output of a sequencing run is a BCL file, which is typically converted to a FASTQ file: a list of reads, each with quality scores expressing the confidence in its base calls (more below).

Paired-end (PE) sequencing is typically employed with Illumina: both ends of each fragment are sequenced, and the forward and reverse reads are then aligned as read pairs. This not only doubles the number of reads without additional labor, but also improves accuracy, alignment and indel detection.

Another common strategy used on the Illumina platform for improving throughput and efficiency is Multiplexing, whereby libraries are pooled into a single sample using indexing adapters that allow sorting of the libraries after sequencing. This approach reduces repetitive labor and costs, and leverages computing power to delineate the downstream data.

Illumina’s strengths include high output and accuracy with low cost per run. These advantages have no doubt contributed to the platform’s dominant share of more than 70% of the NGS market [2]. However, the platform is not without its weaknesses, and knowledge of common problems and solutions can improve your NGS results. Illumina produces shorter reads than PacBio, a parameter that becomes more important when there is no reference genome [3]. Illumina’s requisite cluster position determination struggles with low-diversity samples, such as libraries created from amplicons or specific restriction digestion. Illumina also has problems resolving genome regions with a large number of repeats, GC-rich content or genes with multiple homologous regions [3].

Illumina Process and Pitfalls

The Illumina NGS process involves DNA extraction and purification, Library Preparation, Cluster Generation, Sequencing and Data Analysis. Below we describe Library Preparation through Sequencing; future blog posts will cover DNA extraction and data analysis.

Library Preparation

Library Preparation involves digesting the sample DNA into fragments of ~200-800 base pairs, followed by ligation of Illumina-specific adapters to each fragment to allow capture on the flow cell for sequencing. Libraries created by random fragmentation of genomic DNA will ideally yield a sample with equal proportions of the four nucleotides (A, C, G, T) at each position. This allows downstream clusters to be easily distinguished by the software [3].

Tips on Multiplexing

Multiplexing of libraries is also done at this first stage by incorporating an index sequence into each fragment, then pooling the libraries to be sequenced in parallel with one another. It is important that multiplexing sequences are compatible with the chosen platform, so often it may be best to leave this step to the laboratory performing the actual sequencing.
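
To make the idea of sorting pooled libraries after sequencing concrete, here is a minimal sketch that bins reads by exact index match. The library names, index sequences and the `demultiplex` function are hypothetical and only for illustration; real demultiplexing software also tolerates index mismatches and handles dual indexes.

```python
from collections import defaultdict

# Hypothetical sample sheet: index sequence -> library name.
SAMPLE_INDEXES = {
    "ATCACG": "library_A",
    "CGATGT": "library_B",
}


def demultiplex(reads, sample_indexes):
    """Group (index, sequence) pairs by library using exact index matches.

    Reads whose index is not in the sample sheet go to "undetermined".
    """
    bins = defaultdict(list)
    for index, sequence in reads:
        library = sample_indexes.get(index, "undetermined")
        bins[library].append(sequence)
    return dict(bins)


# Toy pooled run: each read carries its index read alongside its sequence.
pooled_reads = [
    ("ATCACG", "GATTACAGATTACA"),
    ("CGATGT", "TTGACCATGGCAAT"),
    ("ATCACG", "CCGTAAGGCTAGCT"),
    ("NNNNNN", "ACGTACGTACGTAC"),
]

demultiplexed = demultiplex(pooled_reads, SAMPLE_INDEXES)
for library, sequences in sorted(demultiplexed.items()):
    print(library, len(sequences))
```

Exact matching keeps the sketch short; the cost is that any read with a sequencing error in its index ends up in the undetermined bin.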

Cluster Generation

Once the libraries are pooled, the next step is Cluster Generation, whereby the captured fragments are amplified into clonal clusters. The Illumina software uses the first four cycles of sequencing to determine cluster positions on a tile, based on the uniqueness of each sequence. One of the key problems that may arise in these first steps is a Low-Diversity Sample, which can lead to lower yields and lower quality scores. A key metric for the quality of the resulting base calls is the Q30 score (see below).

Tips on Low Diversity Samples

Libraries created through amplicon generation or specific restriction digest (as opposed to random fragmentation) may have a non-random distribution of the initial bases. With such Low-Diversity Samples [3], the Illumina software has trouble differentiating clonal clusters, leading to higher phasing numbers (defined below) and rapid fall-off. Low-Diversity Sample problems are typically remedied using a couple of approaches:

  • Spiking in a higher-diversity sample is one way to improve the distribution of the nucleotides, allowing the software to distinguish different clusters more effectively. The spike-in can make up as much as 50% of the sample, although this reduces the number of relevant reads by half [3].
  • Another approach is to incorporate diverse amplicon primers with randomly distributed nucleotides upstream from the target. This method will reduce the read length by the number of bases in the primer, which can be problematic for de novo sequencing where longer read lengths are beneficial.

Phasing number is the rate at which molecules in a cluster lose sync with each other: pre-phasing is a molecule jumping one base ahead, and phasing is a molecule falling one base behind. The higher the phasing number, the lower the signal-to-noise ratio, so a low phasing number is desired.
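
One way to see why amplicon libraries cause trouble is to look at the base composition of the first few cycles. The sketch below is a rough illustration only: the reads are made up, the function names are hypothetical, and the 0.5 cutoff is an arbitrary threshold chosen for the example, not an Illumina specification.

```python
from collections import Counter


def base_composition(reads, n_cycles=4):
    """Return, for each of the first n_cycles, the fraction of A, C, G and T."""
    composition = []
    for cycle in range(n_cycles):
        counts = Counter(read[cycle] for read in reads if len(read) > cycle)
        total = sum(counts[base] for base in "ACGT")
        fractions = {
            base: (counts[base] / total if total else 0.0) for base in "ACGT"
        }
        composition.append(fractions)
    return composition


def low_diversity_cycles(composition, max_fraction=0.5):
    """List the 1-based cycles where a single base exceeds max_fraction.

    The default cutoff is an illustrative assumption, not a vendor value.
    """
    return [
        cycle + 1
        for cycle, fractions in enumerate(composition)
        if max(fractions.values()) > max_fraction
    ]


# Made-up reads: amplicons share the same leading bases, random fragments do not.
amplicon_reads = ["ACGTTGCA", "ACGTAGGC", "ACGTCCTA", "ACGAGTCA"]
random_reads = ["ACGTTGCA", "CGTAAGGC", "GTACCCTA", "TACGGTCA"]

print(low_diversity_cycles(base_composition(amplicon_reads)))
print(low_diversity_cycles(base_composition(random_reads)))
```

In this toy data every one of the first four cycles is flagged for the amplicon reads and none for the randomly fragmented reads, which is the imbalance that spike-ins and diverse primers are meant to correct.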

Sequencing and Run Quality Metrics

A key feature of the Illumina platform is its Real-time Analysis (RTA) Software, which operates during a sequencing run to perform base calls and quality scoring. Performing quality scoring after each chemistry and imaging cycle reduces downstream processing and improves base calling. Understanding the success and failure metrics used by the Illumina system can help improve your results.

Illumina Run Quality Metrics

  • Chastity - Illumina software applies an internal quality filtering algorithm called a chastity filter, where Chastity = (brightest base intensity) / [(brightest base intensity) + (second brightest base intensity)].
  • PF - Sequencing reads that pass the chastity filter are defined as “pass filter,” or just PF. A read passes if no more than 1 base call has a Chastity value below 0.6 in the first 25 cycles.
  • Q scores and Q30 - Q scores estimate the probability that a base is called incorrectly: Q = -10 log10(e), where e is the estimated probability that the base call is wrong. Q30 is the threshold at which the Q score is 30, implying a base call accuracy of 99.9% [4].
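
The three metrics above can be written out as a short sketch. This follows the formulas as stated in this post; the function names and the intensity values are made up for illustration, and the code is not a description of how RTA is actually implemented.

```python
import math


def chastity(intensities):
    """Chastity = brightest / (brightest + second brightest) intensity.

    Assumes four non-negative channel intensities, at least one of them > 0.
    """
    brightest, second = sorted(intensities, reverse=True)[:2]
    return brightest / (brightest + second)


def passes_filter(cycle_intensities, threshold=0.6, n_cycles=25, max_failures=1):
    """PF: no more than one of the first 25 cycles has chastity below 0.6."""
    failures = sum(
        1
        for intensities in cycle_intensities[:n_cycles]
        if chastity(intensities) < threshold
    )
    return failures <= max_failures


def phred_q(error_probability):
    """Q = -10 * log10(e), where e is the probability of a wrong base call."""
    return -10 * math.log10(error_probability)


def error_from_q(q):
    """Inverse of phred_q: e = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# Made-up intensities for the four channels (A, C, G, T) in a single cycle.
clean_cycle = (900, 50, 30, 20)   # one channel clearly dominates
mixed_cycle = (500, 450, 20, 10)  # two channels compete

good_cluster = [clean_cycle] * 25
bad_cluster = [mixed_cycle] * 2 + [clean_cycle] * 23

print(round(chastity(clean_cycle), 3), round(chastity(mixed_cycle), 3))
print(passes_filter(good_cluster), passes_filter(bad_cluster))
print(round(phred_q(0.001), 1), round(1 - error_from_q(30), 4))
```

The last line ties the formula back to the text: an error probability of 0.001 gives Q30, and Q30 corresponds to a base call accuracy of 99.9%.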

Making Sense of FASTQ files

Sequencing data output is massive in scope, and the NGS industry has standardized on a file format for storing and reading such big data. Understanding the language of FASTQ is important for downstream processing of data, including mapping to a reference genome.

What is FASTQ?

FASTQ is a text format with 4 lines of data per sequence:

  1. Sequence identifier: contains the sequence ID. For Illumina machines, it indicates a unique instrument name, a flowcell lane number, a tile number within the flowcell lane, the x and y coordinates of the cluster in the tile, an index sequence (if multiplexing), which member of a read pair the read is, and a Y/N flag for read filtering (Y means the read did not pass filtering).
  2. The sequence
  3. Quality score identifier line, which begins with “+”. You wouldn't do anything with this. I know, boring.
  4. Phred quality scores, represented by ASCII encoding with one character per base. The Phred quality score is used multiple times in next-generation sequencing, and we will encounter it again in read mapping and variant calling, so you should bite the bullet and learn about logs.
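
The four-line layout can be read with a few lines of Python. This is a minimal sketch under two assumptions that are not stated in this post: the identifier follows the commonly seen `@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index` layout, and quality characters use an ASCII offset of 33. The example record and the function names are invented for illustration.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) from a 4-line-per-record FASTQ.

    Stops at end of file or at the first empty identifier line.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch: {header!r}")
        yield header[1:], sequence, quality


def parse_illumina_id(identifier):
    """Split an identifier in the assumed Illumina layout into named fields."""
    location, _, flags = identifier.partition(" ")
    instrument, _run, _flowcell, lane, tile, x, y = location.split(":")
    read_number, is_filtered, _control, index = flags.split(":")
    return {
        "instrument": instrument,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "read_number": int(read_number),
        "failed_filter": is_filtered == "Y",  # Y = did not pass filtering
        "index": index,
    }


def phred_scores(quality, offset=33):
    """Decode an ASCII quality string into integer Phred scores."""
    return [ord(char) - offset for char in quality]


# Invented record, not taken from a real run.
EXAMPLE = (
    "@SIM01:1:FCX:1:15:6329:1045 1:N:0:ATCACG\n"
    "GATTACAG\n"
    "+\n"
    "IIII5##!\n"
)

records = list(read_fastq(io.StringIO(EXAMPLE)))
identifier, sequence, quality = records[0]
fields = parse_illumina_id(identifier)
scores = phred_scores(quality)
print(fields)
print(sequence, scores)
```

For real data an established parser is usually the safer choice, since FASTQ files in the wild vary in header layout and, for older runs, in quality offset.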

In the next post I will explain how to evaluate FASTQ files when analyzing data for human Whole Genome Sequencing (WGS), Exome Sequencing and metagenomic sequencing.

References:

  1. Michael A Quail et al. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics 2012, 13:341. Published 24 July 2012.
  2. http://www.illumina.com/content/dam/illumina-marketing/documents/products/illumina_sequencing_introduction.pdf
  3. http://www.illumina.com/documents/products/technotes/technote-hiseq-low-diversity.pdf
  4. http://www.illumina.com/content/dam/illumina-marketing/documents/products/technotes/technote_understanding_quality_scores.pdf