The goal of the modern genomics researcher is not simply to build genomics tools or even to gather genomic data. Instead, the ultimate objective is to further scientific knowledge to better understand human disease, plant and animal genetics, microbial ecology or evolution. The genomics researcher or bioinformatician seeks a deeper understanding of the science; better research tools are only a means to this end. A number of complex steps are involved in genomic sequencing, including platform or vendor selection, sample preparation, sequencing, data management and bioinformatics. Each of these individual processes bears its own learning curve, and a failure at any step could cripple downstream results. Next Generation Sequencing (NGS) innovations have lowered a number of barriers that once stood between the researcher and the creation of new genomic knowledge. Because of these breakthroughs, the bottleneck has shifted from the sequencing step to other aspects of the process. Simplifying the steps of NGS platform selection, sample preparation, and data handling allows the researcher to focus on analysis and discovery; to focus on science, rather than on the tools of science.
Focus on Science, not Tools
Whether in business, manufacturing or scientific research, success is made possible by alignment of resources toward an identified goal. Each step in a process takes time and money, and a wrong step wastes these resources without achieving results. The most efficient assembly line in a manufacturing plant optimizes each production step to eliminate bottlenecks, leaving critical or tedious steps only to the most skilled workers. This maximizes the efficiency, speed, and quality with which a product is made. For the genomics researcher, data is only the raw material; knowledge is the product being created. Therefore, tools that simplify front-end or back-end steps free the researcher to focus on knowledge output, such as discovery, learning or creative thinking. NGS has already made the cutting-edge technologies of the prior decade more mainstream and affordable, even for small research laboratories. The NIH National Human Genome Research Institute has reported a faster-than-Moore's-Law decline in genetic analysis costs since the year 2000. In other words, the revolutionary acceleration of output potential for the computing and telecommunications industries has been outpaced by that of the current genomics industry.
Moore's Law predicts that the number of transistors on an integrated circuit will double roughly every two years. This bold prediction has become a metric for success in the computer industry and beyond, a benchmark that innovations in NGS have surpassed for the past 15 years. Exponential improvements in the efficiency of NGS technology are driving new opportunities in the human, AgBio, and consumer genomics industries.
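To put the Moore's Law benchmark into numbers, here is a minimal sketch that assumes the common reading of the law as a doubling every two years. The function names and the starting cost are hypothetical and chosen only for illustration; they are not NHGRI figures.

```python
def moore_fold_improvement(years: float, doubling_period_years: float = 2.0) -> float:
    """Fold improvement expected if performance doubles once per period."""
    return 2.0 ** (years / doubling_period_years)


def moore_projected_cost(initial_cost: float, years: float,
                         doubling_period_years: float = 2.0) -> float:
    """Cost after `years` if cost halves once per doubling period."""
    return initial_cost / moore_fold_improvement(years, doubling_period_years)


if __name__ == "__main__":
    # Over 15 years, a two-year doubling gives 2 ** 7.5, roughly a 181-fold change.
    print(f"Fold improvement over 15 years: {moore_fold_improvement(15):.0f}")
    # Hypothetical starting cost of 1,000,000 units, for illustration only.
    print(f"Projected cost after 15 years: {moore_projected_cost(1_000_000, 15):,.0f}")
```

Any technology whose cost has fallen by more than about 181-fold over those 15 years has outpaced this benchmark.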
Next-generation Sequencing is a process beyond the physical sequencing
Despite innovations in technologies, instrumentation, and analytics, NGS still involves a number of complex steps that require deft scientific skills in order to be successful. Although much of the downstream work is digital, existing as strings of nucleotides that reside virtually within the computing cluster, the initial sequencing steps are confined to the physical world, where they are subject to human error. Genetic material from a biological sample must first be isolated and prepared in the laboratory. Even at this initial step, many strategic questions must be addressed, each having downstream impacts on turnaround, costs, and quality.
Which platform is best for my needs? Does my local core lab have this expertise? How do I prepare a sample library for the chosen platform?
Vendor selection - More Options, More Questions...
- Individual sequencing vendors often have only one platform or specialty. With rapid advances in NGS instrumentation, researchers have come to understand the importance of outsourcing NGS projects rather than buying instrumentation that quickly becomes obsolete. However, loyalty to a single NGS service such as the local core facility may put you right back in the same situation. Your chosen service provider must remain current in order for YOU to remain current.
- Not all sequencers are the same. As NGS innovations abound, NGS applications are also expanding rapidly. Human applications in oncology, forensics or consumer genetics have certain requirements, while projects for small-genome species, microbiomics or agrigenomics have different needs. Throughput is a key parameter, with larger projects requiring the highest-throughput instruments. The read length of a sequencer also has important ramifications for efficiency and quality, with de novo sequencing benefiting from longer reads. Other sequencing instrument parameters to consider include total output per run, coverage and coverage uniformity, and price per run.
- Sequencers are more cost-efficient when running at full capacity. Many of the reagent and consumable costs are fixed for a sequencing run, so running a single lane on an eight-lane flow cell, or otherwise not filling up the flow cell of an instrument, is wasteful. NGS vendors with lower volume might wait until enough orders are received to run at full capacity. This delay, of course, affects turnaround times. Filling spot vacancies, much like filling empty seats on a plane, reduces per-sample costs for the operator and therefore provides opportunities for better pricing for the researcher.
- Simply gathering price quotes is complicated. The many types of NGS platforms and projects, coupled with rapid expansion in both these areas, have made pricing difficult to standardize. A single vendor will quote its best price based on your project and on the platform it has available, but it is not compelled to educate you about more cost-effective approaches from other vendors or on other platforms. Gathering and comparing quotes, and converting units of throughput, read length, coverage, turnaround, quality, and price, can be daunting.
- Pilot studies are a resource drain. Due to complexities in vendor selection, researchers may resort to a pilot study, which bears its own costs and problems. One researcher described the issue: "How much time do you spend doing pilots and drafting statements of work? The analysis even on pilots is time-consuming, so you have to balance between benchmarking and getting actual work done."
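The points above about full-capacity runs and quote comparison come down to simple arithmetic. The sketch below is one possible way to express it, assuming the run cost is entirely fixed and that quotes can be reduced to a price per gigabase. All names, prices, and read counts are hypothetical.

```python
from dataclasses import dataclass


def cost_per_sample(fixed_run_cost: float, lanes_total: int,
                    lanes_used: int, samples_per_lane: int) -> float:
    """Per-sample cost when a fixed run cost is spread over the lanes actually used."""
    if not 0 < lanes_used <= lanes_total:
        raise ValueError("lanes_used must be between 1 and lanes_total")
    return fixed_run_cost / (lanes_used * samples_per_lane)


@dataclass
class Quote:
    vendor: str            # hypothetical vendor label
    price: float           # total quoted price
    reads_millions: float  # reads delivered, in millions
    read_length_bp: int    # bases per read

    def gigabases(self) -> float:
        """Total sequence output in gigabases."""
        return self.reads_millions * 1e6 * self.read_length_bp / 1e9

    def price_per_gb(self) -> float:
        """Quoted price normalized to one gigabase of output."""
        return self.price / self.gigabases()


if __name__ == "__main__":
    # Same hypothetical run, one lane filled versus all eight lanes filled.
    print(cost_per_sample(8000, lanes_total=8, lanes_used=1, samples_per_lane=4))
    print(cost_per_sample(8000, lanes_total=8, lanes_used=8, samples_per_lane=4))

    quotes = [Quote("Vendor A", 1500, 300, 100), Quote("Vendor B", 2000, 200, 250)]
    for q in sorted(quotes, key=Quote.price_per_gb):
        print(q.vendor, round(q.price_per_gb(), 2))
```

Price per gigabase is only one axis of comparison; turnaround, read length, and quality still have to be weighed separately.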
Vendor selection questions that must be addressed:
- What is the NGS platform best suited for my goals?
- Does my local core facility excel at this?
- Does another NGS sequencing facility have more expertise?
- Do they currently have the capacity for my project?
- Can they meet my turnaround needs?
- Should I perform a pilot study first?
- How do I engage a new vendor?
- How do I compare pricing between the many NGS vendors?
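Several of these questions depend on how much sequence a project actually needs. A rough estimate uses the standard relation: average coverage = number of reads × read length / genome size. The sketch below applies it. The function names and the example genome size are hypothetical, and the estimate ignores duplicates, unmappable reads, and uneven coverage.

```python
import math


def expected_coverage(num_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average depth of coverage: total sequenced bases divided by genome size."""
    return num_reads * read_length_bp / genome_size_bp


def reads_needed(target_coverage: float, read_length_bp: int, genome_size_bp: int) -> int:
    """Minimum number of reads to reach the target average coverage."""
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


if __name__ == "__main__":
    # Hypothetical 5 Mb genome, 150 bp reads, 30x average coverage target.
    n = reads_needed(30, 150, 5_000_000)
    print(n)
    print(expected_coverage(n, 150, 5_000_000))
```

Comparing the required read count with an instrument's output per run gives a first indication of which platforms are a sensible fit.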
Once an NGS platform and vendor have been selected, a genomic sample must be carefully prepared by the researcher. This brings up an entirely new set of pitfalls, including additional questions about methods, costs, and quality.
Library Construction with Initial Quality Assessment
- Library construction is the single biggest source of error. While errors in NGS data may arise from any step during sample/library preparation, sequencing, or data analysis, advances in the tools for sequencing and data analysis have dramatically reduced the variables in those steps. A recent web poll indicated that 48% of NGS respondents identify sample/library prep as the most problematic step for quality control (sequencing received zero responses). Sample prep is the step that most involves human hands and human decisions. Each NGS platform requires different adaptors and results in different read lengths, and creating a template library with fragment sizes not optimized for the platform can leave gaps in the genome, reducing the quality of alignments.
- Library construction is a profit center for most NGS providers. Because sample prep has many variables and seriously affects the quality of NGS data, vendors capitalize on this by charging for sample prep services. Again, it is not in the best interest of the NGS provider to educate its customers on how to perform this step better, so researchers often pay a premium for something that could be handled in their own labs at a lower cost.
- No up-to-date standards or guidelines are available. Each vendor, instrument manufacturer or sample prep company is selling its own product, and any neutral buyer's guide written only a couple of years ago is likely to be obsolete. Library construction methods, adapter tagging, targeted enrichment, amplification and fragment size analysis all are factors that affect downstream results.
Library construction questions that must be addressed:
- Which library preparation method is best for my needs?
- What is the correct insert size and how will it be checked?
- What are my turnaround needs?
- What prep scale do I need?
- What are my cost concerns?
- How do I assess the quality of my library?
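The insert-size question can be made concrete with a simple check of a fragment size distribution against a target window. This is a minimal sketch, assuming the fragment sizes are already available as a list of lengths in base pairs (for example, exported from a fragment analyzer). The window and the 80% pass threshold are hypothetical and not platform recommendations.

```python
from statistics import median
from typing import Dict, Sequence


def summarize_fragments(sizes_bp: Sequence[int], target_min_bp: int,
                        target_max_bp: int) -> Dict[str, float]:
    """Median fragment size and the fraction of fragments inside the target window."""
    if not sizes_bp:
        raise ValueError("no fragment sizes supplied")
    in_range = sum(target_min_bp <= s <= target_max_bp for s in sizes_bp)
    return {
        "median_bp": median(sizes_bp),
        "fraction_in_range": in_range / len(sizes_bp),
    }


def library_passes(sizes_bp: Sequence[int], target_min_bp: int, target_max_bp: int,
                   min_fraction: float = 0.8) -> bool:
    """True if enough fragments fall inside the window (threshold is illustrative)."""
    summary = summarize_fragments(sizes_bp, target_min_bp, target_max_bp)
    return summary["fraction_in_range"] >= min_fraction


if __name__ == "__main__":
    # Hypothetical fragment sizes with one short and one long outlier.
    sizes = [300, 320, 350, 380, 400, 150, 900, 340, 360, 330]
    print(summarize_fragments(sizes, 250, 450))
    print(library_passes(sizes, 250, 450))
```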
When the early steps of vendor selection and sample preparation are complete, the researcher switches from the human aspects of the process to machine aspects: sequencing, data storage, and analytics. Innovations in NGS technology have virtually eliminated bottlenecks due to the sequencing instrument itself, provided that the appropriate platform has been selected for a particular research goal. Therefore, optimizing the front-end and back-end steps of the sequencing process has become the new frontier for innovation.
After sequencing, the researcher has to interpret the sequence data. This process is computationally expensive and complex. Typically, institutions have relied on in-house servers and high-performance clusters to store the raw data and to run bioinformatics pipelines. Many different computational pipelines can be run at this stage, and the choice depends on the goal of the researcher:
- Re-mapping aligns short reads to a reference genome.
- De novo assembly assembles reads into a novel genome without any prior knowledge of the order, orientation, or repeating elements in the DNA.
- Expression analysis compares the expression of mRNA between samples.
- Metagenomic analysis is performed to understand the population complexity of a sample, or how many different organisms are in a sample.
I will discuss the details of each application in a separate post, but each bioinformatics analysis can be broken into three parts:
- Primary analysis evaluates the quality of the reads; a typical report should indicate whether the reads can be used for the downstream application.
- Secondary analysis is the mapping to the reference genome, the assembly of a de novo genome, the quantification of transcripts, and so on.
- Tertiary analysis turns the results of secondary analysis into actionable insights.
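As an illustration of the primary analysis step, the sketch below computes a very small quality report from FASTQ records. It assumes four-line FASTQ records with Phred+33 quality encoding. The function names and the mean-quality cutoff of 20 are hypothetical choices, not a standard report format.

```python
from typing import Dict, Iterable, Iterator, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    while True:
        header = next(it, None)
        if header is None:
            return
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        seq, sep, qual = next(it, None), next(it, None), next(it, None)
        if seq is None or sep is None or qual is None:
            raise ValueError("truncated FASTQ record")
        seq, qual = seq.strip(), qual.strip()
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header}")
        yield header[1:], seq, qual


def mean_phred(quality: str) -> float:
    """Mean Phred score of a read, assuming Phred+33 encoding."""
    if not quality:
        return 0.0
    return sum(ord(c) - 33 for c in quality) / len(quality)


def primary_qc_report(lines: Iterable[str], min_mean_quality: float = 20.0) -> Dict[str, float]:
    """Count reads and the fraction whose mean quality meets the cutoff."""
    total = passing = total_bases = 0
    for _read_id, seq, qual in parse_fastq(lines):
        total += 1
        total_bases += len(seq)
        if mean_phred(qual) >= min_mean_quality:
            passing += 1
    return {
        "total_reads": total,
        "passing_reads": passing,
        "pass_fraction": passing / total if total else 0.0,
        "mean_read_length": total_bases / total if total else 0.0,
    }


if __name__ == "__main__":
    # Two toy reads: 'I' encodes Q40 and '!' encodes Q0 under Phred+33.
    example = ["@read1", "ACGT", "+", "IIII", "@read2", "ACGT", "+", "!!!!"]
    print(primary_qc_report(example))
```

For a real file, the same function can be given an open file handle, for example `primary_qc_report(open(path))`, where `path` points to an uncompressed FASTQ file.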
Many different analysis packages can perform these tasks. One of the main bottlenecks in the analysis is installing the correct versions of software, that is, packages that are compatible with both the data generated by the sequencing machine and the downstream analysis tools. A common pitfall is trying to install every piece of software on the same machine. This leads to package collisions, contention over resources and, when many people depend on the same cluster, a lot of downtime. In recent years, there has been a movement to run genomics in the cloud. I will discuss cloud infrastructure in more detail in a future post.
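The collision problem can be seen even on paper. The sketch below takes the tool versions each pipeline expects and lists the tools that two pipelines pin to different versions, which is what makes a single shared installation fragile. The pipeline names, tool names, and version numbers are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def find_version_conflicts(
    pipeline_requirements: Dict[str, Dict[str, str]]
) -> Dict[str, Dict[str, List[str]]]:
    """Return {tool: {version: [pipelines]}} for tools pinned to more than one version."""
    by_tool: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for pipeline, tools in pipeline_requirements.items():
        for tool, version in tools.items():
            by_tool[tool][version].append(pipeline)
    return {
        tool: {version: sorted(pipelines) for version, pipelines in versions.items()}
        for tool, versions in by_tool.items()
        if len(versions) > 1  # only tools with disagreeing version pins
    }


if __name__ == "__main__":
    # Hypothetical pipelines and version pins, for illustration only.
    requirements = {
        "remapping": {"aligner": "1.2", "toolkit": "3.0"},
        "expression": {"aligner": "2.0", "toolkit": "3.0"},
    }
    print(find_version_conflicts(requirements))
```

Any tool reported here cannot satisfy both pipelines from one shared installation, which is one reason to give each pipeline its own isolated environment.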
- Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP). Available at: www.genome.gov/sequencingcosts. Accessed August 19, 2015.
- JP Morgan Life Sciences 2015 Outlook.
- Sullivan A, Sheffrin SM. Economics: Principles in Action. Upper Saddle River, New Jersey 07458: Pearson Prentice Hall; 2003. p. 153.