Our service follows best practices and covers every step of whole human genome sequencing, from DNA to a VCF file:
- Library construction using TruSeq PCR Free or Nano
- Extensive Quality Control of the library preparation
- High-throughput Sequencing using HiSeqX10
- Quality control of the data using FastQC
- Bioinformatics - fastq to VCF using Broad Institute best-practices GATK v.3.
- Data delivery and automatic back-up in a high-security Amazon cloud
Our goal is to guarantee the highest-quality data with a competitive turnaround time and the lowest cost. Here we describe the methods we use in this product.
We provide two library preparation protocols for high and low amounts of DNA.
TruSeq PCR Free - Preferred approach
The TruSeq PCR Free kit eliminates the PCR amplification steps of the standard TruSeq workflow to reduce PCR-induced library bias. This gives better coverage of GC-rich and repetitive regions that are traditionally difficult to sequence. It requires more input DNA than other kits. TruSeq PCR Free is the second kit currently supported on Illumina’s HiSeq X Ten platform (but only for the 350bp insert size).
TruSeq Nano - small amounts of DNA
The TruSeq Nano kit modifies Illumina’s industry-standard TruSeq sample prep to accommodate samples with limited input DNA available. TruSeq Nano was the first kit supported on Illumina’s HiSeq X Ten platform (but only for the 350bp insert size).
Sample and Library QC Requirements
Sample QC Requirements for Submitted samples:
Input DNA requirements for supported library methods are detailed in Table 1. Report the concentration (in ng/ul) measured by a fluorometric quantification method (e.g. Qubit or PicoGreen). Fluorometric quantification is recommended because it measures double-stranded DNA and is not skewed by other nucleic acids present in the sample. Also measure purity on a NanoDrop or equivalent; acceptable 260/280 ratios for submitted DNA samples are between 1.8 and 2.0.
Library QC for Submitted Libraries:
To confirm library quantification prior to clustering, qPCR is performed on multiplexed samples and a Bioanalyzer run is used to measure library size distribution.
Table 1. Input DNA requirements for supported library methods
|Library Kit||Shearing||Input DNA||Target Insert Size||Includes PCR|
|TruSeq PCR Free||Covaris||1-2ug||350bp or 550bp||No|
|TruSeq Nano||Covaris||200ng||350bp or 550bp||Yes|
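The sample requirements above can be expressed as a simple pre-submission check. The following is a minimal sketch, not part of our pipeline: the `Sample` class, its `volume_ul` field, and the `check_sample` function are hypothetical names introduced for illustration, and the lower bound of the 1-2ug range is assumed to be the minimum for TruSeq PCR Free.

```python
from dataclasses import dataclass
from typing import List

# Minimum input DNA per library kit in nanograms, taken from Table 1.
# Assumption: the lower end of "1-2ug" is treated as the PCR Free minimum.
MIN_INPUT_NG = {"TruSeq PCR Free": 1000.0, "TruSeq Nano": 200.0}

# Acceptable 260/280 absorbance ratio range for submitted DNA samples.
RATIO_260_280_RANGE = (1.8, 2.0)


@dataclass
class Sample:
    name: str
    concentration_ng_per_ul: float  # fluorometric reading (Qubit or PicoGreen)
    volume_ul: float                # hypothetical field: submitted volume
    ratio_260_280: float            # NanoDrop or equivalent


def check_sample(sample: Sample, kit: str) -> List[str]:
    """Return a list of QC problems; an empty list means the sample passes."""
    problems = []
    total_ng = sample.concentration_ng_per_ul * sample.volume_ul
    if total_ng < MIN_INPUT_NG[kit]:
        problems.append(
            f"{sample.name}: {total_ng:.0f} ng is below the "
            f"{MIN_INPUT_NG[kit]:.0f} ng needed for {kit}"
        )
    low, high = RATIO_260_280_RANGE
    if not (low <= sample.ratio_260_280 <= high):
        problems.append(
            f"{sample.name}: 260/280 ratio {sample.ratio_260_280} "
            f"is outside {low}-{high}"
        )
    return problems
```

A sample that fails the PCR Free input requirement may still qualify for TruSeq Nano, so the check is run per kit.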
Clustering & Sequencing
Illumina sequencing uses a bridge amplification reaction that occurs on the surface of the flow cell and produces clusters of identical DNA fragments. A flow cell containing millions of unique clusters is loaded into the HiSeq X Ten for automated cycles of extension and imaging. The HiSeq X Ten is the highest-throughput sequencing platform available to date.
Sequencing-by-Synthesis uses four proprietary nucleotides, each carrying a fluorophore and a reversible terminator. Each sequencing cycle occurs in the presence of all four nucleotides, which leads to higher accuracy than methods where only one nucleotide is present in the reaction mix at a time. The cycle repeats, one base at a time, generating a series of images that each represent a single base extension at a specific cluster.
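The idea that each image records one base extension per cluster can be illustrated with a toy base caller. This is a deliberately simplified sketch and not Illumina's base-calling software: it assumes one intensity value per nucleotide channel per cycle and ignores corrections such as channel crosstalk and phasing. The function name `call_bases` and the intensity values are made up for illustration.

```python
BASES = "ACGT"


def call_bases(intensities):
    """Call one base per cycle for a single cluster.

    intensities: a list with one (A, C, G, T) tuple of fluorescence
    values per sequencing cycle.
    """
    read = []
    for cycle in intensities:
        # The brightest channel identifies the nucleotide added this cycle.
        brightest = max(range(4), key=lambda i: cycle[i])
        read.append(BASES[brightest])
    return "".join(read)


# Three cycles of made-up intensities for one cluster.
cluster = [(0.9, 0.1, 0.0, 0.1), (0.1, 0.1, 0.8, 0.2), (0.0, 0.1, 0.1, 0.7)]
print(call_bases(cluster))  # AGT
```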
Primary Processing using FastQC
We integrated FastQC into our user interface to evaluate the quality of short reads before they are mapped to a reference genome. This tool is the standard in academia and industry and generates 10 plots with 11 metrics that help filter out bad reads before mapping and variant calling. Specifically, we provide the following metrics:
|Plot Generated||Plot Reporting Description||Pass Indicators||Fail Indicators|
|Per base sequence quality||Distribution of quality values at each base position across all reads: mean and median quality scores.||> 20||< 20; indicates that trimming may be required.|
|Per sequence quality score||Distribution of mean sequence quality (Phred) scores.||> 20; high-quality distribution.||< 20; may indicate a problem with the flow cell.|
|Per base sequence content||Proportion of each nucleotide at each position in the read.||Parallel lines, even distribution of the four nucleotides.||Differences of more than 20% between any of the bases.|
|Per sequence GC content||Distribution of GC content across all sequences.||Theoretical (normal) distribution.||Spikes indicate library contamination, such as adapters. Broad peaks can indicate cross-contamination.|
|Per base N content||Uncalled base (N) content at each position in the read.||No indicators.||More than 5% N. Trim N-rich reads if the Ns are near the 5' or 3' end.|
|Sequence length distribution||Library length consistency.||Reads are the same length in the library.||Varying lengths, usually after quality trimming.|
|Sequence duplication levels||How many reads are represented more than once.||Low % duplication.||High % duplication; can indicate PCR enrichment or biological enrichment.|
|Overrepresented sequences||List of overrepresented sequences.||No known contaminants.||Known contaminants (e.g. Illumina PCR primers).|
|Adapter content||Detects adapter sequences from read-through.||No adapter read-through.||Insert size is shorter than the read length. Apply adapter trimming.|
|K-mer content||K-mer content at each position in the read.||Low K-mer enrichment.||K-mer biases in reads; could indicate dimers or adapters.|
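To make two of the metrics above concrete, the sketch below computes the per-position median quality and N content from FASTQ records and flags positions that miss the thresholds in the table (quality below 20, more than 5% N). It is an illustration rather than FastQC's own implementation: the function names are hypothetical, and the quality strings are assumed to use the Phred+33 encoding.

```python
from statistics import median

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def read_fastq(lines):
    """Yield (sequence, quality_string) pairs from 4-line FASTQ records."""
    records = [line.rstrip("\n") for line in lines]
    for i in range(0, len(records) - 3, 4):
        yield records[i + 1], records[i + 3]


def per_base_summary(lines):
    """Return per-position median Phred score and fraction of N calls."""
    quals, n_counts, depth = [], [], []
    for seq, qual in read_fastq(lines):
        for pos, (base, q) in enumerate(zip(seq, qual)):
            if pos == len(quals):  # first read that reaches this position
                quals.append([])
                n_counts.append(0)
                depth.append(0)
            quals[pos].append(ord(q) - PHRED_OFFSET)
            n_counts[pos] += 1 if base == "N" else 0
            depth[pos] += 1
    return [
        {
            "position": pos + 1,
            "median_q": median(quals[pos]),
            "n_fraction": n_counts[pos] / depth[pos],
        }
        for pos in range(len(quals))
    ]


def flag_positions(summary, min_q=20, max_n=0.05):
    """Return 1-based positions with low median quality or too many Ns."""
    return [
        row["position"]
        for row in summary
        if row["median_q"] < min_q or row["n_fraction"] > max_n
    ]
```

Positions flagged near the 3' end of the reads are the usual candidates for quality trimming.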
Bioinformatics - Broad Institute best-practices GATK
Provided the data passes our quality control, we run the Broad Institute best-practices GATK v.3 pipeline to generate the VCF file. The steps and tools are listed below.
|Step||Tools||Description|
|Pre-processing||cutadapt||Remove adapter sequences. Split FASTQ files containing multiple samples using the FASTQ/A Barcode splitter. Filter, trim and mask nucleotides based on quality using the FASTQ Quality filter, trimmer and masker.|
|Alignment and Mapping||BWA 0.7.12, samtools 1.2, Picard 1.134||Align reads to the human reference genome, NCBI build 37, using BWA-MEM. Fix mate-pair information in the SAM file (samtools). Sort and convert to BAM (Picard). Remove duplicate reads (Picard).|
|Reducing Artifacts||GATK 3.4-46||Realign indels (RealignerTargetCreator, IndelRealigner).|
|Variant Calling||GATK 3.4-46||Single-sample variant calling (HaplotypeCaller in GVCF mode). Joint genotyping of the individual HaplotypeCaller calls (GenotypeGVCFs).|
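The alignment-to-VCF steps in the table correspond roughly to the command sequence assembled below. This is a sketch under stated assumptions, not our production configuration: the file names, the reference name `GRCh37.fa`, the read-group fields, and the function `build_commands` are hypothetical, pre-processing is omitted, and the command-line flags are given as documented for BWA, samtools 1.x, Picard 1.x and GATK 3 and should be checked against the installed versions. The function only builds the command strings and does not run them.

```python
def build_commands(sample, ref="GRCh37.fa", threads=8):
    """Build shell commands for one sample, from paired FASTQ to VCF."""
    s = sample
    # Read-group header for BWA; the doubled backslash keeps a literal "\t".
    read_group = f"@RG\\tID:{s}\\tSM:{s}\\tPL:ILLUMINA"
    picard = "java -jar picard.jar"
    gatk = f"java -jar GenomeAnalysisTK.jar -R {ref}"
    return [
        # Align reads with BWA-MEM.
        f"bwa mem -t {threads} -R '{read_group}' {ref} "
        f"{s}_R1.fastq.gz {s}_R2.fastq.gz > {s}.sam",
        # Fix mate-pair information.
        f"samtools fixmate -O bam {s}.sam {s}.fixmate.bam",
        # Sort by coordinate.
        f"{picard} SortSam INPUT={s}.fixmate.bam OUTPUT={s}.sorted.bam "
        f"SORT_ORDER=coordinate",
        # Remove duplicate reads and index the result.
        f"{picard} MarkDuplicates INPUT={s}.sorted.bam OUTPUT={s}.dedup.bam "
        f"METRICS_FILE={s}.dup_metrics.txt REMOVE_DUPLICATES=true "
        f"CREATE_INDEX=true",
        # Realign around indels.
        f"{gatk} -T RealignerTargetCreator -I {s}.dedup.bam -o {s}.intervals",
        f"{gatk} -T IndelRealigner -I {s}.dedup.bam "
        f"-targetIntervals {s}.intervals -o {s}.realigned.bam",
        # Call variants per sample in GVCF mode, then genotype.
        f"{gatk} -T HaplotypeCaller -I {s}.realigned.bam "
        f"--emitRefConfidence GVCF -o {s}.g.vcf",
        f"{gatk} -T GenotypeGVCFs --variant {s}.g.vcf -o {s}.vcf",
    ]


for command in build_commands("sample1"):
    print(command)
```

For joint genotyping of several samples, GenotypeGVCFs accepts one `--variant` argument per GVCF file.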
Data delivery in the AWS cloud
All data is stored in the cloud and backed up automatically. We provide two mechanisms for retrieving your data:
1. Cross-bucket S3 access - you can run all of your analysis in the cloud by simply pointing to your S3 bucket.
2. Rapid managed file transfer via our user interface. This option uses Globus managed file transfer, which protects your download even if connectivity is lost.
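How cross-bucket access is granted is not described here. One common AWS pattern is a bucket policy that gives another AWS account read access to a prefix, and the sketch below builds such a policy document. The bucket name, prefix, account ID and the function `cross_account_read_policy` are placeholders for illustration and do not describe our actual configuration.

```python
import json


def cross_account_read_policy(bucket, prefix, account_id):
    """Build an S3 bucket policy granting read access to one prefix."""
    principal = {"AWS": f"arn:aws:iam::{account_id}:root"}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Allow listing only the objects under the given prefix.
                "Sid": "ListDeliveredPrefix",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Allow downloading the objects themselves.
                "Sid": "ReadDeliveredObjects",
                "Effect": "Allow",
                "Principal": principal,
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


# Placeholder values only.
policy = cross_account_read_policy("example-delivery-bucket", "project-001", "123456789012")
print(json.dumps(policy, indent=2))
```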
We are ready to help you set up your analysis pipeline in the cloud. If you are interested, please contact firstname.lastname@example.org for an overview of our services.