Human Whole Exome Sequencing (WES) at 50X

Our service uses industry best practices to combine all necessary steps for Human Whole Exome Sequencing (WES), from DNA to a Variant Call Format (VCF) file:

  1. Exome capture using the Agilent SureSelectXT Human All Exon V5.5 kit, with more than 540K probes designed on reference hg19.
  2. Extensive quality control of the library preparation.
  3. High-throughput sequencing on the HiSeq 2500 Rapid Run platform with 2x100bp reads.
  4. Quality control of the data using FastQC.
  5. Bioinformatics: FASTQ to VCF following the Broad Institute best practices with GATK v.3.
  6. Data delivery and automatic back-up in a high-security Amazon cloud.

Our goal is to guarantee the highest quality data with competitive turnaround time and lowest costs. Here we describe the methods we use in this product.

Exome Capture

Agilent SureSelectXT kits combine two processes:

  1. Illumina-compatible DNA sequencing library construction
  2. Targeted enrichment by hybrid capture using 120-mer biotinylated cRNA baits

An overview of the SureSelectXT workflow can be found here:

The complete details of Agilent’s SureSelectXT protocol are available here:

Sample and Library QC Requirements

| Library Kit | Shearing | Input DNA (Total) | Input DNA (Concentration) | Target Insert Size |
|---|---|---|---|---|
| Agilent SureSelectXT Human All Exon V5 | Covaris | 1.0 ug | At least 50 ng/ul | 150-200bp |

Table 1. Shearing, input DNA requirements and insert size for SureSelectXT V5 libraries.

Sample QC Requirements for Submitted Samples:

A total of at least 1 ug of DNA is required for the construction of libraries using the Agilent SureSelectXT Human All Exon V5 kit. The concentration of submitted DNA should be at least 50 ng/ul.

We recommend using a fluorometric quantitation method (e.g. Qubit) in order to strictly measure the concentration of double-stranded DNA. If quantitation is performed on a spectrophotometer (e.g. NanoDrop), we recommend submitting twice the required amount (i.e. 2 ug at a concentration of at least 100 ng/ul as measured by the spectrophotometer).

(Note: the official Agilent recommendation is an input of 3 ug, but we have found 1 ug to be sufficient.)
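
The submission thresholds above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of our submission system: the function name, its parameters, and the method labels are hypothetical, and it encodes only the figures stated above (1 ug at 50 ng/ul for fluorometric measurements, doubled for spectrophotometric ones).

```python
# Thresholds taken from the text above; everything else is illustrative.
MIN_TOTAL_NG = 1000.0      # 1 ug of DNA in total
MIN_CONC_NG_PER_UL = 50.0  # at least 50 ng/ul


def meets_input_requirements(volume_ul, conc_ng_per_ul, method="fluorometric"):
    """Return True if a sample satisfies the stated input requirements.

    method: "fluorometric" (e.g. Qubit) or "spectrophotometric"
    (e.g. NanoDrop); the latter doubles both thresholds.
    """
    if method not in ("fluorometric", "spectrophotometric"):
        raise ValueError("unknown quantitation method: %r" % method)
    factor = 2.0 if method == "spectrophotometric" else 1.0
    total_ng = volume_ul * conc_ng_per_ul
    return (total_ng >= MIN_TOTAL_NG * factor
            and conc_ng_per_ul >= MIN_CONC_NG_PER_UL * factor)
```

For example, 20 ul at 50 ng/ul (1 ug in total) passes when measured fluorometrically, but the same reading from a spectrophotometer does not, because 2 ug at 100 ng/ul would then be expected.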

Library QC for Submitted Libraries:

To confirm library quantification prior to clustering, qPCR is performed on multiplexed samples and a Bioanalyzer run is used to measure library size distribution.

We provide the Bioanalyzer traces showing the insert size distribution before the sample is run on the sequencer.

Clustering & Sequencing

Illumina sequencing relies on a "bridge" amplification reaction that occurs on the surface of the flow cell. A flow cell containing millions of unique clusters is loaded into the HiSeq 2500 in Rapid Run mode for automated cycles of extension and imaging.

Sequencing-by-Synthesis (SBS) chemistry on the HiSeq 2500 utilizes four proprietary nucleotides possessing reversible fluorophore and termination properties. Each sequencing cycle occurs in the presence of all four nucleotides leading to higher accuracy than methods where only one nucleotide is present in the reaction mix at a time. This cycle repeats, one base at a time, generating a series of images each representing a single base extension at a specific cluster.
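
To relate the 2x100bp read format to the 50X target depth, the number of read pairs needed can be estimated from the size of the captured region. This is a rough sketch only: the function and its parameters are hypothetical, and the 50 Mb target size used in the example is an assumed round figure, not a value stated in this document. Real runs also lose reads to off-target capture and duplication, which the two optional parameters approximate.

```python
import math


def read_pairs_for_depth(depth, target_bp, read_len=100,
                         on_target=1.0, dup_rate=0.0):
    """Estimate the paired-end read pairs needed for a mean target depth.

    depth      desired mean coverage over the target (e.g. 50 for 50X)
    target_bp  size of the captured region in bases (assumed by the caller)
    read_len   length of each read in a pair (100 for 2x100bp)
    on_target  assumed fraction of sequenced bases that fall on the target
    dup_rate   assumed fraction of reads removed as duplicates
    """
    usable_bases_per_pair = 2 * read_len * on_target * (1.0 - dup_rate)
    return math.ceil(depth * target_bp / usable_bases_per_pair)


# Assumed 50 Mb target, ideal capture with no duplicates:
# 50 * 50,000,000 / 200 = 12,500,000 read pairs.
ideal_pairs = read_pairs_for_depth(50, 50_000_000)
```

Lowering `on_target` or raising `dup_rate` increases the estimate proportionally, which is why the delivered read count is normally higher than the ideal figure.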

Primary Processing using FastQC

We integrated FastQC into our user interface to evaluate the quality of short reads before mapping to a reference genome. This tool is a standard in academia and industry and generates 10 plots with 11 metrics to help filter out bad reads before mapping to the genome and calling variants. Specifically, we provide the following metrics:

| Plot Generated | Plot Reporting Description | Pass Indicators | Fail Indicators |
|---|---|---|---|
| Per base sequence quality | Distribution of quality values at each base position along the read (median quality scores). | > 20 | < 20; trimming may be required. |
| Per sequence quality scores | Distribution of mean sequence quality (Phred). | > 20; high-quality distribution. | Distribution < 20; the flow cell may have a problem. |
| Per base sequence content | Proportion of each base at each position in the read. | Parallel lines; even distribution of the four nucleotides. | Differences of more than 20% between any of the bases. |
| Per sequence GC content | Distribution of GC content across all sequences. | Theoretical (normal) distribution. | Spikes indicate contamination in the library, such as adapters. Broad peaks can mean cross-contamination. |
| Per base N content | Uncalled base (N) content at each position in the read. | No indicators. | More than 5% N content. Trim N-rich reads if the Ns are near the 5' or 3' end. |
| Sequence length distribution | Library length consistency. | Reads are the same length in the library. | Varying lengths, usually after quality trimming. |
| Sequence duplication levels | How many reads are represented more than once. | Low % duplication. | High % duplication; can indicate PCR enrichment or biological enrichment. |
| Overrepresented sequences | List of over-represented sequences. | No known contaminants. | Known contaminants (e.g. Illumina PCR primers). |
| Adapter content | Detection of adapter sequences from read-through. | No adapter read-through. | Insert size is shorter than the read length. Apply adapter trimming. |
| K-mer content | K-mer content at each position in the read. | Low K-mer enrichment. | Biases of K-mers in reads; could indicate dimers or adapters. |
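
The quality threshold of 20 in the table refers to Phred scores, where a score Q corresponds to an error probability of 10^(-Q/10), so Q20 means a 1% chance that the base call is wrong. The sketch below shows how such scores are decoded from a FASTQ quality string. It assumes the Phred+33 encoding used by current Illumina output, and the function names are illustrative; it is not how FastQC itself is implemented.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def mean_quality(quality_string, offset=33):
    """Arithmetic mean of the Phred scores of one read."""
    scores = phred_scores(quality_string, offset)
    return sum(scores) / len(scores)


def passes_threshold(quality_string, threshold=20, offset=33):
    """True if the read's mean Phred score is above the threshold."""
    return mean_quality(quality_string, offset) > threshold
```

With Phred+33, the character `I` decodes to 40 and `5` decodes to 20, so a read with the quality string `II55` has a mean quality of 30 and passes a threshold of 20.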

Bioinformatics - Broad Institute best-practices GATK

Provided the data passes our quality control, we run the Broad Institute best-practices workflow with GATK v.3 to generate the VCF file.

| GATK Analysis | Tools | Description |
|---|---|---|
| Pre-processing | cutadapt | Remove adapter sequences. Split FASTQ files containing multiple samples using the FASTQ/A Barcode splitter. Filter, trim and mask nucleotides based on quality: FASTQ Quality filter, trimmer and masker. |
| Alignment / Mapping | BWA 0.7.12, samtools 1.2, Picard 1.134 | Align reads to NCBI genome build 37 using BWA-MEM. Fix mate pairs on the SAM file (samtools). Sort and convert to BAM (Picard). Remove duplicate reads (Picard). |
| Reducing Artifacts | GATK 3.4-46 | Realign indels (RealignerTargetCreator, IndelRealigner). |
| Variant Calling | GATK 3.4-46 | Single-sample variant calling (HaplotypeCaller in GVCF mode). Joint genotyping of individual HaplotypeCaller calls (GenotypeGVCFs). |
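
As an illustration of how the alignment, artifact-reduction and variant-calling steps in the table fit together, the sketch below assembles one plausible command line per step for a single sample. It is not our production pipeline: all file names, the thread count, the read-group fields and the jar paths are assumptions, the pre-processing step is omitted, and the exact options used in production are not listed in this document. The function only builds command strings; it does not run them.

```python
def build_pipeline(sample, ref, fq1, fq2, threads=4):
    """Return shell command strings for one sample, in execution order.

    All paths and option values are illustrative placeholders.
    """
    picard = "java -jar picard.jar"         # assumed location of Picard
    gatk = "java -jar GenomeAnalysisTK.jar"  # assumed location of GATK 3
    # Literal "\t" sequences are expanded by bwa itself.
    read_group = "@RG\\tID:%s\\tSM:%s\\tPL:ILLUMINA" % (sample, sample)
    s = sample
    return [
        # Align paired reads with BWA-MEM; SAM is written to stdout.
        "bwa mem -t %d -R '%s' %s %s %s > %s.sam"
        % (threads, read_group, ref, fq1, fq2, s),
        # Fix mate information and write BAM.
        "samtools fixmate -O bam %s.sam %s.fixmate.bam" % (s, s),
        # Coordinate-sort with Picard.
        "%s SortSam I=%s.fixmate.bam O=%s.sorted.bam SORT_ORDER=coordinate"
        % (picard, s, s),
        # Remove duplicate reads and index the result.
        "%s MarkDuplicates I=%s.sorted.bam O=%s.dedup.bam M=%s.dup_metrics.txt "
        "REMOVE_DUPLICATES=true CREATE_INDEX=true" % (picard, s, s, s),
        # Find intervals that need indel realignment.
        "%s -T RealignerTargetCreator -R %s -I %s.dedup.bam -o %s.intervals"
        % (gatk, ref, s, s),
        # Realign reads around indels.
        "%s -T IndelRealigner -R %s -I %s.dedup.bam "
        "-targetIntervals %s.intervals -o %s.realigned.bam"
        % (gatk, ref, s, s, s),
        # Single-sample calling in GVCF mode.
        "%s -T HaplotypeCaller -R %s -I %s.realigned.bam "
        "--emitRefConfidence GVCF -o %s.g.vcf" % (gatk, ref, s, s),
        # Genotype the per-sample GVCF into the final VCF.
        "%s -T GenotypeGVCFs -R %s --variant %s.g.vcf -o %s.vcf"
        % (gatk, ref, s, s),
    ]
```

Calling `build_pipeline("S1", "ref.fa", "S1_R1.fastq", "S1_R2.fastq")` returns eight commands, starting with the BWA-MEM alignment and ending with GenotypeGVCFs. For joint genotyping across several samples, GenotypeGVCFs accepts one `--variant` argument per GVCF.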

Data Delivery in the AWS Cloud

All of our data is on the cloud and is backed up automatically. We provide two mechanisms for retrieving your data:

  1. Cross-bucket S3 access - you can run all of your analysis in the cloud by simply pointing to your S3 bucket.
  2. Rapid managed file transfer via our user interface. This option uses Globus managed file transfer to protect your download even if connectivity is lost.

We are ready to help you set up your analysis pipeline in the cloud. If you are interested, please contact us for an explanation of our services.

