We provide the highest standards of quality control and data delivery:
- Spectroscopic and fluorometric quantification of library samples
- qPCR quantification of multiplexed samples to reduce overclustering
- Bioanalyzer / TapeStation analysis of constructed libraries to validate insert size.
- Quality control of the data using FastQC
- Data delivery and automatic back-up in the secure Amazon cloud
Our goal is to deliver the highest-quality data with competitive turnaround times and the lowest costs. Here we describe the methods we use in this product.
Our service helps you choose the best sequencing solution for your constructed library. We are committed to high quality, so we ask that you provide the following when submitting a library:
Sample QC Requirements for Submitted samples:
Input DNA requirements for supported constructed library sequencing
- A separate tube for each index; please do not mix individual indexes.
- A CSV file listing the relationship between sample names and barcode pairs. See example below.
- Optional: Bioanalyzer or TapeStation traces of libraries. If you are not able to provide these, we will run this QC step at an additional cost. You will always have the option to approve library quality before sequencing.
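As an illustration of the sample sheet mentioned above, here is a minimal Python sketch that reads a CSV mapping sample names to barcode pairs and checks it for obvious mistakes. The column names (`sample_name`, `index1`, `index2`) and the example rows are assumptions made for this sketch, not a required submission format.

```python
import csv
import io

# Hypothetical sample sheet; column names and values are illustrative only.
EXAMPLE_SHEET = """sample_name,index1,index2
liver_rep1,ATCACG,TTAGGC
liver_rep2,CGATGT,TGACCA
"""


def read_sample_sheet(handle):
    """Return {sample_name: (index1, index2)}, rejecting duplicates and bad barcodes."""
    samples = {}
    seen_pairs = set()
    for row in csv.DictReader(handle):
        name = row["sample_name"].strip()
        pair = (row["index1"].strip().upper(), row["index2"].strip().upper())
        if name in samples:
            raise ValueError(f"duplicate sample name: {name}")
        if pair in seen_pairs:
            raise ValueError(f"barcode pair used twice: {pair}")
        if any(set(index) - set("ACGT") for index in pair):
            raise ValueError(f"non-ACGT character in barcode for {name}")
        samples[name] = pair
        seen_pairs.add(pair)
    return samples


sheet = read_sample_sheet(io.StringIO(EXAMPLE_SHEET))
for sample, (i1, i2) in sheet.items():
    print(sample, i1, i2)
```

Checking for repeated barcode pairs before submission matters because two samples sharing a pair cannot be separated after multiplexing.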
Clustering & Sequencing
Illumina utilizes a unique "bridge" amplification reaction that occurs on the surface of the flow cell. A flow cell containing millions of unique clusters is loaded into the HiSeqX10 for automated cycles of extension and imaging. HiSeqX10 is by far the most powerful sequencing platform available to date.
To confirm library quantification prior to clustering, qPCR is performed on multiplexed samples and a Bioanalyzer run is used to measure library size distribution after multiplexing at no additional cost.
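To show why both the concentration and the size distribution are needed before clustering, the sketch below converts a mass concentration into a molar concentration using the commonly used approximation of 660 g/mol per base pair of double-stranded DNA. The function name and the example values are assumptions for illustration; they do not describe our internal loading calculations.

```python
def library_molarity_nm(conc_ng_per_ul, mean_fragment_bp, bp_mass=660.0):
    """Approximate molarity (nM) of a double-stranded DNA library.

    conc_ng_per_ul   : concentration from fluorometry or qPCR, in ng/uL
    mean_fragment_bp : mean library size from a Bioanalyzer/TapeStation trace
    bp_mass          : average mass of one base pair in g/mol (approximation)
    """
    if conc_ng_per_ul < 0 or mean_fragment_bp <= 0:
        raise ValueError("concentration must be >= 0 and fragment size > 0")
    # ng/uL -> g/L is a factor of 1e-3; mol/L -> nmol/L is a factor of 1e9.
    return conc_ng_per_ul * 1e6 / (bp_mass * mean_fragment_bp)


# Hypothetical library: 2 ng/uL with a mean size of 400 bp.
molarity = library_molarity_nm(2.0, 400)
print(f"{molarity:.2f} nM")
```

For the same mass concentration, a library with shorter fragments contains more molecules, which is one reason an inaccurate size estimate can lead to overclustering.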
Sequencing-by-Synthesis utilizes four proprietary nucleotides possessing reversible fluorophore and termination properties. Each sequencing cycle occurs in the presence of all four nucleotides leading to higher accuracy than methods where only one nucleotide is present in the reaction mix at a time. This cycle repeats, one base at a time, generating a series of images each representing a single base extension at a specific cluster.
Primary Processing using FastQC
We integrated FastQC into our user interface to evaluate the quality of short reads before mapping to a reference genome. This tool is the standard in academia and industry and generates 10 plots with 11 metrics to help filter out bad reads before mapping to the genome and calling variants. Specifically, we provide the following metrics:
|Plot Generated||Plot Description||Pass Indicators||Fail Indicators|
|Per base sequence quality||Distribution of quality scores at each base position throughout the read: median and mean quality scores.||> 20||< 20; indicates trimming may be required.|
|Per sequence quality scores||Distribution plot of mean sequence quality (Phred).||> 20; high-quality distribution.||< 20; may indicate a flow cell problem.|
|Per base sequence content||Proportion of each nucleotide at each position in the read.||Parallel lines, even distribution of four nucleotides.||Differences of more than 20% between any of the bases.|
|Per sequence GC content||Distribution of GC content across all sequences.||Theoretical (normal) distribution.||Spikes indicate contamination in the library, such as adapters. Broad peaks can indicate cross-contamination.|
|Per base N content||Proportion of uncalled bases (N) at each position in the read.||No indicators.||More than 5%. Trim N-rich reads if near the 5' or 3' end.|
|Sequence length distribution||Library length consistency.||Reads are the same length in the library.||Varying lengths, usually after quality trimming.|
|Sequence duplication levels||Indicates how many reads are represented more than once.||Low % duplication.||High % duplication; can indicate PCR enrichment or biological enrichment.|
|Overrepresented sequences||List of over-represented sequences.||No known contaminants.||Known contaminants (e.g. Illumina PCR primers).|
|Adapter content||Detects adapter sequences caused by read-through.||No adapter read-through.||Insert size is shorter than the read length. Apply adapter trimming.|
|K-mer content||K-mer enrichment at each position in the read.||Low K-mer enrichment.||Biases of K-mers in reads; could indicate dimers or adapters.|
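The threshold of 20 used in the quality rows above is a Phred score, defined as Q = -10 * log10(p), where p is the probability that a base call is wrong. The sketch below shows the conversion and how a mean read quality can be computed from a FASTQ quality string. It assumes the Phred+33 ASCII encoding, and the helper names and example strings are hypothetical.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to the probability of an incorrect base call."""
    return 10 ** (-q / 10)


def mean_read_quality(quality_string, offset=33):
    """Arithmetic mean of the Phred scores in one FASTQ quality line (Phred+33 assumed)."""
    if not quality_string:
        raise ValueError("empty quality string")
    scores = [ord(char) - offset for char in quality_string]
    return sum(scores) / len(scores)


# Q20 means 1 error in 100 base calls; Q30 means 1 in 1000.
print(phred_to_error_prob(20))
print(phred_to_error_prob(30))

# 'I' encodes Q40 and '5' encodes Q20 under Phred+33.
print(mean_read_quality("IIII5555"))
```

A read whose mean score falls below 20 therefore has, on average, more than a 1% chance of error per base, which is why such reads are flagged for trimming or filtering.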
Optional Bioinformatics - Broad Institute best-practices GATK
Provided the data passes our quality control, we run the Broad Institute best-practices GATK v.3 to generate the VCF file.
|Pre-processing||cutadapt||Remove adapter sequences. Split FASTQ files containing multiple samples using FASTQ/A Barcode splitter. Filter, trim and mask nucleotides based on quality: FASTQ Quality filter, trimmer and masker.|
|Alignment Mapping||BWA-0.7.12 samtools 1.2 Picard 1.134||Alignment of reads to NCBI genome build 37 using BWA-MEM. Fix mated pairs on SAM file (samtools). Sort and convert to BAM (picard). Remove duplicate reads (picard).|
|Reducing Artifacts||GATK 3.4-46||Realign Indels (RealignerTargetCreator, IndelRealigner).|
|Variant Calling||GATK 3.4-46||Single-sample variant calling (HaplotypeCaller GVCF mode). Joint genotyping of individual HaplotypeCaller calls (GenotypeGVCFs).|
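To make the variant-calling step more concrete, the sketch below assembles GATK 3-style command lines for per-sample HaplotypeCaller in GVCF mode followed by joint genotyping with GenotypeGVCFs. It only builds the argument lists and does not run anything. The file names and the jar path are hypothetical, and the exact options used in our production pipeline may differ.

```python
def haplotypecaller_gvcf_cmd(reference, bam, out_gvcf, gatk_jar="GenomeAnalysisTK.jar"):
    """Argument list for single-sample calling in GVCF mode (GATK 3 syntax)."""
    return [
        "java", "-jar", gatk_jar,
        "-T", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "--emitRefConfidence", "GVCF",
        "-o", out_gvcf,
    ]


def genotype_gvcfs_cmd(reference, gvcfs, out_vcf, gatk_jar="GenomeAnalysisTK.jar"):
    """Argument list for joint genotyping of per-sample GVCFs (GATK 3 syntax)."""
    if not gvcfs:
        raise ValueError("at least one GVCF is required")
    cmd = ["java", "-jar", gatk_jar, "-T", "GenotypeGVCFs", "-R", reference]
    for gvcf in gvcfs:
        cmd += ["--variant", gvcf]
    cmd += ["-o", out_vcf]
    return cmd


# Hypothetical inputs, for illustration only.
samples = ["sampleA", "sampleB"]
per_sample = [
    haplotypecaller_gvcf_cmd("b37.fasta", f"{s}.realigned.bam", f"{s}.g.vcf")
    for s in samples
]
joint = genotype_gvcfs_cmd("b37.fasta", [f"{s}.g.vcf" for s in samples], "cohort.vcf")

for cmd in per_sample + [joint]:
    print(" ".join(cmd))
```

Each list could be passed to `subprocess.run` on a machine where Java and the GATK 3 jar are installed. Calling each sample separately and genotyping jointly afterwards lets new samples be added without re-running HaplotypeCaller on the existing ones.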
Data delivery in the AWS cloud
All of our data is stored in the cloud and is backed up automatically. We provide two mechanisms for retrieving your data:
- Cross-bucket S3 access - you can run all of your analysis in the cloud by simply pointing to your S3 bucket.
- Rapid managed file transfer via our user interface. The rapid file transfer uses Globus managed file transfer to protect your download even if connectivity is lost.
We are ready to help you set up your analysis pipeline in the cloud. If you are interested please contact firstname.lastname@example.org to get an explanation of services.