Complete Genomics

Building on advances in nanotechnology, large-scale computing and further development of high-throughput instrumentation, BGI-Tech has optimized the performance of the Complete Genomics platform. This is an advanced sequencing platform with industry-leading accuracy and sensitive detection of a vast array of variants across the genome. BGI has further developed the platform to operate across a wider range of applications. In addition to whole human genome sequencing, the Complete Genomics system can now be used for other sequencing studies, such as whole exome sequencing and clinical diagnostic testing.

Benefits of the Complete Genomics Platform

  • Superior Sequencing Performance with Proprietary Technologies
    • Industry-Leading Accuracy: Unparalleled accuracy of 1 error in 100,000 bases. This accuracy is enabled by proprietary library construction in conjunction with combinatorial probe-anchor ligation technology (see the Nature Biotechnology paper on platform comparisons)
    • Maximized Imaging Efficiency and minimized reagent usage result in reduced pricing
  • Enhanced Downstream Data Analysis Pipeline and Workflows for Advanced Genome Data Analysis
    • Sensitive Detection of Allele Heterozygosity: With our proprietary de novo variant mapping and assembly software, the ability of this platform to detect allele heterozygosity has been greatly enhanced.
    • Complete Set of Variants Reported: The Complete Genomics Platform is ideal for “finishing” sequencing, reporting the most complete set of variants in this field. Variants reported include SNPs, Indels, CNVs, SVs, and mobile element insertions
  • Improved Turnaround: The development of the 2-adaptor library construction process has reduced sequencing time
  • Broader Range of Capabilities: Whole exome sequencing and clinical diagnostic sequencing have been added to the whole genome sequencing capability
  • Proven and Robust Technology
    • Over 15,000 whole genomes have been sequenced (>45PB of data) using the Complete Genomics platform
    • Dozens of articles using the platform have been published in prestigious journals, such as Nature, Science, and Cell

Applications

Until recent chemistry improvements were implemented, the Complete Genomics sequencing platform was used only for whole genome sequencing, delivering industry-leading accuracy and comprehensive variant detection. By developing whole exome sequencing services on this platform, we have enabled our customers to realize the benefits of the technology for other applications. BGI’s services spare customers from investing heavily in sequencing instruments, high-performance computing resources, and the specialized personnel needed to run the systems and interpret the data. We will soon be expanding the platform’s application to other research areas as well.

In addition to its application in basic research, this platform will have an important role in clinical diagnostics. A recently published paper in Nature (Gilissen et al., 2014) demonstrated that whole genome sequencing on the Complete Genomics platform provided a diagnosis for 42 percent of patients with intellectual disability for whom other tests, including whole exome sequencing and genomic microarrays, had failed to yield an answer.

Sequencing Workflow

BGI has brought together diverse technologies to create a comprehensive solution for large-scale multi-omics studies based on the Complete Genomics platform. This solution integrates a sequencing platform that combines technological advancements in libraries, arrays, sequencing assays, and high-speed instruments with a suite of base-calling, mapping, assembly, and analysis software, thereby reducing reagent use and delivering high-quality data sets. Here, the whole genome sequencing workflow is shown as an example (Figure 1). The workflow differs slightly when the platform is applied to other research applications. Please refer to our service web page for further details.

Figure 1. Complete Genomics Sequencing Technology.

1. Library construction and self-assembling DNA nanoball arrays

Complete Genomics’ DNA libraries consist of DNA fragments with known synthetic DNA sequences (called adaptors) that are interspersed within the genomic DNA at regular intervals. The adaptors act as starting points for reading up to 26 bases from each adaptor-genomic DNA junction.

We use a proprietary library construction process to insert two adaptors into each DNA fragment (Figure 2). This two-adaptor approach supports 52-base reads (26 bases per mate pair). The read length may be increased by inserting more adaptors.

Figure 2. Two Adaptor Library Construction Process.

Sequencing is performed on amplified DNA clusters, which are referred to as DNA nanoballs (DNBs). Starting with a small circular DNA template (Figure 3) consisting of approximately 52 bases of genomic DNA and two synthetic adaptors, we generate a head-to-tail concatemer consisting of more than 200 copies of the circular template. We have developed a variety of proprietary techniques to form this concatemer into a ball, as well as to control its size, density and binding affinity to surfaces and other DNBs. One milliliter (mL) of reaction volume generates over 10^9 DNBs, sufficient for sequencing an entire human genome.

Figure 3. DNA Nanoball Formation.

We produce patterned substrates with two-dimensional arrays of spots that are activated to capture and hold DNBs (Figure 4). When a single DNB sticks to a spot, it repels other DNBs, resulting in at most one DNB per spot. We have developed a proprietary process to yield DNB array occupancies greater than 90%, without adherence of the DNBs to the areas between the spots. Each finished DNA nanoball array contains up to 180 billion bases of genomic DNA prepared for imaging.

Figure 4. DNA Nanoball Array Preparation.
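
The array capacity quoted above can be turned into a rough back-of-the-envelope estimate. The short Python sketch below uses only two figures from the text (up to 180 billion bases per array, about 52 genomic bases per DNB). The human genome size of roughly 3.1 billion bases is an outside assumption, and the result is raw capacity, not mapped coverage.

```python
# Figures quoted in the text.
BASES_PER_ARRAY = 180e9       # "up to 180 billion bases" per finished array
GENOMIC_BASES_PER_DNB = 52    # approximate genomic insert carried by one DNB

# Assumption (not from the text): haploid human genome of about 3.1 Gb.
HUMAN_GENOME_BASES = 3.1e9


def dnbs_needed(total_bases, bases_per_dnb):
    """Number of DNBs required to carry total_bases of genomic DNA."""
    return total_bases / bases_per_dnb


def raw_coverage(total_bases, genome_bases):
    """Raw fold coverage, before any mapping or filtering losses."""
    return total_bases / genome_bases


dnbs = dnbs_needed(BASES_PER_ARRAY, GENOMIC_BASES_PER_DNB)
coverage = raw_coverage(BASES_PER_ARRAY, HUMAN_GENOME_BASES)

print(f"DNBs on a full array: ~{dnbs:.2e}")
print(f"Raw fold coverage of one human genome: ~{coverage:.0f}x")
```

Under these assumptions, a full array corresponds to a few billion DNBs and to several tens of fold raw coverage of a single human genome.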

2. Sequencing technology: Combinatorial Probe-Anchor Ligation (cPAL™)

The unique cPAL™ technology combines hybridization and ligation to produce reads of industry-leading accuracy with minimal reagent usage. As reported in Nature Biotechnology, a research group at Stanford University compared the sensitivity and accuracy of the Complete Genomics and Illumina platforms for single nucleotide variant (SNV) calling (Figure 5). In total, 88.1% (3,295,023 out of 3,739,701) of the unique SNVs were concordant. The remaining 444,678 SNVs were detected by only one of the two platforms: 345,100 were specific to Illumina (10.5% of the Illumina-combined SNVs) and 99,578 were specific to Complete Genomics (3.0% of the Complete Genomics-combined SNVs). To determine accuracy directly, the group randomly selected concordant and platform-specific SNVs for Sanger sequencing. All 20 of 20 concordant SNVs were validated, compared with 2 of 15 (13.3%) Illumina-specific and 17 of 18 (94.4%) Complete Genomics-specific SNVs. This finding suggests that Complete Genomics has higher accuracy than Illumina and that almost all of the concordant calls are correct.

Figure 5. Platform comparison for SNV calling
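
The headline numbers from this comparison can be re-derived from the raw counts. The Python sketch below recomputes only the totals, the concordance rate and the Sanger validation rates, using the counts quoted in the text. The variable names are illustrative.

```python
# SNV counts quoted in the platform comparison.
concordant = 3_295_023
illumina_only = 345_100
cg_only = 99_578

total_unique = concordant + illumina_only + cg_only
platform_specific = illumina_only + cg_only
concordance = concordant / total_unique


def rate(validated, tested):
    """Fraction of Sanger-tested SNVs that were confirmed."""
    return validated / tested


validation = {
    "concordant": rate(20, 20),
    "Illumina-specific": rate(2, 15),
    "Complete Genomics-specific": rate(17, 18),
}

print(f"unique SNVs: {total_unique:,}")
print(f"platform-specific SNVs: {platform_specific:,}")
print(f"concordance: {concordance:.1%}")
for label, value in validation.items():
    print(f"validated ({label}): {value:.1%}")
```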

cPAL uses pools of probes labeled with four distinct dyes (one per base) to read the positions adjacent to each adaptor (Figure 6). Each read position has a separate pool of probes. Complete Genomics’ proprietary approach allows 13 contiguous bases to be read from each end of an adaptor.

Figure 6. Combinatorial Probe-Anchor Ligation (cPAL™)
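
As a rough illustration of how read length follows from the adaptor layout, the Python sketch below multiplies out the figures given in the text (13 bases read from each end of an adaptor, two adaptors per fragment) and extracts the flanking bases from a toy fragment. The sequences and function names are hypothetical, and the extrapolation to more adaptors assumes that read length scales linearly, which the text only suggests.

```python
BASES_PER_ADAPTOR_END = 13   # from the text
ENDS_PER_ADAPTOR = 2         # genomic DNA is read on both sides of an adaptor


def read_length(n_adaptors, bases_per_end=BASES_PER_ADAPTOR_END):
    """Total genomic bases read per fragment, assuming linear scaling."""
    return n_adaptors * ENDS_PER_ADAPTOR * bases_per_end


def flanking_bases(fragment, adaptor, n=BASES_PER_ADAPTOR_END):
    """Return up to n genomic bases on each side of the adaptor."""
    start = fragment.index(adaptor)
    end = start + len(adaptor)
    left = fragment[max(0, start - n):start]
    right = fragment[end:end + n]
    return left, right


# Hypothetical toy sequences, for illustration only.
adaptor = "TTTTGGGGCCCC"
fragment = "GG" + "ACGTACGTACGTA" + adaptor + "CATGCATGCATGC" + "AA"

left, right = flanking_bases(fragment, adaptor)
print(read_length(2), left, right)
```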

One of the unique advantages of cPAL is random access (independent and non-iterative base reading). Each base-read cycle does not depend on the completion of any of the previous cycles. This provides excellent fault tolerance: if a base read fails, it does not prevent interpretation of the remaining reads for that DNB, and, if desired, the failed base can simply be re-assayed.
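
The idea of independent base reading can be sketched in a few lines of Python. This is a toy model, not the actual chemistry or software. Each position is read on its own, a failed position yields no call without affecting the others, and only the failed positions are repeated. The template sequence and the set of failing positions are hypothetical.

```python
def read_cycle(template, position, failing):
    """Return the base at `position`, or None if this cycle failed."""
    if position in failing:
        return None
    return template[position]


def read_all(template, positions, failing=frozenset()):
    """Read every position independently; no call depends on another."""
    return {p: read_cycle(template, p, failing) for p in positions}


def reassay(template, calls, failing=frozenset()):
    """Repeat only the failed positions and merge the new calls."""
    retry = [p for p, base in calls.items() if base is None]
    updated = dict(calls)
    updated.update(read_all(template, retry, failing))
    return updated


template = "ACGTTGCAAGCTA"   # hypothetical 13-base stretch next to an adaptor

first_pass = read_all(template, range(len(template)), failing={4})
second_pass = reassay(template, first_pass)

print(first_pass)
print("".join(second_pass[p] for p in range(len(template))))
```

In this model a failure at position 4 leaves positions 5 to 12 readable, which is the contrast with an iterative scheme in which later positions depend on earlier cycles.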

Another key advantage of independent base reading is its tolerance of low ligation yield per cycle. This tolerance dramatically reduces the required probe and enzyme concentrations, thereby substantially reducing reagent costs. cPAL also allows multiple positions to be read per cycle, which is not possible with sequencing by synthesis. Reading multiple positions per cycle decreases the number of cycles, thereby reducing reagent consumption and imaging time.

3. Comprehensive and proprietary data analysis solution

We have developed our own suite of base-calling, mapping, assembly, and analysis software to rapidly reconstruct genomes from billions of mate pair reads, with sensitive variant detection, and accurate annotation across all variation types including SNPs, indels, copy number variations (CNVs), structural variations (SVs), and mobile element insertions (MEIs).

Mapping and assembly

Mapping/assembly is the process by which computers are used to organize all of the overlapping reads to reconstruct a complete genome. We have developed a proprietary approach that uses a combination of advanced data analysis algorithms and statistical modeling techniques to accurately reconstruct over 90% of the complete human genome.
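
To make the idea of mapping concrete, the Python sketch below places short reads on a reference by exact k-mer lookup and reports the fraction of reference positions covered. It is a deliberately simplified toy and not the proprietary approach described above: there are no mismatches, no mate-pair information and no statistical modeling, and the sequences are invented.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer in the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read, reference, index, k):
    """Return all positions where the read matches the reference exactly."""
    hits = []
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits


def covered_fraction(reference, reads, k=4):
    """Fraction of reference positions covered by uniquely mapped reads."""
    index = build_index(reference, k)
    covered = [False] * len(reference)
    for read in reads:
        hits = map_read(read, reference, index, k)
        if len(hits) == 1:   # keep uniquely placed reads only
            for i in range(hits[0], hits[0] + len(read)):
                covered[i] = True
    return sum(covered) / len(reference)


# Hypothetical toy data.
reference = "ACGTTGCAAGCTAGGCTTAACCGGATAT"
reads = ["ACGTTGCAAG", "AGCTAGGCTT", "TTAACCGGAT"]

print(f"{covered_fraction(reference, reads):.0%} of the toy reference is covered")
```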

Data analysis

We have developed a suite of analysis tools that enable customers to rapidly analyze the data we generate from their samples. The provided set of open-source tools allows researchers to compare variations between genomes, convert Complete Genomics native file formats to standard formats such as SAM, and perform other genomic data manipulations.
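
For readers unfamiliar with SAM, the Python sketch below shows what the target of such a conversion looks like: one tab-separated line with the 11 mandatory SAM columns. The input dictionary is a hypothetical stand-in and does not represent the Complete Genomics native format. The sketch also assumes a simple ungapped alignment, so an actual conversion should rely on the provided tools.

```python
def to_sam_line(read):
    """Format one aligned read as the 11 mandatory SAM columns."""
    fields = [
        read["name"],                 # QNAME: read identifier
        str(read.get("flag", 0)),     # FLAG: 0 = unpaired, forward strand
        read["chrom"],                # RNAME: reference sequence name
        str(read["start0"] + 1),      # POS: SAM positions are 1-based
        str(read.get("mapq", 255)),   # MAPQ: 255 = quality not available
        f'{len(read["seq"])}M',       # CIGAR: assumes an ungapped match
        "*",                          # RNEXT: mate reference not described
        "0",                          # PNEXT: mate position not described
        "0",                          # TLEN: template length not described
        read["seq"],                  # SEQ: read bases
        read.get("qual", "*"),        # QUAL: "*" = qualities not stored
    ]
    return "\t".join(fields)


# Hypothetical read record with a 0-based start coordinate.
record = {"name": "dnb_0001", "chrom": "chr1", "start0": 99, "seq": "ACGTTGCAAGCTA"}
print(to_sam_line(record))
```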

Our biological annotations are derived from a number of public annotation databases including the following:

  • RefSeq alignments in NCBI’s annotation builds for gene and functional annotation of variants: to determine whether a variant overlaps with a particular gene and what functional impact the variant may have (e.g., missense mutation, gene fusion event)
  • miRBase: to identify whether variants overlap with microRNAs
  • COSMIC: to identify variants previously detected in cancer samples
  • dbSNP: to annotate variants that overlap with dbSNP entries, which are annotated if a disease association is known
  • DGV (database of genomic variants): to annotate CNVs
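
Most of the annotations listed above reduce to asking whether a variant overlaps a known feature. The Python sketch below shows that test on half-open intervals. The feature names and coordinates are invented for illustration and are not taken from RefSeq, miRBase or any other database.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    label: str


def overlaps(a, b):
    """True if two half-open intervals on the same chromosome intersect."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def annotate(variant, features):
    """Labels of all features that the variant overlaps."""
    return [f.label for f in features if overlaps(variant, f)]


# Hypothetical features and variant.
features = [
    Interval("chr1", 1000, 5000, "GENE_A"),
    Interval("chr1", 4500, 4600, "MIRNA_X"),
    Interval("chr2", 1000, 5000, "GENE_B"),
]
snv = Interval("chr1", 4550, 4551, "snv_1")

print(annotate(snv, features))
```

A linear scan is enough for a handful of features. Genome-scale annotation would normally use a sorted or tree-based interval index instead.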

Selected Publications

1. Genome sequencing identifies major causes of severe intellectual disability. Christian Gilissen et al. Nature (2014) doi:10.1038/nature13394
2. Exploring the Occurrence of Classic Selective Sweeps in Humans Using Whole-Genome Sequencing Datasets. Maud Fagny et al. Mol Biol Evol (2014) doi:10.1093/molbev/msu118
3. Eating disorder predisposition is associated with ESRRA and HDAC4 mutations. Huxing Cui et al. J Clin Invest (2013) 123(11), 4706–4713
4. Truncating mutations of MAGEL2 cause Prader-Willi phenotypes and autism. Christian P Schaaf et al. Nature Genetics (2013) 45, 1405–1408
5. Evolutionary History and Adaptation from High-Coverage Whole-Genome Sequences of Diverse African Hunter-Gatherers. Joseph Lachance et al. Cell (2012) 150, 457–469
6. Optimized filtering reduces the error rate in detecting genomic variants by short-read sequencing. Joke Reumers et al. Nature Biotechnology (2012) 30, 61–68
7. Analysis of Genetic Inheritance in a Family Quartet by Whole-Genome Sequencing. Jared C. Roach et al. Science (2010) 636–639