Exome Sequencing

Exome sequencing is a technique that selectively sequences the protein-coding regions of a genome. The exome constitutes about 1% of the human genome, or roughly 30 Mb. It is the portion of the DNA sequence that is ultimately translated into protein, and it is generally the most functionally relevant portion for disease risk. Exome sequencing can be used to identify novel genes associated with rare Mendelian disorders and with common complex disorders such as cancer, diabetes, and obesity.


BGI provides two exome sequencing strategies, in which targeted enrichment of functional regions across the genome is achieved by either PCR amplification or hybridization capture. Both methods have been used successfully to identify genomic variations in large-scale complex disease studies. In targeted-region capture, the genomic regions of interest are enriched by microarray hybridization (NimbleGen Sequence Capture array) or solution hybridization (Agilent SureSelect system). The enriched templates then undergo massively parallel sequencing.

Workflow of exome sequencing 

The figure above illustrates the basic process of exome sequencing. A human genomic DNA sample is randomly fragmented by sonication or nebulization to an average size of 500 bp, and linkers are ligated to both ends of each DNA fragment. The linker-ligated fragments are then hybridized to the NimbleGen 2.1M Human Exome Array, after which the enriched exome DNA fragments are eluted from the array and undergo random ligation. The resulting long, exome-enriched DNA is again randomly fragmented, ligated with Illumina-compatible adapters, and enriched by PCR. Finally, the enriched DNA is sequenced on the Illumina Genome Analyzer (GA).


Bioinformatics analysis of exome sequencing includes alignment of clean reads to a reference genome, a histogram of the depth distribution in target regions, the sequencing depth and coverage of each CCDS exon, the evenness of exome capture sequencing, consensus genotype calling and SNP detection in the exome, and annotation of the resulting SNPs.
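As an illustration of the depth and coverage statistics mentioned above, the following is a minimal sketch rather than BGI's actual pipeline. It assumes a simplified, hypothetical input format in which reads and target regions are (chrom, start, end) tuples with 0-based, half-open coordinates; the function names are illustrative only. A real analysis would read alignments from BAM files and targets from the capture design file.

```python
from collections import Counter


def per_base_depth(reads, targets):
    """Return {(chrom, pos): depth} for every base inside the target regions."""
    depth = {}
    for chrom, start, end in targets:
        for pos in range(start, end):
            depth[(chrom, pos)] = 0
    for chrom, start, end in reads:
        for pos in range(start, end):
            if (chrom, pos) in depth:  # ignore bases outside the targets
                depth[(chrom, pos)] += 1
    return depth


def depth_histogram(depth):
    """Count how many target bases are covered at each depth."""
    return Counter(depth.values())


def coverage_at(depth, min_depth):
    """Fraction of target bases covered by at least min_depth reads."""
    if not depth:
        return 0.0
    covered = sum(1 for d in depth.values() if d >= min_depth)
    return covered / len(depth)


def mean_depth(depth):
    """Average sequencing depth over all target bases."""
    return sum(depth.values()) / len(depth) if depth else 0.0


# Toy data: one 10-bp target and three reads, one of them off target.
targets = [("chr1", 100, 110)]
reads = [("chr1", 95, 105), ("chr1", 100, 108), ("chr1", 200, 250)]

depth = per_base_depth(reads, targets)
print(dict(depth_histogram(depth)))
print(coverage_at(depth, 1), coverage_at(depth, 2), mean_depth(depth))
```

Evenness of capture can then be judged from how tightly the histogram clusters around the mean depth. The per-base dictionary is chosen here for clarity; for a full 30 Mb exome an array-based or interval-based representation would be more practical.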

Standard bioinformatics analysis: (1) base calling, with linker and adapter filtering; (2) alignment (the reference sequence should be provided); (3) a QC report of capture and sequencing, which includes the following:

  1. Proportion of reads in target regions
  2. Summary of data production
  3. Sequencing depth distribution per base
  4. Coverage distribution of target regions in samples
  5. Distribution of the mismatch rate in samples
  6. Distribution of target-region sequencing depth at different read quality thresholds
  7. Distribution of the mismatch rate in alignments for bases with different quality scores (excluding alignments at known SNP sites from the dbSNP database)
  8. Consistency check against known phenotype (gender, etc.)
  9. Ancestry clustering to identify and exclude individuals whose genetic ancestry does not match their reported ethnicity
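Two of the QC items above, the proportion of reads in target regions (item 1) and the mismatch rate by base quality with known SNP sites excluded (item 7), can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual QC software: reads and targets are hypothetical (chrom, start, end) tuples with 0-based, half-open coordinates, aligned bases are hypothetical (chrom, pos, read_base, ref_base, quality) tuples, and the known SNP set stands in for sites taken from dbSNP.

```python
def overlaps(read, target):
    """True if a read and a target share at least one base."""
    return read[0] == target[0] and read[1] < target[2] and target[1] < read[2]


def on_target_fraction(reads, targets):
    """Proportion of reads that overlap any target region."""
    if not reads:
        return 0.0
    hits = sum(1 for r in reads if any(overlaps(r, t) for t in targets))
    return hits / len(reads)


def mismatch_rate_by_quality(aligned_bases, known_snp_sites=frozenset()):
    """Mismatch rate per quality score, skipping known SNP positions."""
    totals = {}
    mismatches = {}
    for chrom, pos, read_base, ref_base, qual in aligned_bases:
        if (chrom, pos) in known_snp_sites:
            continue  # a mismatch here may be a real variant, not an error
        totals[qual] = totals.get(qual, 0) + 1
        if read_base != ref_base:
            mismatches[qual] = mismatches.get(qual, 0) + 1
    return {q: mismatches.get(q, 0) / n for q, n in totals.items()}


# Toy data (all values are made up for illustration).
targets = [("chr1", 100, 110)]
reads = [("chr1", 95, 105), ("chr1", 100, 108),
         ("chr1", 200, 250), ("chr2", 100, 150)]
bases = [
    ("chr1", 100, "A", "A", 30), ("chr1", 101, "C", "A", 30),
    ("chr1", 105, "A", "A", 30), ("chr1", 106, "A", "A", 30),
    ("chr1", 102, "G", "G", 10), ("chr1", 103, "T", "G", 10),
    ("chr1", 104, "T", "C", 10),  # known SNP site, excluded below
]
known = {("chr1", 104)}

print(on_target_fraction(reads, targets))
print(mismatch_rate_by_quality(bases, known))
```

Excluding known SNP sites keeps true variation from inflating the apparent error rate, so the remaining mismatches are a better proxy for sequencing and alignment errors at each quality score.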

Advanced bioinformatics analysis: SNP calling (including information on consensus sequences); SNP annotation and statistics; a summary of SNPs; candidate gene sets associated with disease (control samples and case information should be provided); and detection, annotation, and statistics of indels.
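One common way to derive a candidate gene set from case and control samples is to discard variants that also appear in controls and then keep genes that carry a remaining variant in several cases. The sketch below shows that idea only; it is an assumed, simplified filter and not necessarily the method used in this service. The input format (a dictionary mapping each sample name to a set of (gene, chrom, pos, ref, alt) tuples), the gene names, and the min_cases parameter are all hypothetical.

```python
from collections import defaultdict


def candidate_genes(case_variants, control_variants, min_cases=2):
    """Genes with control-absent variants in at least min_cases case samples."""
    # Pool every variant observed in any control sample.
    control_seen = set().union(*control_variants.values())

    carriers = defaultdict(set)  # gene -> case samples carrying a variant
    for sample, variants in case_variants.items():
        for variant in variants:
            if variant not in control_seen:
                gene = variant[0]
                carriers[gene].add(sample)

    return sorted(g for g, samples in carriers.items()
                  if len(samples) >= min_cases)


# Toy data: GENE_B's variant is also seen in a control, so it is filtered out.
cases = {
    "case1": {("GENE_A", "chr1", 100, "A", "G"), ("GENE_B", "chr2", 50, "C", "T")},
    "case2": {("GENE_A", "chr1", 120, "T", "C"), ("GENE_B", "chr2", 50, "C", "T")},
}
controls = {
    "ctrl1": {("GENE_B", "chr2", 50, "C", "T")},
}

print(candidate_genes(cases, controls))
```

Counting carriers per gene rather than per variant lets different cases contribute different variants in the same gene, which suits rare Mendelian disorders; for common complex disorders a statistical association test would normally replace this simple filter.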