BGI continually develops hardware and software to handle huge volumes of bioinformatics data using HPC (high-performance computing) and cloud computing. The aim is to process massive amounts of data at lower cost and with higher efficiency, helping researchers around the world with their work.
Figure 1. Improvement of computing & storage capability in BGI
Two Centers of Cloud Computing in BGI
Currently, BGI’s cloud computing platform has two large centers: the Bio-Data Centre (CLiMB) and the Bio-Cloud Computing Centre (BGI Cloud).
Bio-Data Centre – CLiMB
CLiMB is a bio-data center designed independently by BGI. Its core features include data services and sequence alignment, among others. BGI aims to build a world-class bio-data center that supports the healthy and sustainable development of the life science industry and provides high-quality data to researchers in biology. CLiMB also supports the data services of GigaScience, a journal published by BGI in partnership with BioMed Central.
Bio-Cloud Computing Centre – BGI Cloud
BGI Cloud applies cloud storage and cloud computing technology to integrate common data with BGI’s own valuable, specialized sequencing data in genomics. BGI also deploys standard bioinformatics analysis pipelines and a range of alignment and analysis software, building a set of easy-to-use, flexible, one-stop automated services for mass data storage and computing. Users can process sequencing data remotely over the network and access bioinformatics analysis resources anytime, anywhere. By integrating powerful computing hardware with rich application software, BGI Cloud provides an end-to-end computing service to users around the world.
BGI Cloud Workflow
Figure 2. BGI Bio-Cloud Workflow
Cloud-Based Software in BGI
BGI has developed two new cloud-based software-as-a-service offerings for next-generation sequencing data analysis: Hecate and Gaea. They are “flexible computing” solutions for de novo assembly and genome re-sequencing, respectively.
Hecate is based on a series of distributed algorithms built on the map/reduce framework. These algorithms recognize and simplify non-branching, repeat-free regions of the genome, correct errors, and resolve ambiguous bubbles and short repeats. Combined with distributed graph shrinkage algorithms that construct a linear DNA sequence, Hecate greatly reduces running time and hardware cost for short-sequence assembly. For example, with 96 cores Hecate reaches 84% genome coverage in 42 hours on hardware costing $60,000, a saving of 28 hours and $90,000 compared with SOAPdenovo running on a single server (Table 1).
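To make the idea of simplifying non-branching regions more concrete, here is a minimal single-process sketch. It is not Hecate's code and it is not distributed: it assumes a de Bruijn graph representation (the text does not say which graph Hecate uses), and the function names `build_de_bruijn` and `compact_non_branching` are hypothetical, chosen only for illustration. Error correction, bubble resolution, and repeat handling are left out.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    succ = defaultdict(set)  # node -> set of successor nodes
    pred = defaultdict(set)  # node -> set of predecessor nodes
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            u, v = kmer[:-1], kmer[1:]
            succ[u].add(v)
            pred[v].add(u)
    return succ, pred


def compact_non_branching(succ, pred):
    """Collapse every maximal non-branching path into one sequence.

    Assumption for this sketch: isolated cycles (where every node has
    exactly one predecessor and one successor) are ignored.
    """
    nodes = set(succ) | set(pred)

    def is_inner(node):
        # An inner node has exactly one way in and one way out.
        return len(pred.get(node, ())) == 1 and len(succ.get(node, ())) == 1

    sequences = []
    for node in sorted(nodes):
        if is_inner(node):
            continue  # paths only start at branching or end nodes
        for nxt in sorted(succ.get(node, ())):
            path = node + nxt[-1]
            cur = nxt
            # Extend through inner nodes until a branch or dead end.
            while is_inner(cur):
                cur = next(iter(succ[cur]))
                path += cur[-1]
            sequences.append(path)
    return sequences


if __name__ == "__main__":
    # Two toy reads that share a prefix and then diverge.
    succ, pred = build_de_bruijn(["ATGGCGT", "ATGGCAA"], k=3)
    print(compact_non_branching(succ, pred))
    # prints ['ATGGC', 'GCAA', 'GCGT']
```

The shared prefix collapses into one sequence and each branch becomes its own sequence. A map/reduce version would have to perform the same merging with node-local information only, which is the harder part and is not shown here.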
|Human Genome|SOAPdenovo|Hecate (48 cores)|Hecate (96 cores)|
|---|---|---|---|
|Assembly Time|70 Hours|76 Hours|42 Hours|
|Number of Servers|1|6|12|
|Memory Size|500 GB x 1|24 GB x 6|24 GB x 12|
|Hardware Price|USD 150,000|USD 30,000|USD 60,000|
Table 1. The performance comparisons of Hecate & SOAPdenovo
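The savings quoted for Hecate follow directly from Table 1. The short sketch below redoes that arithmetic for both Hecate configurations; the figures are copied from the table, while the dictionary layout and the `savings` helper are only illustrative.

```python
# Figures copied from Table 1 (human genome assembly).
CONFIGS = {
    "SOAPdenovo": {"hours": 70, "servers": 1, "price_usd": 150_000},
    "Hecate (48 cores)": {"hours": 76, "servers": 6, "price_usd": 30_000},
    "Hecate (96 cores)": {"hours": 42, "servers": 12, "price_usd": 60_000},
}


def savings(baseline, candidate):
    """Return (hours saved, USD saved); negative values mean a loss."""
    return (
        baseline["hours"] - candidate["hours"],
        baseline["price_usd"] - candidate["price_usd"],
    )


if __name__ == "__main__":
    base = CONFIGS["SOAPdenovo"]
    for name in ("Hecate (48 cores)", "Hecate (96 cores)"):
        hours, usd = savings(base, CONFIGS[name])
        print(f"{name}: {hours} hours saved, USD {usd:,} saved")
    # Hecate (48 cores): -6 hours saved, USD 120,000 saved
    # Hecate (96 cores): 28 hours saved, USD 90,000 saved
```

According to the table, the 48-core configuration is 6 hours slower than SOAPdenovo but needs the cheapest hardware, while the 96-core configuration is both faster and cheaper than the single large-memory server.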
Gaea distributes re-sequencing computation across a cluster of nodes, greatly reducing running time and hardware cost. It is based on the Hadoop Streaming framework, with customized algorithm interfaces for running SOAP2, BWA, SAMTools, SOAPsnp, DIndel, and BGI’s realSFS. The current version of Gaea (v1.2) achieves speed increases of 75x for SOAP2 and 90x for BWA on 100 cores. On 400 cores those figures rise to 300x and 346x, in each case compared with running the algorithm on a single core.
Table 2. The performance of Gaea software.
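One way to read the Gaea speedups quoted above is as parallel efficiency, that is, speedup divided by the number of cores. The sketch below uses only the four figures stated in the text; the `parallel_efficiency` helper is an illustrative name, and efficiency is a derived metric, not one reported by BGI.

```python
# Speedups relative to a single core, as quoted for Gaea v1.2.
SPEEDUPS = {
    ("SOAP2", 100): 75,
    ("BWA", 100): 90,
    ("SOAP2", 400): 300,
    ("BWA", 400): 346,
}


def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear scaling achieved (1.0 means perfect)."""
    return speedup / cores


if __name__ == "__main__":
    for (tool, cores), speedup in SPEEDUPS.items():
        eff = parallel_efficiency(speedup, cores)
        print(f"{tool} on {cores} cores: {speedup}x speedup, {eff:.1%} efficiency")
```

On these numbers SOAP2 holds 75% efficiency at both 100 and 400 cores, while BWA goes from 90% at 100 cores to 86.5% at 400 cores.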