Data & Analysis

Cloud Computing

BGI is constantly developing hardware and software to handle huge volumes of bioinformatics data using HPC (High Performance Computing) and cloud computing. The aim is to process massive amounts of data at lower cost and with higher efficiency, helping researchers all over the world with their work.

Figure 1. Improvement of computing & storage capability in BGI

Two Centers of Cloud Computing in BGI

Currently, BGI’s cloud computing platform has two large centers: the Bio-Data Centre (CLiMB) and the Bio-Cloud Computing Centre (BGI Cloud).

Bio-Data Centre - CLiMB

CLiMB is a bio-data center independently designed by BGI. Its core features include data services, sequence alignment and many others. BGI aims to build a world-class bio-data center that supports the healthy and sustainable development of the life science industry and provides high-quality data to researchers in biology. CLiMB also supports the data service for GigaScience, a journal published by BGI in partnership with BioMed Central.

Link: http://climb.genomics.org.cn

Bio-Cloud Computing Centre - BGI Cloud

BGI Cloud applies advanced cloud storage and cloud computing technology, integrating common genomics data with BGI's own valuable, specific sequencing data. BGI also deploys standard bioinformatics analysis pipelines and many kinds of alignment and analysis software to build a set of easy-to-use, flexible, one-stop automatic services for mass data storage and computing. Users can process sequencing data remotely over the network and access bioinformatics analysis resources anytime, anywhere. By integrating powerful computing hardware and rich application software, BGI Cloud provides an end-to-end computing service solution to users around the world.

BGI Cloud Workflow

Figure 2. BGI Bio-Cloud Workflow

Cloud-Based Software in BGI

BGI has developed two new cloud-based software-as-a-service offerings for next-gen data analysis called Hecate and Gaea. They are “flexible computing” solutions for de novo assembly and genome re-sequencing.

Hecate

Hecate is built on a series of distributed algorithms that run on the MapReduce framework. These algorithms recognize and simplify non-branching, repeat-free regions of the genome, correct errors, and resolve ambiguous bubbles and short repeats. Combined with distributed graph-shrinkage algorithms that construct a linear DNA sequence, Hecate greatly reduces the running time and hardware cost of short-sequence assembly. For example, with 96 cores Hecate reaches 84% genome coverage in 42 hours on hardware costing USD 60,000, a saving of 28 hours and USD 90,000 compared with SOAPdenovo running on a single server (Table 1).
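The idea of simplifying non-branching regions can be illustrated on a single machine. The sketch below is not Hecate's implementation, which is distributed and not described in detail here. It assumes a plain de Bruijn graph over (k-1)-mers, ignores reverse complements, sequencing errors and isolated cycles, and uses hypothetical function names (`build_graph`, `compact_paths`). It only shows how chains of nodes with exactly one predecessor and one successor collapse into a single sequence.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Build successor/predecessor sets of a de Bruijn graph over (k-1)-mers."""
    succ = defaultdict(set)
    pred = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            left, right = kmer[:-1], kmer[1:]
            succ[left].add(right)
            pred[right].add(left)
    return succ, pred


def compact_paths(succ, pred):
    """Merge every maximal non-branching path into one sequence.

    Simplification: cycles made only of non-branching nodes are skipped.
    """
    nodes = set(succ) | set(pred)

    def is_inner(node):
        # A non-branching node has exactly one way in and one way out.
        return len(pred.get(node, ())) == 1 and len(succ.get(node, ())) == 1

    contigs = []
    for node in sorted(nodes):
        if is_inner(node):
            continue  # paths start only at branching or terminal nodes
        for nxt in sorted(succ.get(node, ())):
            path = node + nxt[-1]
            cur = nxt
            while is_inner(cur):
                cur = next(iter(succ[cur]))
                path += cur[-1]
            contigs.append(path)
    return contigs


reads = ["ACGTT", "GTTA", "GTTC"]
contigs = compact_paths(*build_graph(reads, k=3))
```

With these toy reads, the chain AC, CG, GT, TT collapses into one sequence, and the branch at TT yields two short sequences.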

Human Genome	SOAPdenovo	Hecate (48 cores)	Hecate (96 cores)
Genome Coverage	80%	84%	84%
Contig N50	1,050	940	940
Contig N90	205	100	100
Scaffold N50	440,000	300,000	300,000
Scaffold N90	78,000	20,000	20,000
Mismatch Rate	10^-6	10^-6~10^-5	10^-6~10^-5
Assembly Time	70 Hours	76 Hours	42 Hours
Number of Servers	1	6	12
Memory Size	500 GB x 1	24 GB x 6	24 GB x 12
Hardware Price	USD 150,000	USD 30,000	USD 60,000
Mode	Centralized	Distributed	Distributed

Table 1. Performance comparison of Hecate and SOAPdenovo
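The time and cost differences implied by Table 1 can be checked with a few lines of arithmetic. The sketch below only transcribes the assembly times and hardware prices from the table. The dictionary layout and the function name `savings_vs_baseline` are illustrative choices.

```python
# Assembly time (hours) and hardware price (USD) transcribed from Table 1.
TABLE_1 = {
    "SOAPdenovo": {"hours": 70, "hardware_usd": 150_000},
    "Hecate (48 cores)": {"hours": 76, "hardware_usd": 30_000},
    "Hecate (96 cores)": {"hours": 42, "hardware_usd": 60_000},
}


def savings_vs_baseline(table, baseline="SOAPdenovo"):
    """Return hours and USD saved by each configuration relative to the baseline."""
    base = table[baseline]
    return {
        name: {
            "hours_saved": base["hours"] - row["hours"],
            "usd_saved": base["hardware_usd"] - row["hardware_usd"],
        }
        for name, row in table.items()
        if name != baseline
    }


savings = savings_vs_baseline(TABLE_1)
```

By these figures, the 96-core configuration saves 28 hours and USD 90,000. The 48-core configuration saves USD 120,000 in hardware but takes 6 hours longer than SOAPdenovo.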

Gaea

Gaea distributes re-sequencing computation across a cluster of nodes to greatly reduce running time and hardware cost. It is based on the Hadoop Streaming framework, with personalized algorithm interfaces for running SOAP2, BWA, SAMtools, SOAPsnp, DIndel, and BGI’s realSFS. The current version of Gaea (v1.2) achieves speedups of 75x for SOAP2 and 90x for BWA on 100 cores. At 400 cores, the speedups rise to 300x and 346x respectively, compared with running either tool on a single core (Table 2).
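For readers unfamiliar with Hadoop Streaming, the sketch below shows its general contract: a mapper and a reducer exchange tab-separated key/value lines, and the reducer receives its input sorted by key. It does not show Gaea's own interfaces, which are not documented here. The example task (counting mapped reads per reference sequence in SAM-formatted lines) and the function names `map_lines` and `reduce_lines` are assumptions chosen for illustration.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'reference<TAB>1' for every mapped SAM alignment line."""
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[2] == "*":
            continue  # third SAM column is the reference name; '*' means unmapped
        yield f"{fields[2]}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts per key; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


# Local simulation of the map -> sort -> reduce stages on toy input.
sam = [
    "@HD\tVN:1.6\n",
    "r1\t0\tchr1\t100\n",
    "r2\t0\tchr2\t5\n",
    "r3\t4\t*\t0\n",
    "r4\t16\tchr1\t7\n",
]
counts = list(reduce_lines(sorted(map_lines(sam))))
```

In a real streaming job, each function would read from standard input and print to standard output, and the framework would perform the sort between the two stages.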

	Time (1 core)	Speedups (100 cores)	Speedups (200 cores)	Speedups (400 cores)
SOAP2	82,200s	75x	152x	300x
BWA	62,000s	90x	168x	346x
SAMtools	360s	80x	155x	300x
DIndel	46,000s	60x	112x	210x
realSFS	4,000s	40x	75x	140x
Hardware Expense	$400	$40,000	$80,000	$160,000
Mode	Centralized	Distributed	Distributed	Distributed

Table 2. The performance of Gaea software.
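One way to read Table 2 is through parallel efficiency, defined as speedup divided by core count. The sketch below transcribes the table values and computes that ratio. The estimated wall-clock time is derived here as single-core time divided by speedup and is not a figure reported in the table. The names `parallel_efficiency` and `estimated_seconds` are illustrative.

```python
# Single-core time in seconds, then speedup by core count, transcribed from Table 2.
TABLE_2 = {
    "SOAP2": (82_200, {100: 75, 200: 152, 400: 300}),
    "BWA": (62_000, {100: 90, 200: 168, 400: 346}),
    "SAMtools": (360, {100: 80, 200: 155, 400: 300}),
    "DIndel": (46_000, {100: 60, 200: 112, 400: 210}),
    "realSFS": (4_000, {100: 40, 200: 75, 400: 140}),
}


def parallel_efficiency(tool, cores):
    """Speedup divided by core count; 1.0 would be perfect linear scaling."""
    _, speedups = TABLE_2[tool]
    return speedups[cores] / cores


def estimated_seconds(tool, cores):
    """Derived estimate: single-core time divided by the reported speedup."""
    single_core_seconds, speedups = TABLE_2[tool]
    return single_core_seconds / speedups[cores]


for tool in TABLE_2:
    row = ", ".join(f"{c} cores: {parallel_efficiency(tool, c):.2f}" for c in (100, 200, 400))
    print(f"{tool}: {row}")
```

By this measure SOAP2 holds an efficiency of about 0.75 from 100 to 400 cores, while realSFS drops from 0.40 to 0.35, so the tools do not all scale equally well.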