Technical Information

Human tumor xenografts are highly useful in cancer research and make up most of the tumor biology models used in cancer drug discovery. Tumors can be maintained by serial xenografting in athymic (nude) or severe combined immunodeficient (SCID) mice. However, sequences that are very similar between the human and mouse genomes may be impossible to tell apart, and these similar sequences cover nearly 13% of the human genome. Contamination by mouse stromal cells therefore complicates the downstream bioinformatics analysis of massively parallel sequencing data. Simulations suggest that the resulting misassignment cannot be detected in a mixed sample, regardless of the sequencing coverage achieved. To our knowledge, there is currently no versatile algorithm that excludes potential contamination from mouse genomic DNA.

Thus, an accurate and sensitive bioinformatics analysis pipeline is important for distinguishing human reads from mouse reads in the genomic DNA of xenografted tumors. BGI has developed an advanced strategy called PDXomics to filter out the non-malignant contamination in xenografts, which comes largely from murine cells.


  • Effectively removes murine DNA contamination.
  • High accuracy of variant detection in xenograft samples.
  • High throughput and cost-effective.




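We conducted an internal test on three pairs of primary tumor samples and their matched xenografted tumor samples, using mm9 as the mouse reference. The results showed that PDXomics substantially reduced the false-positive rate of identified single-nucleotide mutations (from 67.27-199.79% with the standard pipeline to 2.72-3.06% with PDXomics), while the false-negative rate remained comparable.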


Comparison results between standard pipelines and PDXomics

Pipeline                Metric    PT1 vs. XT1 (a)   PT2 vs. XT2   PT3 vs. XT3
Standard pipeline (b)   FPR (c)   67.27%            132.60%       199.79%
Standard pipeline (b)   FNR (d)   5.65%             6.58%         11.62%
PDXomics                FPR       2.72%             3.06%         2.77%
PDXomics                FNR       5.09%             6.66%         12.67%


  • (a) PT denotes the primary tumor and XT denotes the xenografted tumor.
  • (b) The standard pipeline is the bioinformatics analysis used for conventional human tumor tissue samples, without a filtering step.
  • (c) FPR, the false-positive rate, is calculated as the number of differing genotypes in the target region of the mixed sample divided by the number of SNPs in the target region of the human sample.
  • (d) FNR, the false-negative rate, is calculated as the number of differing genotypes in the target region of the human sample divided by the number of SNPs in the target region of the human sample.
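
The two rates above can be sketched in a few lines of Python. This is only one possible reading of the definitions, not the calculation used in the internal test: it assumes that a "differing genotype in the mixed sample" is a xenograft call that is absent from, or disagrees with, the primary tumor, and that a "differing genotype in the human sample" is a primary tumor call that the xenograft fails to reproduce. The function name `error_rates` and the dictionary layout are hypothetical. Under this reading the FPR can exceed 100%, as in the table, because the xenograft may carry more spurious calls than the primary tumor has SNPs.

```python
from typing import Dict, Tuple

Site = Tuple[str, int]  # (chromosome, position) within the target region


def error_rates(primary: Dict[Site, str],
                xenograft: Dict[Site, str]) -> Tuple[float, float]:
    """Return (FPR, FNR) under the assumed reading described in the text.

    Both arguments map a SNP site to its called genotype. This layout is
    hypothetical and is not the format used by PDXomics.
    """
    if not primary:
        raise ValueError("primary tumor sample has no SNP calls")

    # Assumed false positives: xenograft calls missing from, or
    # disagreeing with, the primary tumor.
    false_pos = sum(1 for site, gt in xenograft.items()
                    if primary.get(site) != gt)

    # Assumed false negatives: primary tumor calls that the xenograft
    # does not reproduce with the same genotype.
    false_neg = sum(1 for site, gt in primary.items()
                    if xenograft.get(site) != gt)

    n_snps = len(primary)  # shared denominator: SNPs in the human sample
    return false_pos / n_snps, false_neg / n_snps


if __name__ == "__main__":
    pt = {("chr1", 100): "A/G", ("chr1", 200): "C/T"}
    xt = {("chr1", 100): "A/G", ("chr2", 5): "G/G",
          ("chr2", 9): "T/T", ("chr3", 1): "A/A"}
    fpr, fnr = error_rates(pt, xt)
    print(f"FPR = {fpr:.2%}, FNR = {fnr:.2%}")  # FPR = 150.00%, FNR = 50.00%
```

Note that in this sketch a site called in both samples with different genotypes counts toward both rates; whether the original analysis did the same is not stated.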


Filtering Strategies

If the genome of the host mouse is very similar to the Mus musculus reference genome, we use mm9 as the mouse reference to filter out contamination from mouse genomic DNA. Otherwise, the host-mouse genome sequence itself is required to filter out reads from host-mouse genomic DNA, so a reference must first be assembled. About 30-fold whole-genome resequencing data (about 90 Gb) are generated and assembled into contigs with the SOAPdenovo software, and these contigs serve as the reference. A contig is a contiguous sequence of bases constructed by aligning overlapping reads and building a consensus. (A de novo assembly of the host mouse genome is the best approach when cost is not an issue.)
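
As a rough check on the data volume quoted above, the relationship between fold coverage and sequencing yield can be sketched as follows. The mouse genome size of about 2.7 Gb and the 100 bp read length are assumptions made for illustration and do not come from this page; the helper names are hypothetical.

```python
# Approximate mouse genome size in base pairs (assumption, about 2.7 Gb).
MOUSE_GENOME_BP = 2_700_000_000


def required_bases(genome_size_bp: int, fold_coverage: int) -> int:
    """Total bases needed to reach the requested average coverage."""
    return genome_size_bp * fold_coverage


def required_reads(total_bases: int, read_length_bp: int) -> int:
    """Number of reads of a given length needed, rounded up."""
    return -(-total_bases // read_length_bp)  # integer ceiling division


if __name__ == "__main__":
    bases = required_bases(MOUSE_GENOME_BP, 30)
    print(f"{bases / 1e9:.0f} Gb")  # 81 Gb
    # 100 bp is a hypothetical read length, used only for illustration.
    print(required_reads(bases, 100))  # 810000000
```

Under these assumptions, 30-fold coverage corresponds to roughly 81 Gb of sequence, which is in line with the figure of about 90 G given above.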

To ensure that human reads located in mouse-human syntenic regions are not discarded as mouse reads, a dedicated analysis pipeline, PDXomics, which uses proprietary algorithms, is applied to the xenografted tumor sequencing data. The sequencing reads are mapped directly to a mixed reference set containing the human reference (hg19) and the mouse reference (mm9 or the assembled mouse contigs). With this algorithm, we can separate the reads that align preferentially to the human reference from those that align preferentially to the mouse reference. In our experience, a fraction of the reads, about 0.5-1% of the total, cannot be assigned to either species. Only the reads that align preferentially to the human reference are carried forward to routine bioinformatics analysis, genome-wide variant identification, or gene expression profiling analysis.
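
The PDXomics algorithms are proprietary and are not reproduced here. The general idea of assigning reads after mapping to a mixed reference can nevertheless be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the contig prefixes `hg19_` and `mm9_`, the `(read_id, contig, score)` input layout, the `classify_reads` function, and the simple best-score comparison with a `min_margin` threshold.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical contig naming in the mixed reference set.
HUMAN_PREFIX = "hg19_"
MOUSE_PREFIX = "mm9_"

Alignment = Tuple[str, str, int]  # (read_id, contig_name, alignment_score)


def classify_reads(alignments: Iterable[Alignment],
                   min_margin: int = 1) -> Dict[str, str]:
    """Label each read 'human', 'mouse', or 'ambiguous'.

    A read is assigned to a species when its best alignment score there
    beats the best score in the other species by at least min_margin.
    This is a generic illustration, not the PDXomics method.
    """
    best: Dict[str, Dict[str, Optional[int]]] = defaultdict(
        lambda: {"human": None, "mouse": None})

    for read_id, contig, score in alignments:
        if contig.startswith(HUMAN_PREFIX):
            species = "human"
        elif contig.startswith(MOUSE_PREFIX):
            species = "mouse"
        else:
            continue  # contig from neither reference: ignore the hit
        current = best[read_id][species]
        if current is None or score > current:
            best[read_id][species] = score

    labels: Dict[str, str] = {}
    for read_id, scores in best.items():
        human, mouse = scores["human"], scores["mouse"]
        if mouse is None:
            labels[read_id] = "human"
        elif human is None:
            labels[read_id] = "mouse"
        elif human - mouse >= min_margin:
            labels[read_id] = "human"
        elif mouse - human >= min_margin:
            labels[read_id] = "mouse"
        else:
            labels[read_id] = "ambiguous"  # cannot be distinguished
    return labels


if __name__ == "__main__":
    hits = [("r1", "hg19_chr1", 60), ("r1", "mm9_chr4", 40),
            ("r2", "mm9_chr2", 55),
            ("r3", "hg19_chr7", 50), ("r3", "mm9_chr5", 50)]
    labels = classify_reads(hits)
    human_reads = [r for r, lab in labels.items() if lab == "human"]
    print(labels)       # r1 human, r2 mouse, r3 ambiguous
    print(human_reads)  # only these would go on to downstream analysis
```

In this sketch, reads with equally good human and mouse alignments are labeled ambiguous and set aside, mirroring the small fraction of reads described above as indistinguishable.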