Journal of Applied Bioinformatics & Computational Biology, ISSN: 2329-9533

Editorial, J Appl Bioinform Comput Biol Vol: 1 Issue: 1

New Era for Biocomputing

Momiao Xiong*
Division of Biostatistics, The University of Texas School of Public Health, Houston, TX 77030, USA
Corresponding author: Momiao Xiong
Division of Biostatistics, The University of Texas School of Public Health, Houston, TX 77030, USA
E-mail: [email protected]u
Received: June 18, 2012 Accepted: June 19, 2012 Published: June 21, 2012
Citation: Xiong M (2012) New Era for Biocomputing. J Biocomput 1:1. doi:10.4172/2329-9533.1000e102

Abstract

Faster and cheaper next-generation sequencing (NGS) technologies will generate unprecedentedly massive (thousands or even tens of thousands of individuals) and high-dimensional (ten million or even dozens of millions of dimensions) genomic and epigenomic variation data. These data allow nearly complete evaluation of genomic and epigenomic variation, including common and rare variants, insertions/deletions, copy number variants (CNVs), mRNA by sequencing (RNA-seq), microRNA by sequencing (miRNA-seq), methylation by sequencing (methylation-seq) and ChIP-seq. Analysis of these extremely large and diverse data sets provides powerful tools for comprehensively understanding the genome and epigenome.

Keywords: Biocomputing

Faster and cheaper next-generation sequencing (NGS) technologies will generate unprecedentedly massive (thousands or even tens of thousands of individuals) and high-dimensional (ten million or even dozens of millions of dimensions) genomic and epigenomic variation data. These data allow nearly complete evaluation of genomic and epigenomic variation, including common and rare variants, insertions/deletions, copy number variants (CNVs), mRNA by sequencing (RNA-seq), microRNA by sequencing (miRNA-seq), methylation by sequencing (methylation-seq) and ChIP-seq. Analysis of these extremely large and diverse data sets provides powerful tools for comprehensively understanding the genome and epigenome [1], and holds the promise of shifting the focus of health care from disease to wellness, in which enormous amounts of personal data are recorded and each individual’s wellness status is monitored [2]. However, the volume and complexity of sequence data in genomics and epigenomics, and of health care data measured in real time, have begun to outpace the computing infrastructure used to process and store genomic, epigenomic and health-monitoring information [3,4]. The emergence of NGS technologies also poses great computational challenges in storing, transferring and analyzing large volumes of sequencing data [5,6] in comparative genomics [7], genome assembly and sequence analysis [8], metagenomics [9,10], proteomics [11], genetic studies of complex diseases [12-14] and biomedical image analysis [15].
Innovative approaches should be developed to address these challenges. One option is to develop novel algorithms and methods for the new types of data. For example, the genomic and epigenomic data generated by NGS technologies first demand a change in the concept of the genome. As Haldane [16] and Fisher [17] recognized in the last century, the genome can be modeled as a continuum. Specifically, the genome is not purely a collection of independently segregating sites; it is transmitted not in points but in segments. Instead of modeling the genome as a few separate loci, we can model it as a continuum in which the observed genetic variant function is viewed as a realization of a stochastic process along the genome and is modeled as a function of genomic location. This view enriches the information on genetic variation across the genome. The new data also demand a paradigm shift in genomic and epigenomic data analysis: from standard multivariate data analysis to functional data analysis [18,19], from low-dimensional to high-dimensional data analysis [20,21], from independent sampling to dependent sampling [22], and from the analysis of single genomic or epigenomic variants to integrated genomic and epigenomic analysis.
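To make the idea of a genetic variant function concrete, the following is a minimal sketch and not the method of the cited references. It assumes genomic positions rescaled to the interval [0, 1], genotypes coded as 0, 1 or 2 copies of the minor allele, and a Gaussian kernel smoother with a hand-picked bandwidth. The data and the name `genotype_function` are hypothetical.

```python
import math


def genotype_function(positions, genotypes, bandwidth):
    """Return x(t), a Gaussian-kernel weighted average of the genotype values.

    positions : variant locations rescaled to [0, 1] (assumed)
    genotypes : minor-allele counts (0, 1 or 2) at those locations
    bandwidth : kernel width, chosen by hand in this illustration
    """
    def x(t):
        # Weight each variant by its distance from location t.
        weights = [math.exp(-0.5 * ((t - p) / bandwidth) ** 2) for p in positions]
        total = sum(weights)
        return sum(w * g for w, g in zip(weights, genotypes)) / total

    return x


# Hypothetical individual with five variants in one genomic region.
positions = [0.05, 0.20, 0.22, 0.60, 0.90]
genotypes = [0, 1, 2, 0, 1]

x = genotype_function(positions, genotypes, bandwidth=0.05)

# Evaluate the continuous profile on a regular grid of locations.
grid = [i / 10 for i in range(11)]
profile = [x(t) for t in grid]
```

The individual is now described by a curve over genomic location rather than by five separate loci, which is the kind of object that functional data analysis takes as input.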
However, as Schatz et al. [3] pointed out, algorithmic breakthroughs, like scientific breakthroughs, do not happen very often and are difficult to plan. A practical solution is to employ the power of parallel computing. Parallelism is the future of computing. Two popular types of parallel computing are cloud computing and GPU (graphics processing unit) computing.
The cloud is a metaphor for the Internet, and cloud computing is a type of Internet-based computing in which users access computational resources from a vendor over the Internet [3]. The cloud is built on virtualization technology [23], which divides a server’s hardware resources into multiple “computer devices”, each running its own operating system in isolation from the others and presented to the user as an entirely separate computer. A typical cloud computing job begins by uploading data to cloud storage, runs the computations on a cluster of virtual machines, writes the results to cloud storage and finally downloads the results to the user’s local computer. Because the pool of computational resources available ‘in the cloud’ is huge, enough computational power is available to analyze large amounts of data. Cloud computing has been applied to manage the deluge of ‘big sequence data’ in the 1000 Genomes Project [6], comparative genomics [7], ChIP-seq data analysis [24], translational medicine [25], transcriptome analysis [26] and disease risk management [27].
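The four stages of a typical cloud job (upload, compute, write results, download) can be sketched locally. This is only an illustration under loud assumptions: a Python dictionary stands in for cloud storage, a thread pool stands in for the cluster of virtual machines, and the GC-content task and the names `gc_count` and `run_workflow` are hypothetical. No real vendor API is used.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_count(chunk):
    """Task run on one 'virtual machine': count G/C bases and total bases."""
    gc = sum(seq.count("G") + seq.count("C") for seq in chunk)
    total = sum(len(seq) for seq in chunk)
    return gc, total


def run_workflow(reads, n_workers=4):
    storage = {}  # stand-in for cloud storage (assumption)

    # 1. Upload: split the reads into one chunk per worker and store them.
    storage["input"] = [reads[i::n_workers] for i in range(n_workers)]

    # 2. Compute: each worker processes its own chunk independently.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial = list(pool.map(gc_count, storage["input"]))

    # 3. Write the combined result back to storage.
    gc = sum(p[0] for p in partial)
    total = sum(p[1] for p in partial)
    storage["output"] = gc / total if total else 0.0

    # 4. Download: return the result to the caller's local environment.
    return storage["output"]


gc_fraction = run_workflow(["GGCC", "ATAT"])
```

In a real deployment the upload and download steps move data over the Internet, so for large sequencing data sets they can take a substantial share of the total time.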
Although cloud computing provides a powerful solution for big data analysis, it also has limitations. Cloud computing requires large data transfers over the Internet and raises data privacy and security issues. GPU computing is complementary to cloud computing [28]. A GPU exploits both the task parallelism and the data parallelism of an application. It divides the resources of the processor in space: the output of the part of the processor working on one stage is fed directly into the input of a different part that works on the next stage. The hardware in any given stage can exploit data parallelism, in which multiple elements are processed at the same time. The highly parallel GPU has higher memory bandwidth and more computing power than central processing units (CPUs). The GPU follows a single-program, multiple-data (SPMD) programming model and processes many elements in parallel using the same program. It consists of hundreds of smaller cores that work together to deliver high computing performance. GPU computing is gaining momentum in biomedical research. It has been applied to network analysis [29], RNA secondary structure prediction [30], gene-gene interaction analysis [13,14,31], biological pathway simulation [32], sequence analysis [33], gene prediction [34], motif identification [35], metagenomics, protein analysis [36] and molecular dynamics simulations [37].
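The SPMD model can be illustrated with a sequential emulation. This sketch does not run on a GPU; it only shows the programming pattern, in which one kernel function is written for a single element and is then executed once per thread index. The names `hamming_kernel` and `launch` and the example reads are hypothetical.

```python
def hamming_kernel(tid, reads, query, out):
    """Work of one logical thread: compare read number `tid` with the query."""
    read = reads[tid]
    out[tid] = sum(1 for a, b in zip(read, query) if a != b)


def launch(kernel, n_threads, *args):
    """Sequential stand-in for a GPU launch.

    A GPU would run these calls concurrently on many cores; here the same
    kernel is simply called once for every thread index.
    """
    for tid in range(n_threads):
        kernel(tid, *args)


reads = ["ACGT", "ACGA", "TCGA", "TTTT"]
query = "ACGT"
mismatches = [0] * len(reads)

# Same program, different data: each thread index handles one read.
launch(hamming_kernel, len(reads), reads, query, mismatches)
```

Because each thread writes only to its own output slot, the calls are independent of one another, which is what allows a GPU to process many elements at the same time.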
NGS technologies raise great expectations for new genomic and epigenomic knowledge that will translate into meaningful therapeutics and insights into health, but immense biomedical complexity leaves clinically meaningful discoveries hidden within a deluge of high-dimensional data and a large number of analyses. To overcome the serious limitations of the current paradigm and statistical methods for genomic and epigenomic data analysis, we need to develop a new analytic paradigm and novel statistical methods, and to explore the power of parallel computing for sequence-based genomic and epigenomic data analysis. The Journal of Biocomputing provides an excellent platform for presenting new algorithms and communicating novel ideas within the biocomputing community. We can expect that the emergence of NGS technologies, new developments in parallel computing and the publication of the Journal of Biocomputing will stimulate the development of innovative algorithms and novel paradigms for big genomic, epigenomic and clinical data analysis, and open a new era for biocomputing.

References

  1. Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, et al. (2011) Integrative genomics viewer. Nat Biotechnol 29: 24-26.

  2. Hood L (2011) Nat Biotechnol 29: 191.

  3. Schatz MC, Langmead B, Salzberg SL (2010) Cloud computing and the DNA data race. Nat Biotechnol 28: 691-693.

  4. Pechette JM (2012) Transforming health care through cloud computing. Health Care Law Mon 2012: 2-12.

  5. Jeon YJ, Park SH, Ahn SM, Hwang HJ (2011) SOLiDzipper: A High Speed Encoding Method for the Next-Generation Sequencing Data. Evol Bioinform Online 7: 1-6.

  6. Waltz E (2012) 1000 genomes on Amazon's cloud. Nat Biotechnol 30: 376.

  7. Wall DP, Kudtarkar P, Fusaro VA, Pivovarov R, Patil P, et al. (2011) Cloud computing for comparative genomics. BMC Bioinformatics 11: 259.

  8. Jourdren L, Bernard M, Dillies MA, Le Crom S (2012) Eoulsan: a cloud computing-based framework facilitating high throughput sequencing analyses. Bioinformatics 28: 1542-1543.

  9. Angiuoli SV, White JR, Matalka M, White O, Fricke WF (2011) Resources and costs for microbial sequence analysis evaluated using virtual machines and cloud computing. PLoS One 6: e26624.

  10. Suzuki S, Ishida T, Kurokawa K, Akiyama Y (2012) GHOSTM: A GPU-Accelerated Homology Search Tool for Metagenomics. PLoS One 7: e36060.

  11. Halligan BD, Geiger JF, Vallejos AK, Greene AS, Twigger SN (2009) Low cost, scalable proteomics data analysis using Amazon’s cloud computing services and open source search algorithms. J Proteome Res 8: 3148-3153.

  12. Greenbaum D, Gerstein M (2011) The role of cloud computing in managing the deluge of potentially private genetic data. Am J Bioeth 11: 39-41.

  13. Chikkagoudar S, Wang K, Li M (2011) GENIE: a software package for gene-gene interaction analysis in genetic association studies using multiple GPU or CPU cores. BMC Res Notes 4: 158.

  14. Yung LS, Yang C, Wan X, Yu W (2011) GBOOST: a GPU-based tool for detecting gene-gene interactions in genome-wide case control studies. Bioinformatics 27: 1309-1310.

  15. Smith DS, Gore JC, Yankeelov TE, Welch EB (2012) Real-Time Compressive Sensing MRI Reconstruction Using GPU Computing and Split Bregman Methods. Int J Biomed Imaging 2012: 864827.

  16. Haldane JBS (1919) The combination of linkage values, and the calculation of distance between the loci of linked factors. J Genet 8: 299-309.

  17. Fisher RA (1965) The theory of inbreeding. Edinburgh and London: Oliver & Boyd Ltd.

  18. Ramsay JO et al. (2005) Functional data analysis. New York: Springer.

  19. Luo L, Boerwinkle E, Xiong M (2011) Association studies for next-generation sequencing. Genome Res 21: 1099-1108.

  20. Izenman AJ (2008) Modern multivariate statistical techniques: regression, classification and manifold learning. 734.

  21. Roweis ST, Saul LK (2000) Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science 290: 2323-2326.

  22. Zhu Y, Xiong M (2012) Family-Based Association Studies for Next-Generation Sequencing. Am J Human Genet 90: 1028-1045.

  23. Fusaro VA, Patil P, Gafni E, Wall DP, Tonellato PJ (2011) Biomedical cloud computing with Amazon Web Services. PLoS Comput Biol 7: e1002147.

  24. Feng X, Grossman R, Stein L (2011) PeakRanger: a cloud-enabled peak caller for ChIP-seq data. BMC Bioinformatics 12: 139.

  25. Dudley JT, Pouliot Y, Chen R, Morgan AA, Butte AJ (2010) Translational bioinformatics in the cloud: an affordable alternative. Genome Med 2: 51.

  26. Chouvarine P, Cooksey AM, McCarthy FM, Ray DA, Baldwin BS (2012) Transcriptome-based differentiation of closely-related Miscanthus lines. PLoS One 7: e29850.

  27. Cheng KC, Hinton DE, Mattingly CJ, Planchart A (2011) Aquatic models, genomics and chemical risk management. Comp Biochem Physiol C Toxicol Pharmacol 155: 169-173.

  28. Owens JD, Houston M, Luebke D, Green S, Stone JE, et al. (2008) GPU computing. Proceedings of the IEEE 96: 879-899.

  29. Shi Z, Zhang B (2011) Fast network centrality analysis using GPUs. BMC Bioinformatics 12: 149.

  30. Lei G, Dou Y, Wan W, Xia F, Li R, et al. (2012) CPU-GPU hybrid accelerating the Zuker algorithm for RNA secondary structure prediction applications. BMC Genomics 13: S14.

  31. Hemani G, Theocharidis A, Wei W, Haley C (2011) EpiGPU: exhaustive pairwise epistasis scans parallelized on consumer level graphics cards. Bioinformatics 27: 1462-1465.

  32. Chalkidis G, Nagasaki M, Miyano S (2011) High performance hybrid functional Petri net simulations of biological pathway models on CUDA. IEEE/ACM Trans Comput Biol Bioinform 8: 1545-1556.

  33. Liu CM, Wong T, Wu E, Luo R, Yiu SM, et al. (2012) SOAP3: ultra-fast GPU-based parallel alignment tool for short reads. Bioinformatics 28: 878-879.

  34. Rivard SR, Mailloux JG, Beguenane R, Bui HT (2012) Design of high-performance parallelized gene predictors in MATLAB. BMC Res Notes 5: 183.

  35. Zandevakili P, Hu M, Qin Z (2012) GPUmotif: An Ultra-Fast and Energy-Efficient Motif Analysis Program Using Graphics Processing Units. PLoS One 7: e36865.

  36. Blazewicz J, Frohmberg W, Kierzynka M, Pesch E, Wojciechowski P (2011) Protein alignment algorithms with an efficient backtracking routine on multiple GPUs. BMC Bioinformatics 12: 181.

  37. Le L, Lee EH, Hardy DJ, Truong TN, Schulten K (2010) Molecular dynamics simulations suggest that electrostatic funnel directs binding of Tamiflu to influenza N1 neuraminidases. PLoS Comput Biol 6: e1000939.
