Current research in the Bioinformatics Group covers the design, development and assessment of computational tools to explore data in all these categories. To stay well grounded in application areas, we collaborate with biologists to study the practical usefulness of the methods we develop. Here is a brief overview of some of our current research interests.
- Biocomputation and nanocomputation by self-assembly
- Comparative genomics
- Genome analysis including statistical methods for gene prediction
- Inference of inheritance patterns of mutations (called haplotype inference)
- Knowledge inference from biomedical literature
- Mass spectrometry data analysis
- Protein function prediction
- Protein structure prediction, including complete 3D structures and binding sites
- Software and theory of homology search and motif discovery
At the risk of oversimplification, we may view bioinformatics data as dealing with sequence, structure and function. The Bioinformatics Group has made a particularly strong impact in the area of sequence analysis. This includes both theoretical studies and the development of application software to process sequences. For example, the PatternHunter program [Bin Ma, J. Tromp, Ming Li, PatternHunter: Faster and more sensitive homology search. Bioinformatics, 18:3(2002), 440–5] was used in the initial sequencing and comparative analyses of the mouse genome [many authors, including Dan Brown, Ming Li and Bin Ma, Nature, December 5, 2002] and the rat genome [many authors, including Bin Ma].
Determining the structure of a protein provides important information on its function and interactions, and is crucial for a full understanding of the role played by the protein within a cell.
Current research involves development of methods to predict protein structure from amino acid sequence, using a combination of computational and biochemical techniques. Two computational aspects of this problem are the development of scoring functions capable of recognizing correctly folded proteins, and the development of search algorithms for exploring the folding space of proteins. A key achievement in this area is the development of a scoring function able to accurately recognize native protein structures (McConkey et al., Proc. Natl. Acad. Sci. U.S.A., Vol. 100, no. 6, Mar. 18, 2003, pp. 3215-3220). Another significant accomplishment has been the success of the RAPTOR program, which was ranked top among individual automatic protein 3D structure prediction servers at the recent CAFASP3 competition. Information on RAPTOR can be found in J. Xu and M. Li, Assessing RAPTOR's new linear programming approach for fold recognition in CAFASP3. PROTEINS: Structure, Function, and Genetics, 53(S6), Oct. 2003, pp. 579–84.
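The two computational aspects named above, a scoring function and a search algorithm, can be illustrated with a deliberately tiny sketch. It assumes the textbook two-dimensional hydrophobic-polar (HP) lattice model, which is a teaching simplification and not the scoring function of McConkey et al. or the method used by RAPTOR. The function names `fold`, `score` and `best_fold` are hypothetical and chosen only for this example.

```python
from itertools import product

# Unit moves on a 2D square lattice (assumption: toy HP lattice model).
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def fold(moves):
    """Return the lattice coordinates of a chain, or None if it self-intersects."""
    pos = (0, 0)
    coords = [pos]
    for m in moves:
        dx, dy = MOVES[m]
        pos = (pos[0] + dx, pos[1] + dy)
        if pos in coords:
            return None
        coords.append(pos)
    return coords


def score(sequence, coords):
    """Toy scoring function: -1 per pair of H residues that touch on the
    lattice without being neighbours in the chain. Lower is better."""
    index = {c: i for i, c in enumerate(coords)}
    energy = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look right and up only, so that each contact is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index.get((x + dx, y + dy))
            if j is not None and abs(i - j) > 1 and sequence[j] == "H":
                energy -= 1
    return energy


def best_fold(sequence):
    """Toy search algorithm: exhaustive enumeration of self-avoiding walks.
    Feasible only for very short chains."""
    best = None
    for moves in product("UDLR", repeat=len(sequence) - 1):
        coords = fold(moves)
        if coords is None:
            continue
        energy = score(sequence, coords)
        if best is None or energy < best[0]:
            best = (energy, "".join(moves))
    return best


if __name__ == "__main__":
    print(best_fold("HPPH"))
```

Even in this toy setting the number of conformations grows exponentially with chain length, which is why realistic structure prediction needs far more sophisticated search strategies than exhaustive enumeration.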
Modern health and agricultural research requires the high-throughput identification of proteins from biological samples. Mass spectrometry (MS) and tandem mass spectrometry (MS/MS) have become the standard experimental methods for protein identification. The complexity and size of mass spectrometry data rule out manual interpretation. Using a novel algorithm (B. Ma, K. Zhang, C. Liang. An Effective Algorithm for the Peptide De Novo Sequencing from MS/MS Spectrum. JCSS 70, 2005, pp. 418-430), we developed the PEAKS software (B. Ma et al. PEAKS: Powerful Software for Peptide De Novo Sequencing by Tandem Mass Spectrometry. Rapid Communications in Mass Spectrometry 17(20), 2003, pp. 2337–42) for peptide de novo sequencing and protein identification from tandem mass spectrometry data. The software is used worldwide in several hundred research institutes and has become the industry-standard software for peptide de novo sequencing.
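The basic idea behind de novo sequencing can be sketched in a few lines: in an idealized spectrum, the mass difference between consecutive fragment ions of the same type equals the mass of one amino acid residue. The example below is a minimal illustration under strong assumptions (a complete, noise-free ladder of singly charged b-ions) and is not the algorithm used in PEAKS. The function names `b_ions` and `read_ladder` and the tolerance value are hypothetical; the residue masses are the standard monoisotopic values.

```python
# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728  # mass of the charge-carrying proton


def b_ions(peptide):
    """Singly charged b-ion m/z values for prefixes of length 1..n-1."""
    peaks = []
    total = 0.0
    for aa in peptide[:-1]:
        total += RESIDUE_MASS[aa]
        peaks.append(total + PROTON)
    return peaks


def read_ladder(peaks, tolerance=0.01):
    """Read residues from the differences between consecutive peaks.

    Assumes a complete, sorted, noise-free ladder (a strong idealization).
    Residues with indistinguishable masses, such as I and L, are reported
    together; an unexplained gap is reported as "?".
    """
    residues = []
    for low, high in zip(peaks, peaks[1:]):
        diff = high - low
        matches = sorted(
            aa for aa, mass in RESIDUE_MASS.items()
            if abs(mass - diff) <= tolerance
        )
        residues.append("/".join(matches) if matches else "?")
    return residues


if __name__ == "__main__":
    ladder = b_ions("PEAKS")
    print([round(mz, 5) for mz in ladder])
    print(read_ladder(ladder))
```

Real spectra contain noise peaks, missing fragments and several ion types, so practical tools must search over many candidate interpretations and score them rather than read a single clean ladder.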
We have also investigated large-scale duplication in the history of the flowering plant Arabidopsis thaliana (T. Vision, D. Brown, S. Tanksley. The origins of genome duplication in Arabidopsis. Science 290, 2000, pp. 2114-2117), and in the human genome as part of the Human Genome Project, as reported in the original paper announcing the draft human genome sequence (International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 2001, pp. 860–921).