Genomes to Life Contractor-Grantee Workshop II
February 29-March 2, 2004, Washington, D.C.
Genomics:GTL Program Projects
Sandia National Laboratories
Carbon Sequestration in Synechococcus
From Molecular Machines to Hierarchical Modeling
11
Modeling Cellular Response
Mark D. Rintoul (rintoul@sandia.gov), Steve Plimpton, Alex Slepoy, and Shawn Means
Sandia National Laboratories, Albuquerque, NM
While much of the fundamental research on prokaryotes focuses on specific molecular mechanisms within the cell, the aggregate cellular response is also important to practical problems of interest. In this poster, we present results for two computational models of cellular response that take spatial effects into consideration. The first model is a discrete particle code in which particles diffuse and interact via Monte Carlo rules so that species concentrations track chemical rate equations. This type of model is relevant when only a small number of particles interact, so that spatial and temporal fluctuations in particle number can play a significant role in cellular response. Results with this code are shown for a simulation of the carbon sequestration process in Synechococcus WH8102. The second model uses continuum modeling of carbon concentrations inside and outside the cell, in an effort to understand carbon transport by Synechococcus within a fluid-dynamic marine environment. It is based on solving partial differential equations on a realistic geometry using finite element methods.
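The idea behind the first model can be illustrated with a minimal sketch. It is not the Sandia particle code: the one-dimensional periodic lattice, the single first-order reaction A -> B, and every parameter value below are assumptions chosen only to show how a per-particle Monte Carlo acceptance test makes the particle count follow a rate equation while still fluctuating around it.

```python
import random

def simulate(n_particles=2000, n_sites=50, k=0.05, dt=1.0, steps=20, seed=1):
    """Random-walk particles on a periodic 1D lattice (hypothetical setup).

    Each particle of species A converts to B with probability k*dt per step,
    which is the Monte Carlo counterpart of the rate equation dA/dt = -k*A.
    Returns the number of A particles after each step.
    """
    rng = random.Random(seed)
    # Each particle is stored as [site index, species label].
    particles = [[rng.randrange(n_sites), "A"] for _ in range(n_particles)]
    history = []
    for _ in range(steps):
        for p in particles:
            # Diffusion: hop one site left or right with equal probability.
            p[0] = (p[0] + rng.choice((-1, 1))) % n_sites
            # Reaction: accept the conversion with probability k*dt.
            if p[1] == "A" and rng.random() < k * dt:
                p[1] = "B"
        history.append(sum(1 for p in particles if p[1] == "A"))
    return history

history = simulate()
# Discrete-time prediction of the rate equation for comparison.
expected = 2000 * (1 - 0.05) ** 20
print(history[-1], round(expected, 1))
```

With few particles the simulated count scatters noticeably around the rate-equation value, which is the regime the abstract identifies as relevant.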
12
The Synechococcus Encyclopedia
Nagiza F. Samatova1, Al Geist1 (gst@ornl.gov), Praveen Chandramohan1, Ramya Krishnamurthy1, Gong-Xin Yu1, and Grant Heffelfinger2
1Oak Ridge National Laboratory, Oak Ridge, TN and 2Sandia National Laboratories, Albuquerque, NM
Synechococcus sp. are abundant marine cyanobacteria known to be important to global carbon fixation. Although the genome sequencing of Synechococcus sp. has been completed by the DOE JGI1, the actual biochemical mechanisms of carbon fixation and their genomic basis are poorly understood. This topic is under both experimental and computational investigation by several projects, including Dr. Brian Palenik’s DOE MCP project, the SNL/ORNL GTL Center2, and others. These projects have been generating heterogeneous data (e.g., sequence, structure, biochemical, physiological, and genetic data) distributed across various institutions. Integrative analysis of these data will yield major insights into the carbon sequestration behavior of Synechococcus sp. However, such analysis is largely hampered by the lack of a knowledgebase system that enables efficient access to, management and curation of, and computation with these data, as well as comparative analysis with other microbial genomes. To fulfill these requirements, the Synechococcus Encyclopedia is being created as part of the SNL/ORNL GTL Center.
The completed sequencing of Synechococcus sp. provides a reference axis upon which any type of annotation can be layered. Not only can genomic features such as genes and repeats be placed upon such a reference, but it is also possible to map a variety of other data such as operons and regulons, mutations, phenotypes, gene expression and interaction measurements (e.g., microarray, phage display, mass spec, 2-hybrid), pathway models, protein interactions, and structures. A major benefit of such feature mapping is that all of these annotations can be cross-referenced to one another. The Synechococcus Encyclopedia takes advantage of this fact to allow users to view and track a variety of biological information associated with the genome and to enable complex queries across multiple data types.
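The cross-referencing described above can be pictured with a small sketch. It does not reflect the Encyclopedia's actual data model: the `Feature` record, the feature names, and the coordinates are hypothetical, and the only point is that annotations of different kinds become comparable once they share coordinates on one reference sequence.

```python
from collections import namedtuple

# Hypothetical record: any annotation type placed on the same coordinate axis.
Feature = namedtuple("Feature", "kind name start end")

features = [
    Feature("gene", "geneA", 100, 900),
    Feature("gene", "geneB", 950, 1800),
    Feature("operon", "operon1", 100, 1800),
    Feature("repeat", "rep1", 2500, 2600),
]

def overlapping(features, start, end, kinds=None):
    """Return features overlapping [start, end], optionally filtered by kind."""
    return [
        f for f in features
        if f.start <= end and f.end >= start
        and (kinds is None or f.kind in kinds)
    ]

# Query a genome region across all annotation types at once.
hits = overlapping(features, 800, 1000)
names = sorted(f.name for f in hits)
print(names)
```

A linear scan is enough for illustration; a genome-scale service would more plausibly use an indexed database or an interval tree.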
In order to make the exploration and in-depth analysis of genome information easier, one needs appropriate ways to browse and query the corresponding data. The World Wide Web interface of the Synechococcus Encyclopedia was built up with these specifications in mind. It offers a number of ways to retrieve information about a genome. For instance, an advanced search capability is available to combine several search criteria and retrieve detailed information about any intricate features. Moreover, the search can be restricted to a genome region of interest, molecular function, biological process, or cellular component.
To ease data retrieval, all output reports are presented in tabular format and may be conveniently downloaded as tab-delimited text. If desired, the tables can be easily customized, e.g., by adding or removing features or by changing the sort order according to several data fields. Data are accessible through a variety of interactive graphical viewers. Furthermore, the retrieved data entries can be explored further by launching complex analysis tools or by linking to other data collections such as Swiss-Prot, Pfam, InterPro, and PDB.
The Synechococcus Encyclopedia comprises information at various levels: genome sequence, structure, regulation, protein interactions, systems biology. For example, at the genome sequence and annotation level, it includes the protein- and RNA-coding genes, Pfam domains, Blocks motifs, InterPro signatures, and COG- and KEGG-based functional assignments. At the structure level, it presents secondary and tertiary structural models predicted by PROSPECT3 and other tools, SCOP-based functional assignments, and FSSP profiles for homologous protein sequences. At the regulation level, it integrates data about promoters, transcription factors, and pathway models as well as microarray data.
The Synechococcus Encyclopedia is accessible at http://www.genomes-to-life.org/syn_wh. The generic data model and the data integration, search, and retrieval engine on which it is based make it possible to set up similar knowledgebases for other bacterial species. New functionalities for multi-genome integration and comparative analysis of genomes are being developed to facilitate better understanding of genomic organization and biological function. Moreover, other features such as annotation and curation services, data provenance, and security will be added in the near future.
References
- http://genome.ornl.gov/microbial/syn_wh
- Grant Heffelfinger (gsheffe@sandia.gov) and Al Geist (gst@ornl.gov); http://www.genomes2life.org
- Ying Xu (xyn@ornl.gov) and Dong Xu (xudong@missouri.edu); http://compbio.ornl.gov/structure/prospect/
13
Carbon Sequestration in Synechococcus: A Computational Biology Approach to Relate the Genome to Ecosystem Response
Grant S. Heffelfinger (gsheffe@sandia.gov)
Sandia National Laboratories, Albuquerque, NM
This talk will provide an update on the progress to date of the Genomics:GTL (GTL) project led by Sandia National Laboratories: “Carbon Sequestration in Synechococcus Sp.: From Molecular Machines to Hierarchical Modeling.” This effort is focused on developing, prototyping, and applying new computational tools and methods to elucidate the biochemical mechanisms of carbon sequestration in Synechococcus Sp., an abundant marine cyanobacterium known to play an important role in the global carbon cycle. Our project includes both an experimental investigation and significant computational efforts to develop and prototype new computational biology tools. Several elements of this effort will be discussed, including the development of new methods for high-throughput discovery and characterization of protein-protein complexes and novel capabilities for inference of regulatory pathways in microbial genomes across multiple sources of information. Our progress in developing new computational systems-biology methods for understanding the carbon fixation behavior of Synechococcus at different levels of resolution, from the cellular level to the ecosystem, will also be discussed. More information about our project and partners can be found at www.genomes-to-life.org.
14
Improving Microarray Analysis with Hyperspectral Imaging, Experimental Design, and Multivariate Data Analysis
David M. Haaland1 (dmhaala@sandia.gov), Jerilyn A. Timlin1, Michael B. Sinclair1, Mark H. Van Benthem1, Michael R. Keenan1, Edward V. Thomas1, M. Juanita Martinez2, Margaret Werner-Washburne2, Brian Palenik3, and Ian Paulsen4
1Sandia National Laboratories, Albuquerque, NM; 2University of New Mexico, Albuquerque, NM; 3Scripps Institution of Oceanography, La Jolla, CA; and 4The Institute for Genomic Research, Rockville, MD
At Sandia National Laboratories, we are combining hyperspectral microarray scanning, efficient experimental designs, and a variety of new multivariate analysis approaches to improve the quality of data and the information content obtained from microarray experiments. Our approach is designed to impact the Sandia-led GTL team’s investigation of Synechococcus for carbon sequestration. Current commercial microarray scanners use univariate methods to quantify a small number of dyes on printed microarray slides. We have developed a new hyperspectral microarray scanning system that offers higher throughput for each microarray slide by allowing the quantitation of a large number of dyes on each slide. The new scanner has demonstrated improved accuracy, precision, and reliability in quantifying dyes on microarrays and yields a higher dynamic range than possible with current commercial scanners. We will present the design of the new scanner, which collects the entire fluorescence spectrum from each pixel of the scanned microarray, and the use of multivariate curve resolution (MCR) algorithms to obtain pure emission spectra and corresponding concentration maps from the hyperspectral image data. The new scanner has allowed us to detect contaminating autofluorescence that emits at the same wavelengths as the reporter fluorophores on microarray slides. With the new scanner, we are able to generate relative concentration maps of the background, impurity, and fluorescent labels at each pixel of the image. Since the MCR generated concentration maps of the fluorescent labels are unaffected by the presence of background and impurity emissions, the accuracy and useful dynamic range of the gene expression data are both greatly improved. We will also demonstrate that the new scanner helps us understand a variety of artifacts that have been observed with microarrays scanned using two-color scanners. 
Artifacts include high background intensities, “black holes,” dye separation, the presence of unincorporated dye, and contaminants that have led to the practice of intensity-dependent normalizations.
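Why spectrally resolved data allow the label signal to be separated from background can be shown with a deliberately simplified sketch. It is not MCR: MCR estimates the pure spectra and the concentrations together, whereas the code below assumes the two pure spectra are already known and solves only the least-squares concentration step for a single pixel. The five-channel spectra and the amounts are hypothetical.

```python
def unmix_two(pixel, s1, s2):
    """Least-squares amounts (c1, c2) such that pixel ~ c1*s1 + c2*s2.

    Solves the 2x2 normal equations with Cramer's rule.
    """
    a11 = sum(x * x for x in s1)
    a12 = sum(x * y for x, y in zip(s1, s2))
    a22 = sum(y * y for y in s2)
    b1 = sum(x * p for x, p in zip(s1, pixel))
    b2 = sum(y * p for y, p in zip(s2, pixel))
    det = a11 * a22 - a12 * a12
    if det == 0:
        raise ValueError("spectra are collinear and cannot be separated")
    c1 = (b1 * a22 - a12 * b2) / det
    c2 = (a11 * b2 - a12 * b1) / det
    return c1, c2

# Hypothetical pure spectra over five wavelength channels.
dye = [0.0, 0.2, 1.0, 0.4, 0.1]          # reporter fluorophore emission
background = [0.5, 0.5, 0.5, 0.5, 0.5]   # broad contaminating emission

# Synthetic pixel: 3 units of dye plus 2 units of background.
pixel = [3.0 * d + 2.0 * b for d, b in zip(dye, background)]
c_dye, c_bg = unmix_two(pixel, dye, background)
print(round(c_dye, 6), round(c_bg, 6))
```

A scanner that integrates all five channels into one intensity would report the sum of both contributions, while the spectral fit recovers the dye amount on its own.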
We will describe statistically designed microarray experimental approaches that we have used to identify and eliminate experimental error sources in the microarray technology. These statistically designed experiments have led to dramatic improvements in the quality and reproducibility of yeast microarray experiments. The lessons learned from yeast arrays will be applied directly to our GTL investigations of Synechococcus microarrays. In addition, new approaches with multivariate algorithms that incorporate error covariance of the arrays into the multivariate analysis of microarrays will be presented along with methods to evaluate the relative performance of various gene selection, classification, and multivariate fitting algorithms.
Evaluation of hybridization experiments with initial 250-gene Synechococcus microarrays will be presented along with graphical methods designed to facilitate understanding of the quality and repeatability of the Synechococcus microarray data. Work has recently begun on the whole-genome Synechococcus microarray experiments. cDNA arrays of 2496 genes from Synechococcus have been printed on glass slides. We will present the unique design of the whole-genome arrays, which includes six replicates for each gene, each printed with a different pin, to capture the true within-array repeatability of gene expression. The arrays also include multiple positive controls, negative controls, blanks, and solvent spots in each block of the whole-genome microarray. In addition, Arabidopsis promoter 70mers were printed on the four corners of each block to assist in positioning and reading the array. If available at the time of the workshop, gene expression results from the whole-genome Synechococcus microarrays will be presented.
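One simple way to summarize within-array repeatability from such a design is the spread of the six replicate spots for each gene. The sketch below is only an illustration: the gene identifiers, the log2 intensities, and the 0.5 cutoff are hypothetical and are not taken from the experiments described here.

```python
import statistics

# Hypothetical log2 intensities: six replicate spots per gene, one per pin.
replicates = {
    "gene_0001": [10.1, 10.3, 9.9, 10.0, 10.2, 10.1],
    "gene_0002": [7.5, 8.9, 6.8, 7.9, 8.4, 7.1],
}

# Mean and sample standard deviation across the replicate spots of each gene.
summary = {
    gene: (statistics.mean(spots), statistics.stdev(spots))
    for gene, spots in replicates.items()
}

# Flag genes whose replicate spread exceeds an arbitrary illustrative cutoff.
noisy = [gene for gene, (_, sd) in summary.items() if sd > 0.5]
print(noisy)
```

Because each replicate comes from a different pin, a large spread for one gene can point to pin or position effects rather than biology.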
Sandia is a multi-program laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under Contract DE-AC04-94AL85000. This work was funded in part by the US Department of Energy’s Genomics:GTL program (genomicsgtl.energy.gov) under the project “Carbon Sequestration in Synechococcus Sp.: From Molecular Machines to Hierarchical Modeling” (http://www.genomes2life.org/). This project was also supported in part by a grant from the W. M. Keck Foundation.
15
Multi-Resolution Functional Characterization of Synechococcus WH8102
Nagiza F. Samatova*1 (samatovan@ornl.gov), Andrea Belgrano9, Praveen Chandramohan1, Pan Chongle1, Paul S. Crozier2, Al Geist1, Damian Gessler9, Andrey Gorin1, Jean-Loup Faulon2, Hashim M. Al-Hashimi7, Eric Jakobsson4, Elebeoba May2, Anthony Martino2, Shawn Means2, Rajesh Munavalli1, George Ostrouchov1, Brian Palenik5, Byung-Hoon Park1, Susan Rempe2, Mark D. Rintoul2, Diana Roe2, Peter Steadman9, Charlie E. M. Strauss3, Jerilyn Timlin2, Gong-Xin Yu1, Maggie Werner-Washburne10, Dong Xu8, Ying Xu6, and Grant Heffelfinger2 (gsheffe@sandia.gov)
*Presenting author
1Oak Ridge National Laboratory, Oak Ridge, TN; 2Sandia National Laboratories, Albuquerque, NM and Livermore, CA; 3Los Alamos National Laboratory, Los Alamos, NM; 4University of Illinois, Urbana-Champaign, IL; 5Scripps Institution of Oceanography, La Jolla, CA; 6University of Georgia, Athens, GA; 7University of Michigan, Ann Arbor, MI; 8University of Missouri, Columbia, MO; 9National Center for Genome Resources, Santa Fe, NM; and 10University of New Mexico, Albuquerque, NM
Although sequencing of multi-megabase regions of DNA has become quite routine, it remains a big challenge to characterize all the segments of DNA sequence with various biological roles, such as encoding proteins and RNA or controlling when and where those molecules are expressed. The primary difficulty is that function exists at many hierarchical levels of description, has temporal and spatial connections that are difficult to manage, and does not correspond to well-defined physical models in the way that biological structures are defined by Cartesian coordinates for atoms. Results from the completed prokaryotic genome sequences show that almost half of the predicted coding regions identified are of unknown biological function. Specifically, the completed sequencing of Synechococcus WH8102 by JGI and the multi-institutional annotation effort1 have resulted in 1196 (out of 2522) ORFs that are annotated as conserved hypothetical or hypothetical.
The goal of this work is to develop a suite of computational tools for systematic multi-resolution functional characterization of microbial genomes and utilize them to elucidate the biochemical mechanisms of the carbon sequestration of Synechococcus sp. as part of the SNL/ORNL GTL Center2. Functional characterization is conducted on many levels and with different questions in mind – ranging from the reconstruction of genome-wide protein-protein interaction networks to detailed studies of the geometry/affinity in a particular complex. Yet as the questions asked at the different levels are often intricately related and interconnected, we are approaching the problem from several directions. Here we outline some of them and provide pointers to more information.
While analysis of a single genome provides tremendous biological insights on any given organism, comparative analysis of multiple genomes can provide substantially more information on the physiology and evolution of microbial species. Comparative studies expand our ability to better assign putative function to predicted coding sequences and our ability to discover novel genes and biochemical pathways. The KeyGeneMiner3 aims to identify “key” genes that are responsible for a given biochemical process of interest. When applied to the oxygenic photosynthetic process, it has discovered 126 genome features. Many of them have been reported in literature as either photosynthesis-related or photosynthesis-specific (occurring only in photosynthetic genomes). Likewise, the construction of comprehensive phylogenetic profiles for all transport proteins in the bacterial genomes4 allows us to pick up regulators of transport and help annotate some genes for which there is still no, or weak, annotation. The approach is not limited to transport proteins; it can be done with any set of probe sequences representing an interesting functional grouping.
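A phylogenetic profile can be represented as a presence/absence vector across genomes, and genes with matching profiles become candidates for a shared function. The sketch below illustrates that general idea only; it is not the KeyGeneMiner or the transport-protein analysis described above, and the gene names, the six-genome profiles, and the mismatch cutoff are hypothetical.

```python
# Hypothetical profiles: 1 if a homolog is present in a genome, 0 otherwise.
profiles = {
    "transporterX": (1, 1, 0, 1, 0, 1),
    "orf_0101":     (1, 1, 0, 1, 0, 1),
    "orf_0202":     (1, 0, 0, 1, 0, 1),
    "orf_0303":     (0, 0, 1, 0, 1, 0),
}

def hamming(p, q):
    """Number of genomes in which two profiles disagree."""
    return sum(a != b for a, b in zip(p, q))

def similar_profiles(probe, profiles, max_mismatches=1):
    """Genes whose profile differs from the probe in at most max_mismatches genomes."""
    ref = profiles[probe]
    return sorted(
        (hamming(ref, prof), gene)
        for gene, prof in profiles.items()
        if gene != probe and hamming(ref, prof) <= max_mismatches
    )

matches = similar_profiles("transporterX", profiles)
print(matches)
```

Any probe set can be substituted for the transporter, which mirrors the remark that the approach is not limited to transport proteins.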
After genes are assigned to putative biochemical processes or putative functional links are established between genes, there still remains a significant challenge to understand how these biomolecules interact to form pathways for metabolic conversion from one substance to another and how genes form networks to regulate the timing and location of events within the cell. In spite of some promising work using Boolean and Bayesian networks, all these approaches are challenged by having too many parameters (compared to the number of data points) to adequately constrain the problem, which results in too many plausible competing gene models. The complexity of this problem calls for novel methods that can integrate information from appropriate databases (e.g., gene expression, protein-protein interaction data, operon and regulon structure, transcription factors) in order to constrain the set of plausible solutions and to use this information for designing targeted experiments for the study of specific network modules. Our progress towards this goal includes the development of new and effective protocols for systematic characterization of regulatory pathways and a preliminary version of a computational pipeline for interpretation of multiple types of biological data for biological pathway inference5. We have also prototyped these methods on Synechococcus WH8102 to make several predictions, including a signaling/regulatory network for the phosphorus assimilation pathway. We have developed an environment6 to aid biologists in the analysis of proposed networks via visual browsing of the annotation information and original data associated with different elements of these networks.
Using these tools and a large visualization corridor with 48 high-resolution screens, we have been able to begin assigning biological processes to groups of genes that were aggregated together by similar expression, as measured by microarrays, and then placed into a Boolean network based on discrete (Boolean) expression levels.
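A Boolean network of this kind can be sketched in a few lines. The example is purely illustrative: the three genes, the expression values, the threshold, and the update rules are hypothetical and are not derived from the Synechococcus data. It shows the two steps named above, discretizing expression to on/off and then updating every gene from the current state.

```python
def discretize(expression, threshold):
    """Map continuous expression values to Boolean on/off levels."""
    return {gene: value > threshold for gene, value in expression.items()}

# Hypothetical rules: each gene's next value is a function of the current state.
rules = {
    "sensor":    lambda s: s["sensor"],      # external input, held constant
    "regulator": lambda s: s["sensor"],      # switched on by the sensor
    "target":    lambda s: s["regulator"],   # switched on by the regulator
}

def step(state, rules):
    """Synchronous update: all genes are recomputed from the same old state."""
    return {gene: rule(state) for gene, rule in rules.items()}

state0 = discretize({"sensor": 8.2, "regulator": 1.1, "target": 0.7}, threshold=2.0)
state1 = step(state0, rules)
state2 = step(state1, rules)
print(state0, state1, state2, sep="\n")
```

The signal needs two updates to reach the target gene, which is the kind of ordering information a Boolean model can expose for later testing.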
Structural characterization of protein machines provides additional valuable insights about the mechanism and details of their function. In spite of a long history behind the development of computational methods for structural characterization of protein machines, many challenges still remain to be addressed. Specifically, we are focusing on the methods for understanding how biological molecules interact physically to transfer signals, including protein-protein interactions as well as protein-DNA and protein-RNA interactions7. At the initial stage we apply structure prediction methods to determine protein fold families with our ROSETTA8 and PROSPECT9 programs and use inferred structural similarities to create hypotheses about their interacting partners. The computational pipeline merges several bioinformatics and modeling tools including algorithms for protein domain division, secondary structure prediction, fragment library assembly, and structure comparison7. Application of this pipeline has resulted in genome-scale structural models for Synechococcus genes10.
Protein interactions with other molecules play a central role in determining the functions of proteins in biological systems. Protein-protein, protein-DNA, protein-RNA, and enzyme-substrate interactions are a subset of these interactions that is of key importance in metabolic, signaling, and regulatory pathways. Bacterial chemotaxis, osmoregulation, carbon fixation, and nitrogen metabolism are just a few examples of the many complex processes dominated by such molecular recognition. Identification of interacting proteins is an important prerequisite to understanding their physiological function. We are developing several complementary approaches to this problem, based on statistically significant protein profiles3, signature kernel SVMs11, and genomic context5. These approaches have produced a genome-scale protein interaction map for Synechococcus WH8102.
Knowing interacting partners is just the first step towards elucidating the order and control principles of molecular recognition. One of the most remarkable properties of protein interactions is high specificity. Yet even presumably specific binding sites may bind a range of ligands with different compositions and shapes. For example, ribulose-1,5-bisphosphate carboxylase/oxygenase (Rubisco) can catalyze two separate reactions, carboxylation and oxygenation, depending on whether CO2 or O2, respectively, binds in its active site. There must then be functionally important residues that enable different proteins to recognize their unique interacting partners. Identification of these functionally important residues (e.g., docking interfaces, catalytic centers, substrate and cofactor binding sites, and hinge-motion controlling loops) is essential for functional characterization of protein machines. We explore several complementary approaches to identify protein docking interfaces based on quantification of correlated mutations3, Boosting-driven separation into “predictable” feature subspaces3, and statistically significant separation of likely n-mers6. Likewise, our Surface Patch Ranking method3 has been used to identify clusters of residues important for CO2/O2 specificity in Rubisco. Finally, our tools for full-atom modeling of protein interactions (LAMMPS, PDOCK)10 are being used to assess Rubisco catalytic specificity.
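One common way to quantify correlated mutations is the mutual information between two columns of a multiple sequence alignment. The abstract does not state which measure the authors use, so the sketch below is an assumption made for illustration, and the six-sequence alignment is hypothetical.

```python
import math
from collections import Counter

def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    pa, pb = Counter(col_a), Counter(col_b)
    pab = Counter(zip(col_a, col_b))
    return sum(
        (c / n) * math.log2((c / n) / ((pa[a] / n) * (pb[b] / n)))
        for (a, b), c in pab.items()
    )

# Hypothetical alignment: one sequence per row, one residue position per column.
alignment = ["AKLE", "AKLE", "GRLD", "GRLE", "AKLD", "GRLD"]
columns = list(zip(*alignment))

mi_01 = mutual_information(columns[0], columns[1])  # positions that always change together
mi_03 = mutual_information(columns[0], columns[3])  # positions that are weakly coupled
mi_02 = mutual_information(columns[0], columns[2])  # fully conserved partner column
print(round(mi_01, 3), round(mi_03, 3), round(mi_02, 3))
```

Positions 0 and 1 substitute together in every sequence and score highest, while a fully conserved column carries no covariation signal. Real analyses also have to correct for phylogenetic bias and small sample sizes, which this sketch ignores.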
All these tools and experiments generate heterogeneous data (e.g., sequence, structure, biochemical, physiological, and genetic data). Integrative analysis of these data will yield major insights into the carbon sequestration behavior of Synechococcus sp. For this to occur, low-level molecular data and processes need to be connected to macro-ecological models. We are doing this in a Hierarchical Simulation Platform12 that uses recently published models connecting metabolic rate to population growth and trophic-level energy flux. Additionally, we have built a web portal for the Synechococcus Encyclopedia13 that enables efficient access to, management and curation of, and computation with these data, as well as comparative analysis with other microbial genomes.
References
- http://genome.ornl.gov/microbial/syn_wh
- Grant Heffelfinger (gsheffe@sandia.gov) and Al Geist (gst@ornl.gov); http://www.genomes2life.org
- Nagiza Samatova (samatovan@ornl.gov)
- Eric Jakobsson (jake@ncsa.uiuc.edu)
- Ying Xu (xyn@ornl.gov), Brian Palenik (bpalenik@ucsd.edu), and Dave Haaland (dmhaala@sandia.gov); see “Microarray Analysis with Hyperspectral Imaging, Experimental Design, and Multivariate Data Analysis” and “Methods for Elucidating Synechococcus Regulatory Pathways” posters
- George S Davidson (gsdavid@sandia.gov)
- Andrey Gorin (agor@ornl.gov); see also “Bioinformatics Methods for Mass Spect Analysis” poster
- Charlie Strauss (cems@lanl.gov)
- Ying Xu (xyn@ornl.gov) and Dong Xu (xudong@missouri.edu); http://compbio.ornl.gov/structure/prospect/
- Ying Xu (xyn@ornl.gov); http://compbio.ornl.gov/PROSPECT/syn/
- Mark D. Rintoul (mdrinto@sandia.gov) and Anthony Martino (martino@sandia.gov); see also “Modeling Cellular Response” poster
- Damian Gessler (ddg@ncgr.org), Andrea Belgrano (ab@ncgr.org), and Peter Steadman (ps@ncgr.org).
- Al Geist (gst@ornl.gov) and Nagiza Samatova (samatovan@ornl.gov); see also “The Synechococcus Encyclopedia” poster.
16
Computational Inference of Regulatory Networks in Synechococcus sp. WH8102
Zhengchang Su1, Phuongan Dam1, Hanchuan Peng1, Ying Xu1 (xyn@bmb.uga.edu), Xin Chen2, Tao Jiang2, Dong Xu3, Xuefeng Wan3, and Brian Palenik4
1University of Georgia, Athens, GA and Oak Ridge National Laboratory, Oak Ridge, TN; 2University of California, Riverside, CA; 3University of Missouri, Columbia, MO; and 4University of California, San Diego, CA
In living systems, control of biological function occurs at the cellular and molecular levels. These controls are implemented by regulating the activities and concentrations of the species taking part in biochemical reactions. The complex machinery for transmitting and implementing the regulatory signals consists of networks of interacting proteins, called regulatory networks. Characterization of these regulatory networks or pathways is essential to our understanding of biological functions at both the molecular and cellular levels.
We have been developing a prototype system for computational inference of regulatory and signaling pathways for the genome of Synechococcus sp. WH8102. Currently, the prototype system consists of the following components: (a) prediction of gene functions, (b) prediction of terminators of operon structures, (c) genome-scale prediction of operon structures, (d) genome-scale prediction of regulatory binding sites, (e) mapping of orthologous genes across related microbial genomes, (f) prediction of protein-protein interactions, (g) mapping of biological pathways across related genomes, and (h) inference of pathway models through fusing the information collected in steps (a) through (g).
- Computational prediction of gene functions. We have previously developed a computational pipeline for inference of protein structures and functions at genome scale (Shah, et al. 2003 and Xu, et al. 2003). The pipeline consists of both sequence-based homology detection programs such as PSI-BLAST and the structure-based homology detection program PROSPECT (Xu, et al. 2000). We found that by using the structure-based approach in addition to PSI-BLAST, we can detect an additional 10-20% of remote homologs for genes in a microbial genome. This pipeline can be accessed at http://compbio.ornl.gov/cgi-bin/PROSPECT/PROSPECT-Pipeline/proteinpipeline_form.cgi. We have applied this pipeline to all the ORFs of Synechococcus sp. WH8102 and assigned close to 80% of its genes to some level of function. All the results can be found at http://compbio.ornl.gov/PROSPECT/syn/.
- Prediction of terminators of operons. We are in the early phase of developing a computational capability for prediction of terminators in WH8102. Our initial focus has been on rho-independent terminators (RITs). We are carrying out a comparative genome analysis to compare the RITs of orthologous genes in different genomes and identify possible conserved patterns. We are also developing and implementing a novel algorithm based on minimum spanning tree (MST) clustering approaches that uses common features of RITs to predict new RITs. We will apply more sophisticated energy functions than the ones used in TransTerm and RNAMotif.
- Prediction of operons. We have been working on a comparative genomics approach for predicting operons in Synechococcus sp. WH8102 that combines many known characteristics of operon structure, including the functions, intergenic distances, and transcriptional directions of genes, as well as promoters and terminators, in a unified likelihood framework (Chen, et al. 2003). The data and results are available to the public at http://www.cs.ucr.edu/~xinchen/operons.htm. We have used the predicted operons as one piece of information in our inference of regulatory pathways in Synechococcus sp. WH8102.
- Prediction of regulatory binding sites. We have previously developed a computer program CUBIC for identification of consensus sequence motifs as possible regulatory binding sites (Olman et al. 2003). CUBIC solves the binding site identification problem as a problem of identifying data clusters from a noisy background. We have applied CUBIC for binding site predictions at genome scale, through identifying orthologous genes of WH8102 in other related genomes and application of CUBIC to the upstream regions of each set of orthologous genes. We expect that the genome-scale binding site prediction results will be publicly available within weeks.
- Mapping of orthologous genes across related genomes. The identification of orthologous genes is a fundamental problem in comparative genomics and evolution, and is very challenging, especially on a genome scale. We have been working on a new approach for assigning orthologs between different (but related) genomes based on homology search and genome rearrangement. The preliminary experimental results on simulated and real data demonstrate that the approach is very promising (it is competitive with the existing methods), although more needs to be done (Chen, et al. 2004).
- Prediction of protein-protein interactions. We have implemented software for predicting protein-protein interactions that employs a number of popular prediction strategies, including mapping against protein-protein interaction maps derived from experiments (such as two-hybrid), phylogenetic profile analysis (Pellegrini et al. 1999), and the gene fusion method (Marcotte et al. 1999). We have made a genome-scale prediction of protein-protein interactions for Synechococcus sp. genes.
- Mapping of biological pathways across related genomes. We have recently developed a computational method for mapping of biological pathways across related microbial genomes. The core component of the algorithm/program is orthologous gene mapping under the constraints of (a) operon structures, (b) regulon structures (defined in terms of operons with common regulatory binding sites), and (c) co-expression of genes. We have implemented this algorithm as a computer program, P-MAP, and applied this program to assign all known pathways in E. coli (partial or complete) to Synechococcus sp. WH8102. The mapping results will soon be posted at our Synechococcus Knowledge Database at http://csbl.bmb.uga.edu/~peng/home.html.
- Pathway inference through information fusion. We have developed a computational protocol for inference of regulatory and signaling pathways that fuses the information collected in steps (a) through (g) and uses a simple merging-voting scheme to put together the predicted complexes, protein-DNA interactions, and protein-protein interactions. We have applied this capability to the prediction of a number of regulatory pathways, including the phosphorus assimilation pathway [ref] and the nitrogen and carbon assimilation pathways [unpublished results] of Synechococcus sp. WH8102. We are currently exploring a number of formalisms for piecing together predicted gene associations into pathway models, including Biochemical Systems Theory (BST) (Savageau 1976).
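A merging-voting step of the kind mentioned in the last item might look like the following sketch. The abstract does not specify the scheme, so this is an assumption: each evidence source proposes gene-gene associations, the proposals are merged, and an association is kept only if enough sources support it. The source names, gene pairs, and the two-vote cutoff are illustrative only.

```python
from collections import Counter

# Hypothetical associations proposed by four of the prediction steps.
evidence = {
    "operon":          {("phoB", "phoR"), ("pstS", "pstC")},
    "binding_site":    {("phoB", "pstS"), ("phoB", "phoR")},
    "ppi":             {("phoB", "phoR"), ("phoB", "pstS"), ("phoR", "pstC")},
    "pathway_mapping": {("pstS", "pstC")},
}

def vote(evidence, min_votes=2):
    """Merge all proposed pairs and keep those backed by at least min_votes sources."""
    counts = Counter()
    for pairs in evidence.values():
        for a, b in pairs:
            # Treat (a, b) and (b, a) as the same undirected association.
            counts[tuple(sorted((a, b)))] += 1
    return {pair: n for pair, n in counts.items() if n >= min_votes}

links = vote(evidence)
print(sorted(links.items()))
```

Requiring agreement between independent sources trades sensitivity for confidence; a weighted vote would be a natural refinement when some sources are known to be more reliable than others.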
Experimental validations of predictions are being carried out using microarray analyses (see Haaland et al. poster) of wild type WH8102 and knockout mutants under selected growth conditions.
ACKNOWLEDGEMENTS. This work is funded in part by the US Department of Energy’s Genomics:GTL program (genomicsgtl.energy.gov) under the project “Carbon Sequestration in Synechococcus sp.: From Molecular Machines to Hierarchical Modeling” (www.genomes-to-life.org).
References
- X. Chen, Z. Su, P. Dam, B. Palenik, Y. Xu, and T. Jiang. Operon prediction by comparative genomics: an application to the Synechococcus sp. WH8102 genome. 2003, submitted to Nuc. Acids Res. (in revision).
- X. Chen, J. Zheng, Z. Fu, P. Nan, Y. Zhong, S. Lonardi, and T. Jiang. Assignment of orthologous genes via genome rearrangement. 2004, submitted to ISMB’2004.
- E. M. Marcotte, M. Pellegrini, H. L. Ng, D. W. Rice et al., Detecting protein function and protein-protein interactions from genome sequences, Science, 285:751-753, 1999.
- V. Olman, D. Xu and Ying Xu, “Identification of Regulatory Binding-sites using Minimum Spanning Trees”, Proceedings of the 7th Pacific Symposium on Biocomputing (PSB), pp 327-338, 2003.
- M. Pellegrini, E. M. Marcotte, M. J. Thompson, D. Eisenberg et al., Assigning protein functions by comparative genome analysis: protein phylogenetic profiles, Proc Natl Acad Sci USA, 96:4285-4288, 1999.
- M. Shah, S. Passovets, D. Kim, K. Ellrott, L. Wang, I. Vokler, P. Locascio, D. Xu, Ying Xu, A Computational Pipeline for Protein Structure Prediction and Analysis at Genome Scale, Proceedings of IEEE Conference on Bioinformatics and Biotechnology, 3-10, IEEE/CS Press, 2003 (An expanded journal version is published in Bioinformatics, 19(15):1985-1996, 2003).
- M. A. Savageau. “Biochemical Systems Analysis: A Study of Function and Design in Molecular Biology,” Addison-Wesley, Reading, Mass (1976).
- Z. Su, P. Dam, X. Chen, V. Olman, T. Jiang, B. Palenik, and Ying Xu, Computational Inference of Regulatory Pathways in Microbes: an application to the construction of phosphorus assimilation pathways in Synechococcus WH8102, Proceedings of 14th International Conference on Genome Informatics, pp. 3-13, Universal Academy Publishing, 2003.
- D. Xu, P. Dam, D. Kim, M. Shah, E. Uberbacher, and Ying Xu, Characterization of Protein Structure and Function at Genome-scale using a Computational Prediction Pipeline, accepted to appear in Genetic Engineering: Principles and Methods, Vol 32, Jane Setlow (Ed.), Plenum Press, 2003.
- Y. Xu and D. Xu. Protein threading using PROSPECT: Design and evaluation. Proteins: Structure, Function, and Genetics. 40:343-354. 2000.