Creating an Integrated Computational Environment for Biology
An Essential Foundation
GTL Integrated Computational Environment for Biology: Using and Experimentally Annotating GTL’s Dynamic Knowledgebase. At the heart of this infrastructure is a dynamic, comprehensive knowledgebase with DNA sequence code as its foundation. Offering scientists access to an array of resources, it will assimilate a vast range of microbial data and knowledge as it is produced.
Computation is essential to the GTL program goal of achieving a predictive understanding of microbial cell and community systems. Computing and information technologies allow us to surmount the barrier of complexity that separates genome sequence from biological function. The integrated GTL computational environment will link data of unprecedented scale, complexity, and dimensionality with theory, modeling, simulation, and experimentation to derive principles and develop and test biosystems theory. GTL computation will employ data-intensive bioinformatics, compute-intensive molecular modeling, and complexity-dominated cellular systems modeling.
Models and simulations represent an ultimate level of integrated understanding. A key goal for cell modeling is to predict cell phenotype from the cell’s genotype and extracellular environmental information. Such predictions, resulting from comparative genomic studies, will include cell ultrastructure, morphology, motility, metabolism, life cycle, and behavior under a wide range of environmental conditions. These models not only will be descriptive and phenomenological but also will be predictive at multiple levels of detail. Although this vision is still a distant goal, we can take important steps within our current scope of understanding and create experimental and computational capabilities that will have dramatic near-term impact. Even simple models can be used to help guide experiments, and the results of iterating among theory, modeling and simulation, and experimentation will enable us to develop (albeit slowly at first) an integrated understanding of cellular systems. This understanding undoubtedly will be framed initially in some qualitative form, but over time and with additional experiments and improved analysis methodologies, it will become much more quantitative.
A comprehensive knowledgebase will be at the heart of GTL systems microbiology. The knowledgebase foundation is the DNA sequence code that will relate the many data sets emanating from microbial systems biology research and discovery. Building over time to an intensely detailed and annotated description of microbial functions, the GTL Knowledgebase will assimilate a vast range of microbial data as it is produced. It will grow to encompass program and facility data and information, metadata, experimental and simulation results, and links to relevant external data and tools. Underlying the knowledgebase will be an array of databases, bioinformatics and analysis tools, modeling programs, and other transparent resources.
Some examples of core capabilities required by the GTL program follow.
- Bioinformatics: Collecting and Analyzing Data on Cellular Components. The term “bioinformatics” includes a range of computational analyses characterized in part by reliance on data, especially genomic and proteomic data, as the critical investigative feature. Sequence analysis, largely the prediction of genes and gene function by homology, has been a core task. GTL will generate many data types, such as measurements of protein complexes, protein expression, and microbial cell and community metabolic capabilities. Vast new data sets must be correlated or annotated to genome data and archived to provide foundational data for computer models of biochemical pathways, entire cells, and, ultimately, microbial ecosystems.
- Molecular Measurements and Modeling: Revealing Processes Carried Out by Cellular Components. GTL seeks to understand fully the cell’s biological machinery and its relationships with other cells and the environment. To reach this goal, investigators must know and be able to computationally model and test concepts in which cellular components interact directly with each other and with other molecules in a cell. They also must know how proteins dock structurally to form a complex and how the proteins of a complex interact dynamically to accomplish a biological function. For example, detailed characterization of protein complexes is the prerequisite for understanding the functions of molecules, cells, regulatory complexes, and networks as well as the interactions of cell surface proteins and complexes with the environment.
- Cell and Community Modeling: Coalescing the Cell’s Components into a Whole-Systems Predictive Understanding. Biosystem models encapsulate our understanding of biology, and simulation is becoming a key tool for furthering understanding at the systems level. Through computational analysis of predictive mathematical models, we will understand how microbial organisms and communities may be manipulated to solve problems, how microbes regulate the expression of genes involved in environmental interactions, and how protein complexes are assembled to carry out important processes. Predictive models also will prove most useful in integrating and summarizing the vast amounts of data to be generated by the GTL program.
The computational biology environment will provide the networking “nervous system” to connect experimental and computational facilities with the large, geographically dispersed community of biology researchers, advancing collaboration and education. The environment will make tractable the project’s inherent science diversity and its expected scale and duration. Computing will be tailored to meet the needs of biological research, with transparent, readily available tools linked to high-quality and interoperable databases.
DOE’s experience and capabilities in harnessing computing for science goals already have led to such breakthroughs in biology as annotation and sequence-assembly tools. GTL also will leverage biocomputing developments in other agencies and institutions to contribute to the creation of sophisticated concepts and tools for advancing systems biology worldwide.
This suite of computational biology pages describes the attributes and uses of the community-accessible GTL computational environment, presenting the strategy and roadmaps for establishing essential capabilities that tie together GTL scientists and research facilities. As described in the supporting roadmaps, establishing these capabilities will be part of a rigorous development process involving the scientific community, other federal agencies, and industry.
Capabilities for an Integrated Computational Environment
To support the achievement of its science and mission goals, GTL must establish a number of essential elements in building the program’s computational environment. These components include a seamless set of foundational capabilities to support the pinnacle capability of theory, modeling, and simulation. They include a rigorous and transparent system for tracking, capturing, and analyzing data with a computing and information infrastructure accessible to the scientific community.
- Theory, Modeling, and Simulation Coupled to Experimentation of Complex Biological Systems: Build concepts and models of microbial cells and communities that capture and extend our knowledge, based on a combination of experimental data types. Test and validate component models and use integrated models to understand mechanisms and explore new hypotheses or conditions to design experimental campaigns.
- Sample and Experimental Tracking and Documentation – Laboratory Information Management Systems (LIMS) and Workflow Management: Provide systems for experiment design, sample specification, sample tracking and metadata recording, workflow management, process optimization and documentation, quality assurance (QA), and sharing of such data across facilities or projects.
- Data Capture and Archiving: Capture bulk data from many different measurements and instruments in large-scale data archives.
- Data Analysis and Reduction: Provide analysis capabilities for systems biology data to enable insights, input, and parameters for systems models and simulations.
- Computing and Information Infrastructure: Furnish hardware and software environments to support analysis, data storage, and modeling and simulation at the scales required in GTL.
- Community Access to Data and Resources: Provide community access to data, models, simulations, and protocols for GTL. Allow users to query and visualize data, use models, run simulations, update and annotate community data, and combine community data and models with their local databases and models.
A series of computational biology workshops has been held for the Genomics:GTL program.
- Creating an Integrated Computational Environment for Biology; 2005 GTL Roadmap (PDF, 1357 kb)
- Report on Three Genomes to Life Workshops: Data Infrastructure, Modeling and Simulation, and Protein Structure Prediction; Gaithersburg, Maryland; July 22–24, 2003 (PDF file, 834 kb)
- Mathematics for GTL Workshop; Gaithersburg, Maryland; March 18-19, 2002 (PDF file, 266 Kb)
- Executive Summary (PDF file, 23 Kb)
- Computer Science for GTL Workshop; Gaithersburg, Maryland; March 6-7, 2002 (PDF file, 272 Kb)
- Executive Summary (PDF file, 32 Kb)
- Computational Infrastructure Workshop; January 22-23, 2002 (HTML or PDF file, 497 Kb); presentations available
- Presentation by Gary Johnson, October 26, 2001, to the Office of Advanced Scientific Computing Research Advisory Committee in HTML (web) and PowerPoint
- Visions for Computational and Systems Biology Workshop for the Genomes to Life Program; September 6-7, 2001 (Notes with Executive Summary, PDF file, 48 Kb)
- Executive Summary (PDF file, 23 Kb)
- First GTL Computational Biology Workshop; August 7-8, 2001 (HTML files or PDF file, 656 Kb)
- Executive Summary (PDF file, 18 Kb)


