Milestone 3: Develop the Knowledgebase, Computational Methods, and Capabilities to Advance Understanding and Prediction of Complex Biological Systems
Research Highlights for Milestone 3: Computing
Background and Strategy
GTL’s central goal is to provide the necessary technologies, computing infrastructure, and comprehensive knowledgebase to surmount the barrier of complexity that prevents translation of genome sequence directly into predictive understanding of function. Genome sequences furnish the blueprint, technologies can produce the data, and computing can relate enormous data sets to models of process and function.
The ultimate goal of every science is to achieve such a complete understanding of a phenomenon that a set of mathematical laws or models can be developed to predict accurately all its relevant properties. Although such capabilities now exist for certain areas of physics, chemistry, and engineering, virtually no biological systems are understood at this level of detail and accuracy. Because theory and computation are limited by the lack of experimental data and the means to verify models quantitatively, their application has had relatively little impact on biology. With the developments described in this plan, the biosciences are poised for rapid progress toward becoming the quantitative and predictive science known as systems biology.
Models can form the foundation for understanding complex systems. They can be applied to such useful ends as developing biological sources of clean energy, cleaning up toxic wastes, and understanding the roles of microbial communities in ocean and terrestrial carbon cycling (i.e., how they sequester carbon and how the processes involved respond to and impact climate change). The key challenge to achieving GTL goals will be development of capabilities for modeling and simulation—capabilities that must be coupled tightly with experimental methods to identify and characterize biological components, their interactions, and the products of those interactions.
The program’s computational component will require developments ranging from more-efficient modeling tools to fundamental breakthroughs in mathematics and computer science, as well as algorithms that efficiently use platforms ranging from workstations to the fastest available computers.
GTL Knowledgebase
Analysis of each new microbe benefits from insights gained through knowledge about all other microbes and life forms. To achieve our goal of understanding life’s full complexity, we can take advantage of the unity of life and its evolutionary history brought about by the common hereditary molecule DNA and the underlying principles and instructions it encodes. Nature’s simplifying principles make this a powerful strategy. Just as a finite number of rules determine the structure and function of proteins, so the higher-order functions of cells seem to emanate from another finite set of principles and interactions. Once successful machines, pathways, and networks arise by evolution, they tend to be preserved, subtly modified and optimized, and then reused as variations on enduring themes throughout many organisms and species. Thus, accumulating detailed information on numerous microbes across a wide range of functionality ultimately will provide the insight needed to interpret these principles. Comparative genomics, founded on these principles, will allow us to predict the functions of unknown microbes by deriving a working model of a cell from its genetic code.
Comparative genomics has been a powerful computational technique in the genome-sequencing era, yielding insights into gene function that lead to discoveries and support prediction and hypothesis development. In this new era of systems biology, all-against-all comparisons of much more extensive microbial data amassed in the GTL Knowledgebase (see figure above) will accelerate and sharpen our research strategies. As the foundation of the knowledgebase, the DNA sequence code will relate the many data sets emanating from microbial systems biology research and discovery. Over time, an intensely detailed description of each gene’s function and regulatory elements will be created, from which networks and subsystems, and eventually cellular and community structure and functionality, can be derived. As these capabilities improve, the focus of the experimental process and research resources can shift from the study of unknown components and functions to the development of a new generation of capabilities for probing system functions by such methods as predicting, testing, and manipulating the role of microbes in ecosystems or designing systems for biofuel production. Along with high-throughput facilities and computing, this strategy is a key element of our approach to reducing a microbial system’s analysis time from many years to months.
Given a knowledgebase in which many genes from organisms are highly annotated with functional data and cross-referenced to each other, much information about a newly sequenced genome will be at scientists’ fingertips.
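The idea of reading a new gene against an annotated knowledgebase can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the sequences and function labels are invented, the names `kmers`, `jaccard`, and `transfer_annotation` are hypothetical, and shared k-mer content stands in for the sequence-alignment methods that real comparative genomics relies on. It is not a description of GTL software.

```python
from __future__ import annotations


def kmers(seq: str, k: int = 3) -> set[str]:
    """Return the set of overlapping k-letter substrings in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def transfer_annotation(query: str, knowledgebase: list[dict],
                        k: int = 3, threshold: float = 0.5):
    """Return (function, score) of the best-matching annotated entry.

    The function label is None when no entry reaches the (arbitrary,
    illustrative) similarity threshold.
    """
    query_kmers = kmers(query, k)
    best_function, best_score = None, 0.0
    for entry in knowledgebase:
        score = jaccard(query_kmers, kmers(entry["sequence"], k))
        if score > best_score:
            best_function, best_score = entry["function"], score
    if best_score >= threshold:
        return best_function, best_score
    return None, best_score


# Invented example entries; not real proteins or annotations.
toy_knowledgebase = [
    {"sequence": "MKTAYIAKQRQISFVKSHFSRQ", "function": "hypothetical kinase"},
    {"sequence": "MSDNELLKKAAEEGGHHTTPPW", "function": "hypothetical transporter"},
]

# A query differing from the first entry by one residue inherits its label.
print(transfer_annotation("MKTAYIAKQRQISFVKSHFSRA", toy_knowledgebase))
# A query sharing no k-mers with any entry stays unannotated.
print(transfer_annotation("WWWWWWWW", toy_knowledgebase))
```

In this sketch the threshold guards against transferring a label from a weak match, which mirrors the need to verify predictions experimentally before they are treated as annotation.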
Researchers will use the knowledgebase with computation and modeling to drive hypothesis formulation, experiment design, and data collection. The interoperable, open-access knowledgebase will enable quick deduction of any gene’s function or complex biochemical pathways of interest. Insights gained in these studies will help transform biology into a more quantitative and predictive science based on models that synthesize observations, theory, and experimental results. This paradigm combines both discovery and computationally driven hypothesis science as we navigate massive data sets to reveal unforeseen properties and phenomena and derive insights from previously unfathomable complexity.
Building over time to an intensely detailed and annotated description of microbial functional capabilities, the GTL Knowledgebase will assimilate a vast range of microbial data as they are generated. The knowledgebase will grow to encompass program and facility data and information, metadata, experimental simulation results, and links to relevant external data. It also will incorporate existing microbial data, including data on model microbial systems such as Escherichia coli and Bacillus subtilis, to take advantage of the extensive understanding of these organisms. The power of conservation in biology will be used to leverage and extend our partial knowledge about a few organisms to a more complete understanding of many microbes and their communities. Underlying the GTL Knowledgebase will be an array of databases, bioinformatics and analysis tools, modeling programs, and other transparent resources.
Elements of the GTL Integrated Computational Environment for Biology
To support the achievement of GTL science and mission goals, a number of essential elements of the computational environment will be established. They will include a seamless set of foundational experimentation capabilities to support the pinnacle capability of theory, modeling, and simulation. Especially needed is a rigorous and transparent system for tracking, capturing, and analyzing data within a computing and information infrastructure accessible to the scientific community as end users of the data. The enabling environment for GTL computation consists of six complementary technical components, each with its own supporting roadmap:
- Theory, Modeling, and Simulation
- LIMS and Workflow Management
- Data Capture and Archiving
- Data Analysis and Reduction
- Computing and Information Infrastructure
- Community Access to Data and Resources
Synergisms with Other Agencies and Industries
GTL will leverage information and methods from a variety of sources, including
- Protein structures produced in the Protein Structure Initiative of the National Institutes of Health (NIH)
- Protein Data Bank
- Databases of metabolic processes such as KEGG and WIT
- Hosts of available analytical tools in such areas as molecular dynamics, mass spectrometry, and pathway modeling and simulation
- NIH National Center for Biotechnology Information data and tools
- Industrial vendors and tool developers
Successful production of new technologies and advanced tools for computational biology will require the sustained efforts of multidisciplinary teams, teraflop-scale and faster computers, and considerable user expertise.
Encompassing the entire biological community, this task will involve many institutions and federal agencies, led in many aspects by NIH and the National Science Foundation. A central component of GTL will be the establishment of effective partnerships with these and other agencies and with commercial entities to ensure the widespread adoption of computational tools and standards and to eliminate redundant work.
