Report on the Computational Infrastructure Workshop (January 22-23, 2002) for the Genomes to Life Program
Introduction
Genomes to Life (GTL), a new program within DOE, seeks to identify and characterize the molecular machines of life, the gene regulatory networks that control them, and complex microbial communities. Cutting across these goals is the need to develop high-performance computational analysis and modeling capabilities and an infrastructure to support them.
DOE has long played a leading role in exploiting high-performance computing to accelerate advances in many scientific and application areas, including computational biology. Computational biology, however, has an unprecedented range of computing needs that make a well-planned infrastructure essential to achieving GTL's ambitious goals. These needs range from informatics analysis on a diverse collection of distributed data sets produced by a variety of experimental methods, to simulations that consume months of supercomputer time, to the study of biological phenomena that no one yet knows how to model. The infrastructure for biology applications thus must not only provide high-speed computation for large-scale calculations but also be compatible with much smaller scale calculations carried out on individual investigators' desktops. Such an infrastructure should be flexible, adaptable, and responsive to biology's evolving needs.
A workshop was held January 22-23, 2002, in Gaithersburg, Maryland, to analyze and document computational needs for the successful execution of the GTL program. The 34 attendees addressed questions in several key areas to identify the resources required and formulate a plan. This report provides a vision and some specific actions recommended to reach these goals.
The principal finding of workshop attendees was that only through computational infrastructure tuned and dedicated to the needs of biologists, coupled with new enabling technologies and applications, will it be possible to "move up the biological complexity ladder" and tackle the next generation of challenges.
Importance of Computational Infrastructure to GTL Program
High-performance computing is essential to the type of high-throughput experimental biology that has emerged in the last 10 years. This was demonstrated by the success of the most visible such application to date: genomic sequencing. Sequence assembly and annotation have greatly extended the scale of bioinformatics and have spurred a huge investment in, and a significant role for, high-performance computing. Large computer farms have been established to provide capability for bioinformatics applications at numerous research institutions, including private companies, government laboratories, and universities. Computational needs for the next generation of challenges, however, may require a tighter coupling among processors.
As evidenced by GTL, biology is undergoing a major transformation that will be enabled and ultimately driven by computation. This can occur, however, only if an appropriate computational infrastructure is established. The data analysis and models required to understand molecular machines and microbial communities will become more computationally complex and heterogeneous and will require coupling to enormous amounts of experimentally obtained data. Such unprecedented problems can easily exceed the capabilities of next-generation (petaflop) supercomputers. The following table, presented at the workshop, illustrates this point.
| Problem | Computing Speed | Storage | Network |
|---|---|---|---|
| Genome Assembly | >10 Tflops to keep up with sequence rates | 300 TB per genome | 100 MB/s |
| Protein Structure Prediction | >100 Tflops per protein set in a single microbe | Petabytes | 500 Mb/s |
| Classical Molecular Dynamics | 100 Tflops for each DNA-protein interaction | 10s of petabytes | 2.4 Gb/s |
| First Principles Molecular Dynamics | 1 Pflops per reaction in enzyme active site | 100s of petabytes | 10 Gb/s |
| Simulation of Biological Networks | >1 Tflops for small biological network | 1000s of petabytes | unknown |
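
To give a rough sense of how the storage and network columns relate, the short sketch below estimates how long it would take to move a data set of the listed size over a link of the listed speed. It is illustrative arithmetic only, not part of the workshop material. It assumes decimal units (1 TB = 10^12 bytes), reads "MB/s" as megabytes per second and "Mb/s" as megabits per second, and treats "Petabytes" in the protein structure row as a lower bound of 1 PB. The function name `transfer_days` is hypothetical.

```python
# Back-of-the-envelope transfer times for two rows of the table above.
# Assumptions (not stated in the report): decimal units, "MB/s" = megabytes
# per second, "Mb/s" = megabits per second, and sustained full-rate transfer.

SECONDS_PER_DAY = 86_400
TB = 1e12          # bytes in a terabyte (decimal)
PB = 1e15          # bytes in a petabyte (decimal)
MB_PER_S = 1e6     # bytes per second in 1 MB/s
Mb_PER_S = 1e6 / 8 # bytes per second in 1 Mb/s


def transfer_days(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Days needed to move size_bytes at a sustained rate_bytes_per_s."""
    return size_bytes / rate_bytes_per_s / SECONDS_PER_DAY


# Genome assembly row: 300 TB per genome over a 100 MB/s link.
genome_assembly_days = transfer_days(300 * TB, 100 * MB_PER_S)

# Protein structure prediction row: at least 1 PB (assumed lower bound)
# over a 500 Mb/s link.
protein_structure_days = transfer_days(1 * PB, 500 * Mb_PER_S)

print(f"Genome assembly: about {genome_assembly_days:.1f} days")
print(f"Protein structure prediction: at least {protein_structure_days:.1f} days")
```

Under these assumptions, moving a single data set end to end takes on the order of weeks to months, which is consistent with the report's point that data volumes, not only processing speed, shape the infrastructure requirements.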