Data Analysis and Reduction
Data Analysis and Reduction Roadmap
Objective: Provide analysis capabilities for systems biology data that yield insights, input, and parameters for systems models and simulations.
Bioinformatics encompasses a range of computational analyses characterized in part by reliance on data, especially genomic and proteomic data, as the central feature. Sequence analysis, largely the prediction of genes and gene function by homology, has been a core task.
In GTL, however, bioinformatics describes a broader set of investigations that will consider a wide variety of data types and sources: genome sequences, proteomics, metabolomics, expression, pathways, and simulation data. Many challenges are emerging as the amount and complexity of data increase exponentially and analyses across multivariate data sets become more complex. Many of these analyses can no longer be supported by local computing capabilities.
Infrastructure
Data-analysis infrastructure will support an environment for creating and managing sophisticated, distributed data-mining processes. The unprecedented amount and complexity of biological data make computational analysis a key component of GTL (and of systems biology in general). By developing the necessary tools and tool frameworks, GTL will allow biologists to derive inferences from massive amounts of heterogeneous and distributed biological data. Using intuitive visual interfaces, developers and data analysts will be able to program new data-mining applications or open existing application templates that can easily be customized to a given problem’s unique requirements. Such processes will have streamlined application and web-based interfaces. The infrastructure should encompass a large repository of analysis modules, including modules for sequence analysis, gene expression, phylogenetic trees, and mass spectrometry.
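As a rough illustration of what a repository of composable analysis modules might look like, the following minimal Python sketch registers small analysis functions under names and chains them into a pipeline. The registry, the decorator, the module names, and the example sequence are all hypothetical and chosen only for illustration; they do not describe actual GTL software.

```python
from typing import Any, Callable, Dict, List

# Hypothetical module repository: maps a module name to an analysis function.
REGISTRY: Dict[str, Callable[[Any], Any]] = {}


def register(name: str):
    """Return a decorator that stores a function in the registry under `name`."""
    def wrap(func: Callable[[Any], Any]) -> Callable[[Any], Any]:
        REGISTRY[name] = func
        return func
    return wrap


@register("reverse_complement")
def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA sequence made of A, C, G, and T."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[base] for base in reversed(seq.upper()))


@register("gc_content")
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a nucleotide sequence (0.0 if empty)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def run_pipeline(steps: List[str], data: Any) -> Any:
    """Apply registered modules in order, feeding each output to the next."""
    for name in steps:
        data = REGISTRY[name](data)
    return data


# Example: an analyst customizes a pipeline by listing module names.
revcomp = run_pipeline(["reverse_complement"], "ATGCGC")
gc = run_pipeline(["reverse_complement", "gc_content"], "ATGCGC")
```

A template in this sketch is simply a list of module names, so customizing an analysis means editing that list rather than rewriting code.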
An objective of GTL is to provide high-throughput experimental data that can be used for rapid functional annotation of genomes. Understanding functions of microbes and microbial communities depends critically on the ability to develop and validate models and drive simulations based on experimental data. Massive data sets must be incorporated into systems simulations and models to infer the function of genes and proteins. Such analyses will require advances in mathematical methods and algorithms capable of incorporating experimental data produced by a variety of techniques, including NMR, MS, X-ray, neutron scattering, various microscopies, biofunctional assays, and many more. GTL will develop the methodology necessary for seamless integration of distributed computational and data resources, linking both experiment and simulation and taking steps to ensure that high-quality, complete data sets are linked to the validation of models of metabolic pathways, regulatory networks, and whole-cell functions.
Sequence annotation and comparative analyses across multiple genomes are recurring computational tasks. They require a high-performance computing infrastructure, both to keep annotations current through regular information updates and to facilitate interactive exploratory genome analyses. Finding regulatory elements, an unsolved research problem in even the simplest genomes, is expected to involve significant computational and mathematical challenges. Some analysis of regulatory regions can be accomplished by large-scale genome comparisons. Significant research challenges remain in high-level annotation, including assignment of functions to every gene found in whole-genome sequences. This is particularly difficult because pathway databases are incomplete and microbial genomes encode metabolic pathways about which very little biochemical data exist. At this time, 40 to 60% of genes found in new genomic sequences do not have assigned functions. Some functions can be inferred by computational structure determination and protein folding, but a wide range of research problems remain to be solved in this area. Computational methods will have a major role in the functional annotation of genomes, a necessary first step in developing higher-level models of cellular behavior. GTL will continue development of automated methods for the structural and functional annotation of whole genomes, including research into new approaches such as evolutionary methods to analyze structure and function relationships.
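To make the idea of homology-based function assignment concrete, here is a toy Python sketch that transfers a function label from the most similar reference sequence and reports "unknown" when nothing is similar enough. It uses k-mer Jaccard similarity as a simple stand-in for real alignment-based homology search; the sequences, function labels, and the 0.5 threshold are invented for illustration and are not taken from any database.

```python
def kmers(seq, k=3):
    """Set of all overlapping substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def annotate(query, reference, k=3, min_sim=0.5):
    """Return the function of the most similar reference sequence.

    Falls back to "unknown" when the best similarity is below min_sim,
    mirroring genes that cannot be assigned a function by similarity.
    """
    q = kmers(query, k)
    best_function, best_sim = None, 0.0
    for ref_seq, function in reference.items():
        sim = jaccard(q, kmers(ref_seq, k))
        if sim > best_sim:
            best_function, best_sim = function, sim
    return best_function if best_sim >= min_sim else "unknown"


# Made-up protein sequences and labels, for illustration only.
reference = {
    "MKTAYIAKQRQISFVK": "hypothetical kinase",
    "MSLLTEVETPIRNEWG": "hypothetical transporter",
}

close_hit = annotate("MKTAYIAKQRQISFVR", reference)  # differs in one residue
no_hit = annotate("GGGGGGGG", reference)             # shares no 3-mers

# Fraction of queries left without an assigned function.
queries = ["MKTAYIAKQRQISFVR", "GGGGGGGG"]
unassigned = sum(annotate(q, reference) == "unknown" for q in queries) / len(queries)
```

Production annotation pipelines rely on alignment scores and statistical significance rather than a fixed k-mer threshold; the sketch only shows the shape of the inference and why some genes remain unassigned.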
Examples of Analyses and Their R&D Challenges for GTL Science
GTL encompasses many types of data, each with algorithm research and development challenges in analyzing data for a broad range of purposes. Examples of objectives:
- Improve automated genome sequence annotation for microbes and microbial communities
  - New algorithms with improved comparative approaches to annotate organism and community sequences, identifying, for example, promoter and ribosome-binding sites, repressor and activator sites, and operon and regulon sequences
  - Protein-function inference from sequence homology, fold type, protein interactions, and expression
  - Automated linkage of gene, protein, and function catalog to phylogenetic, regulatory, structural, and metabolic relationships
- Identify peptides, proteins, and their post-translational modifications of target proteins in MS data
  - New MS identification algorithms for tandem MS
- Quantitate changes in cluster expression data from arrays or MS
  - New expression data-analysis algorithms
- Automatically identify interacting protein events in fluorescence resonance energy transfer (FRET) confocal microscopy
  - New automated processing of images and video to interpret protein localization in the cell and to achieve high-throughput analysis
- Reconstruct protein machines from 3D cryoelectron microscopy
  - New automated multi-image convolution and reconstruction algorithms
- Compare metabolite levels under different cell conditions
  - Algorithms for metabolite method analysis, both global and with spatial resolution
- Improve general R&D
  - Software engineering principles and practices developed and adopted for GTL software; modular, open source
  - Development of versions of analysis tools suitable for massively parallel processing and large-cluster computing environments
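As one small example of the kind of analysis listed above, quantitating expression changes between two conditions is often approached with log2 fold changes. The Python sketch below is a minimal illustration under stated assumptions: the gene names and intensity values are invented, and the pseudocount of 1 and the cutoff of 1.0 (a twofold change) are arbitrary choices, not values prescribed by GTL.

```python
import math


def log2_fold_changes(control, treated, pseudocount=1.0):
    """Log2 ratio of treated to control for genes present in both conditions.

    The pseudocount avoids division by zero for unexpressed genes.
    """
    return {
        gene: math.log2((treated[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in control
        if gene in treated
    }


def changed_genes(lfc, threshold=1.0):
    """Sorted names of genes whose absolute log2 fold change meets the threshold."""
    return sorted(gene for gene, value in lfc.items() if abs(value) >= threshold)


# Invented expression intensities for four hypothetical genes.
control = {"geneA": 99, "geneB": 49, "geneC": 10, "geneD": 31}
treated = {"geneA": 399, "geneB": 51, "geneC": 10, "geneD": 7}

lfc = log2_fold_changes(control, treated)
changed = changed_genes(lfc)
```

A real analysis would also normalize across samples and use replicates to test significance; this sketch covers only the ratio calculation and a simple cutoff.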




