Workshop on Data Management for Molecular and Cell Biology

Feb 2-3, 2003

 

Christian J. Stoeckert, Jr., Ph.D.

Department of Genetics and Center for Bioinformatics

University of Pennsylvania

1415 Blockley Hall

423 Guardian Dr.

Philadelphia, PA 19104

stoeckrt@pcbi.upenn.edu

 

 

Common objects: Think global, act local

 

The adage "think global, act local" exhorts individuals to change the world starting with actions within one's local community. The research challenges for data management for molecular and cell biology can be similarly addressed. Local communities have their own cultures, needs, and priorities. This is as true for communities involved in the various knowledge domains of molecular and cell biology (e.g., genomics, structural biology, molecular phylogeny, pharmacology, etc.) as it is for geopolitical communities. In the global research community encompassing the many molecular and cell biology knowledge domains, we need to communicate information in order to integrate data on a gene's chromosomal location, the structure of the protein that the gene encodes, and the small molecules that bind to that protein. Communication requires a common standard not only at the syntactic level but also at the semantic level. A top-down approach of imposing a single standard for all molecular and cell biology knowledge domains is impractical, if not impossible, to enforce. A bottom-up approach of each knowledge domain generating its own standards would take advantage of the work already done in those local communities and best utilize the expertise within those communities. Standards for protein structure have been generated through the efforts of the Research Collaboratory for Structural Bioinformatics (RCSB). Standards for assigning and describing molecular function, biological process, and cellular component have been established through the Gene Ontology (GO) Consortium. Microarray standards have been established through the efforts of the Microarray Gene Expression Data (MGED) Society. While these and other standards exist, they are certainly not available for all molecular and cell biology knowledge domains.

 

The challenges for a bottom-up approach to generating a common standard for molecular and cell biology data management are first to establish standards in the various communities or knowledge domains comprising this broadly defined area of research, and second to have these different standards work together. These challenges are related in that the standards developed at the community or local level must be compatible at the global level. To think global in this sense is to have local standards efforts (within knowledge domains) look to and learn from existing standards efforts. The existing standards efforts to consider should not be restricted to molecular and cell biology (e.g., GO) but should also include standards efforts in the media used by molecular and cell biologists, computational biologists, and bioinformaticists. Examples include the standards efforts in computer industry specifications by the Object Management Group (OMG), in bioinformatics programming by the Open Bioinformatics Foundation, and in web technologies by the World Wide Web Consortium (W3C). The challenge of building local data standards thus includes raising awareness of other efforts and their relevance.

 

If the "think global, act local" approach is taken, then ultimately data managers for molecular and cell biology will share common objects. Objects are defined here as instances of some concept (class). Common objects are ones that can be shared and understood between data systems. Despite the popularity of relational data management systems and indexed text files, most computational biology and bioinformatics systems have an object layer and are capable of generating some form of eXtensible Markup Language (XML). Objects provide an abstraction that permits heterogeneity in schema or data representation for individual data systems. Thus data in legacy systems can be mapped to common objects for exchange with other systems, as is the case for the MicroArray Gene Expression (MAGE) object model. A standard representation for objects exists in the Unified Modeling Language (UML). Standards for exchanging objects also exist, such as the Common Object Request Broker Architecture (CORBA) by the OMG and the Simple Object Access Protocol/XML Protocol (SOAP/XMLP) by the W3C. Thus the means for sharing objects exist; the challenge is to create standard ones. There is certainly overlap between different molecular and cell biology domains, and a challenge associated with encouraging the generation of domain standards is to minimize the overlap in standard objects. The Global Open Biological Ontologies (GOBO) is one effort to encourage common data format usage and minimal overlap of representation. GOBO also requires open or freely available standards, which is a practical necessity for common objects.
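
The mapping of heterogeneous local representations onto a common object can be illustrated with a minimal sketch in Python. Everything in it is a hypothetical stand-in chosen for illustration: the Gene class, its fields, the relational column names, the tab-delimited text layout, the XML element names, and the example values are assumptions, not part of MAGE, GO, or any other existing standard. Two "legacy" sources with different schemas are mapped to the same object, which is then exchanged as XML.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass(frozen=True)
class Gene:
    """Hypothetical common object: a gene and its chromosomal location."""
    symbol: str
    chromosome: str
    start: int
    end: int


def gene_from_row(row: dict) -> Gene:
    """Map a row from an assumed relational schema to the common object."""
    return Gene(
        symbol=row["gene_sym"],
        chromosome=str(row["chrom"]),
        start=int(row["start_pos"]),
        end=int(row["end_pos"]),
    )


def gene_from_flat_line(line: str) -> Gene:
    """Map an assumed tab-delimited text record, 'SYMBOL<TAB>CHR:START-END'."""
    symbol, location = line.rstrip("\n").split("\t")
    chromosome, span = location.split(":")
    start, end = span.split("-")
    return Gene(symbol, chromosome, int(start), int(end))


def gene_to_xml(gene: Gene) -> str:
    """Serialize the common object to an illustrative XML exchange format."""
    elem = ET.Element("Gene", symbol=gene.symbol)
    ET.SubElement(
        elem,
        "Location",
        chromosome=gene.chromosome,
        start=str(gene.start),
        end=str(gene.end),
    )
    return ET.tostring(elem, encoding="unicode")


def gene_from_xml(text: str) -> Gene:
    """Rebuild the common object from the XML produced by gene_to_xml."""
    elem = ET.fromstring(text)
    loc = elem.find("Location")
    return Gene(
        elem.get("symbol"),
        loc.get("chromosome"),
        int(loc.get("start")),
        int(loc.get("end")),
    )


# Two local systems describe the same (made-up) gene in different ways.
from_database = gene_from_row(
    {"gene_sym": "GENE1", "chrom": 7, "start_pos": 1000, "end_pos": 2500}
)
from_text_file = gene_from_flat_line("GENE1\t7:1000-2500\n")

# Once mapped, both are the same common object and can be exchanged as XML.
exchanged = gene_from_xml(gene_to_xml(from_database))
```

In this sketch each local system keeps its own schema and only the mapping functions need to know about it; the shared class and its XML form are the only things the systems must agree on.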

 

Abstraction of molecular and cell biology data to commonly agreed upon objects that can be exchanged through a popular language such as XML is the central point of this paper. Others may feel that the key challenges are the limitations in expressivity of the data types that can currently be exchanged. Whether new data types or new domain-specific standards in data representation are developed, it should be kept in mind that these local actions should still allow data systems to think globally (i.e., across domains with common objects).