Workshop on Data Management for Molecular and Cell Biology
Feb 2-3, 2003
Christian J. Stoeckert, Jr., Ph.D.
Department of Genetics and Center for Bioinformatics
University of Pennsylvania
1415 Blockley Hall
423 Guardian Dr.
Philadelphia, PA 19104
Common objects: Think global, act local
The adage "think global, act local" exhorts individuals to change the world
starting with actions within one's local community. The research challenges for
data management for molecular
and cell biology can be similarly addressed. Local communities have their own
cultures, needs, and priorities. This is as true for communities involved in
the various knowledge domains of molecular and cell biology (e.g., genomics,
structural biology, molecular phylogeny, pharmacology, etc.) as it is for
geopolitical communities. In the global research community encompassing the
many molecular and cell biology knowledge domains, we need to communicate
information in order to integrate data on, for example, a gene's chromosomal location, the
structure of the protein that the gene encodes, and the small molecules that
bind to that protein. Communication requires a common standard not only at the
syntactic level but also at the semantic level. A top-down approach of imposing
a single standard for all molecular and cell biology knowledge domains is
impractical, if not impossible, to enforce. A bottom-up approach of each
knowledge domain generating its own standards would take advantage of the work
already done in those local communities and best utilize the expertise within
those communities. Standards for protein structure have been generated through
the efforts of the Research Collaboratory for Structural Bioinformatics (RCSB).
Standards for assigning and describing molecular function, biological
processes, and cellular component have been established through the Gene
Ontology (GO) Consortium. Microarray standards have been established through
the efforts of the Microarray Gene Expression Data (MGED) Society. While these
and other standards exist, standards are certainly not yet available for all
molecular and cell biology knowledge domains.
The challenges for a bottom-up approach to generating a common standard for
molecular and cell biology data management are, first, to establish standards
in the various communities or knowledge domains comprising this broadly defined
area of research and, second, to have these different standards work together. These
challenges are related in that the standards that are developed at the
community or local level must be compatible at the global level. To think
global in this sense is to have local standards efforts (within knowledge
domains) look to and learn from existing standards efforts. The existing
standards efforts to consider should not be restricted to molecular and cell
biology (e.g., GO) but should also include standards efforts in the media used
by molecular and cell biologists, computational biologists, and
bioinformaticists. Examples include the standards efforts in computer industry
specifications by the Object Management Group (OMG), in bioinformatics
programming by the Open Bioinformatics Foundation, and in web technologies by
the World Wide Web Consortium (W3C). The challenge of building local data standards
thus includes raising awareness of these other efforts and their relevance.
If the "think global, act local" approach is taken, then ultimately data
managers for molecular and cell
biology will share common objects. Objects are defined here as instances of
some concept (class). Common objects are ones that can be shared and understood
between data systems. Despite the popularity of relational data management
systems and indexed text files, most computational biology and bioinformatics
systems have an object layer and are capable of generating some form of
Extensible Markup Language (XML). Objects provide an abstraction that permits
heterogeneity in schema or data representation for individual data systems.
Thus data in legacy systems can be mapped to common objects for exchange with
other systems, as is the case for the MicroArray Gene Expression (MAGE)
object model. A standard representation for objects exists in the Unified
Modeling Language (UML). Standards for exchanging objects also exist, such as the Common
Object Request Broker Architecture (CORBA) by the OMG and the Simple Object Access
Protocol/XML Protocol (SOAP/XMLP) by the W3C. Thus the means for sharing
objects exist; the challenge is to create standard ones. There is certainly
overlap between different molecular and cell biology domains, so a challenge
associated with encouraging the generation of domain standards is to minimize the
overlap in standard objects. The Global Open Biological Ontologies (GOBO) is
one effort to encourage common data format usage and minimal overlap of
representation. GOBO also requires open or freely available standards, which
are a practical necessity for common objects.
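As a minimal sketch of the mapping described above, the following Python example maps records from two differently structured legacy sources onto one shared object and exchanges that object as XML. Everything in it is an assumption made for illustration: the Gene class, its fields, the column names, the flat-file layout, the sample values, and the XML element names are hypothetical and are not taken from MAGE, GOBO, or any other published standard.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Hypothetical common object; the fields are illustrative only."""
    symbol: str
    chromosome: str
    start: int
    end: int


def gene_from_relational_row(row: dict) -> Gene:
    """Map a row from a hypothetical relational schema to the common object."""
    return Gene(
        symbol=row["gene_sym"],
        chromosome=row["chr"],
        start=int(row["start_pos"]),
        end=int(row["end_pos"]),
    )


def gene_from_flat_file_line(line: str) -> Gene:
    """Map a line of a hypothetical tab-delimited file (symbol<TAB>chr:start-end)."""
    symbol, location = line.rstrip("\n").split("\t")
    chromosome, span = location.split(":")
    start, end = span.split("-")
    return Gene(symbol=symbol, chromosome=chromosome, start=int(start), end=int(end))


def gene_to_xml(gene: Gene) -> str:
    """Serialize the common object to XML (element names are assumptions)."""
    element = ET.Element("Gene", symbol=gene.symbol)
    ET.SubElement(
        element,
        "Location",
        chromosome=gene.chromosome,
        start=str(gene.start),
        end=str(gene.end),
    )
    return ET.tostring(element, encoding="unicode")


def gene_from_xml(text: str) -> Gene:
    """Rebuild the common object from the XML produced by gene_to_xml."""
    element = ET.fromstring(text)
    location = element.find("Location")
    return Gene(
        symbol=element.get("symbol"),
        chromosome=location.get("chromosome"),
        start=int(location.get("start")),
        end=int(location.get("end")),
    )


if __name__ == "__main__":
    # Two legacy representations of the same made-up record.
    row = {"gene_sym": "GENE1", "chr": "7", "start_pos": 1000, "end_pos": 2000}
    line = "GENE1\t7:1000-2000\n"

    from_db = gene_from_relational_row(row)
    from_file = gene_from_flat_file_line(line)

    # Both sources yield the same common object, which survives an XML round trip.
    print(from_db == from_file)
    print(gene_from_xml(gene_to_xml(from_db)) == from_db)
```

In this sketch each system keeps its own schema and only the mapping functions know about it, so agreement is needed only on the shared object and its exchange format.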
Abstraction of molecular and cell biology data to commonly agreed-upon objects
that can be exchanged through
a popular language such as XML is the central point of this paper. Others may
feel that the key challenges are the limitations in expressivity of current
data types that can be exchanged. Whether new data types or new domain-specific
standards in data representation are developed, it should be kept in mind that
these local actions should still allow data systems to think globally (i.e.,
across domains with common objects).