Biological data is increasingly being shared between databases using a variety of different data formats. While some of these formats have been developed almost exclusively for communicating between viewers and databases (for example, AGAVE and GAME) and others for sharing annotation as well as viewing (e.g. DAS), many formats are also being used to exchange data between databases (e.g. EMBL, ASN.1, and specialized XML DTDs).
The appeal of a data exchange format is that it is a way of serving data in a uniform, flexible, and easily parsable form. Data exchange formats are also largely self-describing and hence easy to understand. However, agreeing to use a specific exchange format does not solve the data exchange problem by itself. Exporters must map their data into the exchange format, and importers of data must again map from the exchange format into their local format (or more generally, model). Thus data exchange is inextricably tied up with writing mappings (or transformations) between data formats.
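The two mappings involved in an exchange can be made concrete with a minimal sketch. Everything named here is hypothetical and chosen only for illustration: the exporter's columns (`exp_id`, `exp_name`), the exchange elements (`Record`, `Name`), and the importer's fields (`accession`, `title`) are not taken from any real schema or standard.

```python
import xml.etree.ElementTree as ET


def export_row(row: dict) -> str:
    """Exporter side: map a local row (hypothetical columns) into exchange XML."""
    elem = ET.Element("Record", {"identifier": str(row["exp_id"])})
    ET.SubElement(elem, "Name").text = row["exp_name"]
    return ET.tostring(elem, encoding="unicode")


def import_xml(xml_text: str) -> dict:
    """Importer side: map exchange XML into a different (hypothetical) local model."""
    elem = ET.fromstring(xml_text)
    return {
        "accession": elem.get("identifier"),  # XML attributes are always strings
        "title": elem.findtext("Name"),
    }


exported = export_row({"exp_id": 7, "exp_name": "heat shock"})
imported = import_xml(exported)
```

Even in this toy case the exporter's integer identifier arrives at the importer as a string, so agreeing on the exchange format does not by itself settle how values should be interpreted on either side.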
Several problems are associated with writing mappings between data formats. First, writing a mapping is inherently difficult: the writer must understand both how the data is represented in the exchange format and how it is represented in their own model. Second, semantic information is frequently not captured in a data exchange format. For example, information about keys, foreign keys, and constraints is often omitted. Constructing a mapping must therefore be guided by an understanding of the semantics of the data, since otherwise the mapping may cause run-time constraint violations.
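As an illustration of such a run-time violation, the following sketch loads exchange records into a toy relational schema using Python's standard `sqlite3` module. The `sample` and `assay` tables and the record layout are assumptions made for this example and do not correspond to any real database; the point is only that a record which looks valid in the exchange format can still be rejected by a foreign key the format never mentioned.

```python
import sqlite3


def import_records(records):
    """Load exchange records into a toy local schema; return the rejected ones."""
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE sample (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE assay ("
        " id INTEGER PRIMARY KEY,"
        " sample_id INTEGER NOT NULL REFERENCES sample(id))"
    )
    conn.execute("INSERT INTO sample (id, name) VALUES (1, 'liver')")

    rejected = []
    for rec in records:
        try:
            conn.execute(
                "INSERT INTO assay (id, sample_id) VALUES (?, ?)",
                (rec["id"], rec["sample_id"]),
            )
        except sqlite3.IntegrityError as exc:
            # The exchange record carried no hint that sample_id must already exist.
            rejected.append((rec, str(exc)))
    conn.close()
    return rejected


exchange_records = [
    {"id": 10, "sample_id": 1},   # refers to a sample the importer already has
    {"id": 11, "sample_id": 99},  # refers to a sample the importer has never seen
]
rejected = import_records(exchange_records)
```

Here the second record is rejected only at load time. A mapping written with knowledge of the constraint could instead import the referenced sample first, or report the problem before any data is loaded.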
As an example of these problems, we have recently been conducting an experiment involving the exchange of microarray gene expression data using a developing standard called MAGE-OM/ML. The semantics of MAGE are specified using UML modeling tools (MAGE-OM); the exchange itself, however, is effected using an XML representation of the standard (MAGE-ML). Prior to this standardization effort, a relational database called RAD had been developed at the Penn Center for Bioinformatics to store gene expression data as well as its associated sample annotation data. When the MAGE-ML standard is finalized, data will be imported from collaborators and exported from RAD using this format. However, the two data representations, RAD and MAGE, have been developed independently. In our experiment, the following problems emerged:
In performing this experiment, we have also re-affirmed the problems
of performing mappings between data sources. The MAGE standard is specified
in a 125-page document, of which roughly 86 pages are necessary for understanding
the model. There are 17 packages, each of which has between 3 and 20 classes.
The RAD schema has roughly 6 high-level divisions and 112 tables. Understanding
both of these representations and specifying the mapping on the Experiment
package of MAGE took the student working on the project several months,
and this mapping will have to be re-adjusted as both MAGE and RAD are still
evolving (a common problem in bioinformatics). Furthermore, many of the
mappings involved functions on data fields rather than simple correspondences;
for example, a string in RAD must be parsed to capture individual elements
that are mapped to different classes in MAGE. Techniques for automatically
inferring potential connections between classes in the models, together with
improved mapping techniques, would be extremely helpful.
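A mapping that applies a function to a data field, rather than a simple correspondence, might look like the following sketch. The field layout ("organism; tissue" packed into one string) and the target classes `Organism` and `Tissue` are hypothetical stand-ins; they are not actual RAD columns or MAGE classes.

```python
from dataclasses import dataclass


@dataclass
class Organism:
    name: str


@dataclass
class Tissue:
    name: str


def split_source_field(value: str):
    """Parse one packed source string into two separate target objects.

    Assumes the hypothetical layout "<organism>; <tissue>".
    """
    parts = [part.strip() for part in value.split(";")]
    if len(parts) != 2 or not all(parts):
        # Fail loudly rather than guess when the string does not fit the layout.
        raise ValueError(f"cannot split field: {value!r}")
    return Organism(parts[0]), Tissue(parts[1])


organism, tissue = split_source_field("Mus musculus; liver")
```

Because the parsing rule lives in code rather than in either schema, it is exactly the kind of mapping that must be revisited by hand whenever one of the two representations changes.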