Graph Data Management for Molecular Biology

Frank Olken
Computational Sciences Research Division
and Arkin Lab
Lawrence Berkeley National Laboratory
Workshop on Data Management for Molecular and Cell Biology
Bethesda, Maryland
Feb. 2-3, 2003


It is our view that labeled directed graph data models (simple, nested, or hypergraphs) can naturally and usefully capture a wide variety of biological data and queries. We believe that development of general purpose graph data management systems (GDMSs) could become major platforms for development of a wide variety of bioinformatics database systems spanning applications from DNA sequences, chemical structure graphs, to contact graphs and biopathways. The use of a common data model and query language across numerous biological applications is very attractive from both a development and user standpoint. Considerable research and development effort are still required to extend current systems to meet modern requirements.

Graph Datasets

There are numerous biological datasets which lend themselves naturally to graph data models.

Graph Data Models

Graph data model considered by the DBMS and bioinformatics communities vary by the class of graphs allowed: directed vs. undirected, linear, tree, DAGs, directed cyclic graphs, ... A number of authors have adopted nested graph data models to allow hierarchical biopathways, chemical structure graphs encapsulated inside chemical entity nodes, .... The Ozsoyoglu's at CWRU have adopted a hypergraph data model for biopathways, presumably due to advantages of conciseness. Note that commonly one needs to be able to specify labels on both nodes and edges of the graph data model. Often it will be useful if nodes and/or edges are typed and if the node/edge labels can be organized as taxonomies.

Graph Data Representation

There are two major approaches to the representation of graphs: edge lists and adjacency matrices. Edge list are simply lists of pairs of nodes connected by an edge. Adjacency matrices are binary matrices (for graphs) which contain a 1 for element A(i,j) if there is an edge connecting node i to node j. Edge lists are preferred for sparse graph applications. Adjacency matrices are more efficient representations for dense graphs. Most graph data managements (e.g., in the database community, etc.) employ edge list notations.

Graph Queries

In biological settings graph queries may be constructed from compositions of the "basic" query types enumerated below. Note that most work in the database community has focused on path queries, and more recently tree queries (e.g., for XML). More general graph matching has received less attention due to potential pitfalls with computational complexity. However, general graph matching, e.g., labeled subgraph isomorphism is widely used in the chemical information retrieval communities for matching chemical structure subgraphs.

Many of these graph queries are motivated by interest in comparative biology over biopathways and other graph data. Comparative biology has proven to be a fruitful approach when the comparisons concern sequence data (DNA, RNA, or protein sequences) or shape comparisons (e.g., of protein structures). We anticipate that as metabolic pathways, signaling pathways and genetic regulatory networks are determined for many organisms there will be increasing interest in a variety of comparisons of the corresponding graphs.

Open Research Questions

Practical Questions

Is there enough commonality across applications and a sufficiently large market for graph data management systems in:

for graph data managements systems to ever become a real business, rather than just an exotic boutique class of data management systems?

Further information

For further information, literature citations, more detailed explanations of queries, etc. on graph data management see our the web page: for our project, "Biopathways Graph Data Manager".

Funding Acknowledgements

This work is supported by

This page is written by: Frank Olken . Email: . Last updated: Thur. Jan 23, 2003