It is our view that labeled directed graph data models (simple, nested, or hypergraphs) can naturally and usefully capture a wide variety of biological data and queries. We believe that development of general purpose graph data management systems (GDMSs) could become major platforms for development of a wide variety of bioinformatics database systems spanning applications from DNA sequences, chemical structure graphs, to contact graphs and biopathways. The use of a common data model and query language across numerous biological applications is very attractive from both a development and user standpoint. Considerable research and development effort are still required to extend current systems to meet modern requirements.
There are numerous biological datasets which lend themselves naturally to graph data models.
Metabolic pathways: directed cyclic (multi)-graphs
Signaling pathways: directed cyclic graphs
Gene regulatory networks: directed cyclic graphs
Protein interaction networks: , undirected graphs (or hypergraphs)
Taxonomies of proteins, chemical compounds, and organisms, ...: directed acyclic graphs
Partonomies: directed acyclic graphs
Topological Adjacency Relations: directed graphs
Data Provenance Graphs: directed acyclic graphs (possibly hypergraphs)
Chemical structure graphs: a.k.a. chemical bond graphs - undirected cyclic multigraphs
Sequence data and multiple sequence alignments:. linear and directed acyclic (partial order) graphs respectively
Contact Graphs: undirected graphs to represent 3 dimensional structure of proteins
Gene Clusterings: usually directed trees, directed acyclic graphs if overlapping clusters allowed.
Bibliographic citations: directed graphs - mostly acyclic
Hypertext: directed graphs
Graph data model considered by the DBMS and bioinformatics communities vary by the class of graphs allowed: directed vs. undirected, linear, tree, DAGs, directed cyclic graphs, ... A number of authors have adopted nested graph data models to allow hierarchical biopathways, chemical structure graphs encapsulated inside chemical entity nodes, .... The Ozsoyoglu's at CWRU have adopted a hypergraph data model for biopathways, presumably due to advantages of conciseness. Note that commonly one needs to be able to specify labels on both nodes and edges of the graph data model. Often it will be useful if nodes and/or edges are typed and if the node/edge labels can be organized as taxonomies.
There are two major approaches to the representation of graphs: edge lists and adjacency matrices. Edge list are simply lists of pairs of nodes connected by an edge. Adjacency matrices are binary matrices (for graphs) which contain a 1 for element A(i,j) if there is an edge connecting node i to node j. Edge lists are preferred for sparse graph applications. Adjacency matrices are more efficient representations for dense graphs. Most graph data managements (e.g., in the database community, etc.) employ edge list notations.
In biological settings graph queries may be constructed from compositions of the "basic" query types enumerated below. Note that most work in the database community has focused on path queries, and more recently tree queries (e.g., for XML). More general graph matching has received less attention due to potential pitfalls with computational complexity. However, general graph matching, e.g., labeled subgraph isomorphism is widely used in the chemical information retrieval communities for matching chemical structure subgraphs.
Many of these graph queries are motivated by interest in comparative biology over biopathways and other graph data. Comparative biology has proven to be a fruitful approach when the comparisons concern sequence data (DNA, RNA, or protein sequences) or shape comparisons (e.g., of protein structures). We anticipate that as metabolic pathways, signaling pathways and genetic regulatory networks are determined for many organisms there will be increasing interest in a variety of comparisons of the corresponding graphs.
Is there enough commonality across applications and a sufficiently large market for graph data management systems in:
For further information, literature citations, more detailed explanations of queries, etc. on graph data management see our the web page: http://www.lbl.gov/~olken/graphdm/graphdm.htm for our project, "Biopathways Graph Data Manager".
This work is supported by
This page is written by: Frank Olken . Email: olken@lbl.gov . Last updated: Thur. Jan 23, 2003