In the review article "What Goes Around Comes Around," Michael Stonebraker and Joseph Hellerstein describe the history of data models put forward by the database research community. The authors explain how early data models such as IMS and CODASYL were complex for programmers to use, requiring record-at-a-time programming. These early models had tree-based or graph-based relationships between records, which made it difficult to alter logical schemas. The relational model was simpler in design, making it easier to implement declarative data manipulation languages and to provide data independence. Unfortunately, modern database research has drifted toward XML and semi-structured data systems, which are even more complex than early graph-based models like CODASYL.
It is important for researchers to understand the history of data models, to avoid repeating mistakes made in failed systems of the past. Researchers should be cautious about proposing complex data models, with constructs such as union types or multiple inheritance, because they are likely to make databases less understandable and harder to manage, with little benefit over relational models.
The paper's main contribution is not original research, but an overview of previous studies of new data models, featuring critical comments on why various models were successful or not. Author Michael Stonebraker in particular has made many significant research contributions, such as INGRES, an early implementation of a relational database management system, and Postgres, a successful object-relational database management system. This paper argues that simple data models are desirable because they are more understandable and because their physical or logical schema is easier to alter without upsetting higher-level views. Moreover, the authors claim that new data model proposals will succeed only if they are significantly better than the status quo in performance or flexibility, or if a major industry player backs them.
The authors point to future goals for database research that may be more useful than semi-structured data stores, considering the limited range of applications for semi-structured data. For example, the authors point out that version control and data lineage tools are lacking in the databases most businesses and scientists use today. One fault in the authors' criticism of semi-structured database research is that they claim the main uses of semi-structured data are want ads and resumes. In fact, many web sites today treat articles, photos with metadata, or web pages as semi-structured data, and use non-relational databases like MongoDB to store them.
The paper outlines data modelling proposals spanning three and a half decades, grouping them into nine eras. It begins with the inception of the research in the late 1960s, when the hierarchical model started the race. It then discusses that model's shortcomings and its competitor CODASYL (Committee on Data Systems Languages), which proposed a directed graph approach in place of the hierarchical one.
The limelight then shifts to the Relational approach, the major challenger to both and the basis of other data models over the following decades. This approach focused on a simpler data structure in contrast to CODASYL’s multi-dimensional hyperspace. It sparked “the great debate” between relational and CODASYL advocates, which lasted for a good part of the 1970s. The Relational approach prevailed for various reasons, chief among them its adoption by IBM, the leader in the marketplace during the face-off.
The Relational approach saw many enhancements and modifications over the next few years. The Entity-Relationship model was introduced, which became popular as a database design tool but failed in the database implementation field due to its high complexity. The R++ era then offered minor enhancements to the Relational approach, none of which saw widespread use.
The paper then summarizes the remaining eras, including the authors’ personal favorite, the Object-Relational era, as well as the XML data model, of which they approve less. The Object-Relational model gave users extensibility, so that one can create custom rules to suit one’s requirements. On the other hand, the XML model did not provide any significant improvement over the OR approach, earning the authors’ disapproval.
The authors provide their opinion on which model is the best. They have a cynical take on the current data model since they feel it is merely a union of all the past models. They point out its similarity to the different approaches over the years. They speculate about the survival of the XML model over the next few years and take a negative stance on its usefulness. In their opinion, the XML model is popular since it provides security options to the users. I agree with the authors that people should review the history before contributing, to avoid a “What goes around comes around” scenario. As for their critique of XML, I would say it will survive for the coming years based on its growing popularity.
This survey goes through various ideas about database methodologies. Moreover, it emphasizes that only a few data modeling ideas exist, most of which have been around a long time, and that later proposals bear a strong similarity to certain earlier proposals, which echoes the title, "What Goes Around Comes Around."
It starts off with IMS, a hierarchical data model. It consists of record types, and these record types are arranged in a tree such that each record type except the root has a unique parent record type. It facilitates a simple data manipulation language, DL/1. Each record in an IMS database has a hierarchical sequence key, which provides an order of all records. However, the tree requirement has two undesirable properties:
1. Information is repeated, which not only wastes space but also makes it easy to end up with inconsistent data
2. Tree-structured data models are restrictive, e.g. a record's existence depends on its parent
Moreover, DL/1 programming was complex, and some of the physical organizations impose restrictions on DL/1.
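The first property, repeated information, can be illustrated with a small sketch. This is not IMS or DL/1 code; it models a hypothetical hierarchy (Supplier as the root record type, Part as its child) with nested Python dictionaries, and every name and value in it is invented for illustration.

```python
# Hypothetical hierarchy: Supplier is the root record type, Part is its child.
# Each record has exactly one parent, so a part supplied by two suppliers
# has to be stored twice, once under each supplier.
suppliers = {
    "S1": {"name": "Acme", "parts": {"P1": {"color": "red", "price": 10}}},
    "S2": {"name": "Bolt Co", "parts": {"P1": {"color": "red", "price": 12}}},
}


def recolor(tree, supplier_id, part_id, color):
    """Update a single stored copy of a part, as a record-at-a-time program would."""
    tree[supplier_id]["parts"][part_id]["color"] = color


def colors_of(tree, part_id):
    """Collect every stored color for a part across all suppliers."""
    return {
        s["parts"][part_id]["color"]
        for s in tree.values()
        if part_id in s["parts"]
    }


# Updating only one copy leaves the two copies of P1 in disagreement.
recolor(suppliers, "S1", "P1", "blue")
inconsistent = len(colors_of(suppliers, "P1")) > 1
```

The same structure also hints at the second property: a part that no supplier currently supplies has no parent to hang under, so in this sketch it cannot be stored at all.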
Then CODASYL, a directed graph data model, relaxes the tree restriction of the hierarchical data model, but each set of edges depicts only a binary relation and fails to capture many-to-many relations directly. Though we can add the relation itself as a record to fulfill the requirement, the complexity of the database becomes overwhelming. Moreover, the CODASYL model is already considerably more complex than the IMS data model.
Against this backdrop, Ted Codd proposed the relational model in 1970, which focused on providing better data independence. Data is represented as tuples grouped into relations, which provides a declarative method for specifying data and queries (first-order predicate logic). A great debate followed between the relational camp and supporters of the previous models. The minicomputer revolution gave some traction to the relational camp, and the market settled the debate.
In the mid-'70s the entity-relationship data model was proposed as an alternative to the relational model. A database can be thought of as a collection of instances of entities that have attributes, with relationships between entities. It was wildly successful in database schema design, but not elsewhere in the database field. After that, a substantial collection of additions to the relational model was proposed, none of which gained substantial market traction, partly due to the increased complexity.
The XML data model is intended to deal with the structure of formatted documents. This survey doesn't provide a concrete description of how XML works because it has too many facets, but it gives us a taste of how complicated the model would be given its proponents' ambition. The debate between the XML and relational crowds bears a suspicious resemblance to the first debate. The authors claim that history repeats, and warn that "if we don't start learning something from history, we will be condemned to repeat it again."
This paper summarizes a 35-year history of data model proposals in 9 different eras and discusses the proposals in each era, including the lessons learned while exploring them. One interesting thing is that history seems to repeat itself. In each era, most proposals share the same data modeling ideas, which have been around for a long time. This also applies to the current XML era, which shows a great similarity to the CODASYL proposal that was put forward in the 1970’s and failed after the “Great Debate” due to its complexity.
The problem here is that researchers and developers are replaying history. The reason might be that some researchers did not study previous proposals well and so had a limited understanding of earlier data modeling ideas. This problem is important because repeating history won’t help the development of data modeling much, and walking a “full circle” leaves us back where we were decades ago.
The main contribution of this paper is that it alerts researchers not to repeat history by giving a detailed discussion of the proposals in each era and the lessons learned from them. It helps readers understand the history better and realize that the current XML era is replaying it. Take the current XML era for example: we started off with a complex data model, the directed graph (CODASYL), and after the “Great Debate” we found that the simpler relational model has more advantages. Several decades later, we face a “Great Debate” again, in which the relational model is compared with XML, which is even more complex than CODASYL.
An interesting observation: this paper delivers the message that we are replaying history; however, I don't think it is a “complete” repetition. At the end of the paper, the authors mention that Schema Later from the semi-structured data camp is probably a niche market. I think the market sometimes plays an important role, and it is possible that Schema Later performs so well in the marketplace that we switch to the more complex model. Also, sometimes a “good” thing might come out of repeating history. We never know what might come out if we don’t give it a try. However, we must have a good understanding of previous designs and be smart enough not to replay the “bad” things.
To avoid replaying history's mistakes, Stonebraker and Hellerstein summarize database research over the last three decades, highlighting the various successes and failures over the years.
Hierarchical (IMS): late 1960’s and 1970’s
• Tree-structure challenging for certain datasets. Can lead to:
o Repeated information
o “Corner cases”
• Path dependence and physical data dependence issues exist
Network (CODASYL): 1970’s
• The directed graph model is more flexible than the previous tree model, but also more complex.
• Can only handle binary relationships, which can lead to issues
• Poor logical and physical data independence
Relational: 1970’s and early 1980’s
• Simple data structure (tables), high level set-at-a-time DML, no physical storage proposal
• Triggered the “great debate”
• Success of VAX, non-portability of CODASYL, and IBM’s new DB/2 led to SQL winning the debate.
Entity-Relationship: 1970’s
• Box and arrow representation
• Easily mappable to relational model
• No impact, useful only as conceptual model for schemas
Extended Relational (R++): 1980’s
• Extensions to relational model
• Negligible performance improvement led to downfall
Semantic: late 1970’s and 1980’s
• Focused on classes and inheritance
• Very complex
• Negligible performance improvement, as with R++, led to downfall
Object-oriented: late 1980’s and early 1990’s
• Persistent programming (C++)
• Lacks query language and transaction management
• Tedious, difficult to program, and error prone
• Focused on good performance
• Absence of leverage, no standards, management issues, and no Esperanto led to downfall
Object-relational: late 1980’s and early 1990’s
• Handles GIS data
• Stored procedures
• UDT and UDF allowed for performance improvements
Semi-structured (XML): late 1990’s to present
• Schema later
o Semantic heterogeneity: trend is to move from schema last to schema first scenarios
o Schema evolution (exploratory and data lineage)
• Complex graph-oriented data model
After segmenting the last 30 years into nine eras, the authors see a “striking resemblance” between XML advocates and the people who fought for CODASYL years ago. Hopefully the new debate between XML and relational databases will avoid repeating history by heeding the authors’ 18 proposed lessons.
I found this paper’s explanation of the “great debate” to be the strongest and most interesting part of the paper. It was very well written, clearly outlining a theme that has been present throughout computer science: the tradeoff between user friendliness and efficiency.
The main drawback to the paper is its age. In a world where technology changes so quickly, there are always new inventions and problems to be solved. I would be interested in what the authors have to say about the major issues regarding large scalable systems.
In this paper, the authors give us a brief history of the many different data models proposed for database management systems. The goal of this whirlwind tour is to convince the reader that the current XML, or semi-structured, data model is seriously flawed.
The authors start by showing the hierarchical and directed graph systems. These systems suffered from being far too complicated to program, as well as suffering from problems with physical and logical data independence.
The authors then describe the relational model, and the "great debate" that it sparked. While the benefits of the simple data model, query optimizer, and logical+physical data independence made it the favorite of academics, IBM made it the dominant data model when it announced DB/2.
The next eras deal with extensions and additions to the relational model, most of which had little impact. For example, features such as set-valued types ended up going nowhere, because they were easily replicated with foreign keys.
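To see why a set-valued attribute is easy to replicate, here is a minimal sketch using Python's standard `sqlite3` module. The `part` and `part_color` tables and their columns are hypothetical names chosen for illustration, not taken from the paper; the "set" of colors for a part becomes one row per (part, color) pair in a second table that references the first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE part (pno INTEGER PRIMARY KEY, pname TEXT);
    -- One row per (part, color) pair stands in for a set-valued attribute;
    -- the composite primary key keeps the "set" free of duplicates.
    CREATE TABLE part_color (
        pno   INTEGER REFERENCES part(pno),
        color TEXT,
        PRIMARY KEY (pno, color)
    );
    INSERT INTO part VALUES (1, 'bolt');
    INSERT INTO part_color VALUES (1, 'red'), (1, 'green');
""")


def colors(pno):
    """Return the colors recorded for one part, in alphabetical order."""
    rows = conn.execute(
        "SELECT color FROM part_color WHERE pno = ? ORDER BY color", (pno,)
    )
    return [row[0] for row in rows]


bolt_colors = colors(1)
```

The plain relational version needs one extra table and a lookup by key, which is roughly the cost the reviewed argument weighs against adding a new construct to the data model.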
The next eras were dramatically different. The semantic model was proposed, but had little long-term influence. The OO model rose with the rise of OO programming and C++, motivated by the "impedance mismatch" between the OO in-memory structures (pointers) and the on-disk table representation in the database.
However, the next era essentially solves this problem. The advent of the Object-relational era allowed programmers to write stored procedures within the database to map their data structures to tables, and added many other new features on top of the relational model.
The authors then move on to describing the current XML and semi-structured data era. They describe how the semi-structured idea is fundamentally flawed - your "gloves" are my "latex hand protectors". The query languages devised for this semi-structured data are also very complex. I had to learn XQuery for 485, so I completely understand where the authors are coming from here. While I think the authors got the format wrong - JSON seems to be the "in" data model now - the problems with XML/XQuery are also very present with JSON. These data types are far too complex, and similar to the hierarchical and directed graph systems of the past.
The authors end the paper by showing us that every complex data model has failed, while the only data models that have succeeded since the relational model have been those that seek to extend the features of the relational database instead of reinventing them. They suggest that we stand on the shoulders of the database giants before us, instead of starting from scratch.
The problem is that many proposals resemble previous ones because some researchers lack knowledge of earlier proposals. The problem is important because a lot of time and resources are wasted replaying previous proposals. The main approach of the paper is to summarize the commonly used data models of the last 35 years for researchers to review.
The strength of the paper is that it not only lists data models, but also discusses their history and each model's features in detail. Another strength is that the paper uses a standard example across all the models mentioned, which makes the comparison and evolution of those data models clearer.
The paper has a clear problem definition and proposals, and the content presents the approach clearly, so I do not find any major drawbacks in this paper.
This paper introduces the 35-year history of data model proposals, groups them into 9 eras, and points out the lessons that can be learned from each one. The motivation for such a summary is to avoid repeating similar design failures in the future. The authors point out that physical data independence and logical data independence are highly desirable in data model design, and discuss each model in terms of these independences.
The paper discusses those data models in chronological order as follows:
1) The IMS model was basically a hierarchical data model accessed through the DL/1 language, which is record-at-a-time. It uses record types arranged in a tree as the database structure. It has two undesirable properties: redundancy and existence dependency. It limited physical data independence to avoid bad performance, but does support a level of logical data independence.
2) The CODASYL model uses a directed graph rather than a tree, which eliminates the unnatural existence dependency in IMS and provides a more flexible way to represent many-to-many relationships, but at the price of more complexity. Also, this model provides no physical data independence.
3) Ted Codd proposed the relational model in 1970, which made three major changes: store data in simple tables, access it through a set-at-a-time DML, and drop any physical storage proposal. With these it provides both kinds of independence. Another advantage is that a query optimizer can be built for this model.
4) The Entity-Relationship model views a database as a collection of independent entities described by attributes, with relationships between them. It languished as a data model in the 1970s, but it became a popular database design tool, because Peter Chen proposed a methodology for constructing an initial E-R diagram, and it is straightforward to convert an E-R diagram to tables in third normal form.
5) The R++ era is a collection of attempts to add new features to a relational DBMS, such as set-valued attributes, aggregation and inheritance, to provide easier queries and better performance for certain applications. But this trend had little long-term impact because of a lack of revenue potential.
6) In the OO era, there was a temptation to integrate DBMS functionality closely into a persistent programming language, but this would require extending the compiler for that functionality, which was a huge task across so many different languages. In the mid 1980s, with the popularity of C++, the OODB came to life. The implementation of a persistent C++ had these requirements: no need for a declarative query language or fancy transaction management, and a run-time system that had to be competitive. But the market was small and most of the vendors failed.
7) In the research on semi-structured data, there are two basic points: schema later and a complex graph-oriented data model. The first means that a schema does not need to be determined by a DBA in advance, or is easy to change. The second is the use of DTDs and XML to describe a schema. They are very complex, and the data model they present can have many characteristics at once: hierarchy, set-valued attributes, links, inheritance, etc. The authors think it is so complex that it must be simplified; otherwise it might fail.
The authors think that we have walked a full circle: XML is like CODASYL II because it is just as complex. And now we are again comparing a simpler model with a more complex one. So the authors think history will repeat and XML will have complexity problems.
This paper goes over the big trends in data models proposed from the late 1960s up to the 2000s. For each of the major data models that dominated its era, the paper gives a high-level overview of the data structure, its benefits and limitations, brief examples, and how it impacted the field at the time. The major data structures that emerged are:
1) IMS - hierarchical tree structure that required one unique parent for each record.
2) CODASYL - directed graph structure that allowed multiple parents to exist.
3) Relational - simple structure that uses a high-level DML to access the data
4) Entity-Relationship - “boxes and arrows” model that connects entities with similar attributes
5) Extended-Relational - relational model augmented with set-valued attributes, aggregation, and generalization.
6) Semantic - class-based data structure with inheritance.
7) Object-Oriented - data structure for persistent programming languages
8) Object-Relational - data structure that tries to make it easy to access data that share common attributes.
9) Semi-structured - complex XML data structure that follows many characteristics of previous data structures.
The paper compares the competition between XML and the relational data model to the debate between CODASYL and the relational model, since one side is simple while the other is more complex, and says the result will be similar to what happened previously. However, the paper seems too hasty in reaching this conclusion, as the market and the industries that require these systems have changed vastly since then.
This paper summarizes the data model proposals since the 1960s, from IMS to XML, and illustrates the defects and the worrying future of XML from a historical perspective. It points out that XML resembles the CODASYL proposal in its complexity and lack of data independence, which may cause it to fail in the future as CODASYL did.
Throughout this paper, there are roughly three main aspects that affect the fate of a data model proposal:
a) A data model sometimes wins merely because it gets support from a market elephant.
b) Large market demand is a necessity.
c) It usually becomes the dominant factor.
a) Complexity of the data model may become a handicap, since people prefer easy things unless there is a good reason (like profit) to accept it.
b) Actual performance is really significant. Take the ER model: the idea seems fancy, but it brings little performance improvement.
The data model should be harmonious with the current technical environment. That is to say, it is much better if the feature can easily be implemented in or added to existing DBs.
This paper summarized the history of data model proposals, examining the strengths and weaknesses of each proposal and describing the reasons that they succeeded or failed in the marketplace.
Relational databases have been the norm in the marketplace for a long time. The authors first describe two of their predecessors (IMS and CODASYL) and give two reasons why relational systems are technically superior:
1. The first is that the Data Manipulation Language for these databases deals with sets of records instead of individual records, which allows the database system to optimize retrieval instead of the programmer having to do so with a DML that works with single records. This abstraction is called physical data independence, and it allows the database system internals to be optimized without changing its interface to the program that depends on it.
2. The second is that in relational systems data is stored in tables, which are simpler than the tree and linked-list structures used by preceding systems. This data structure makes it easier to change data schemas without breaking programs that use the data, i.e. extending the schema on a table won’t break any SQL queries that aren’t interested in the new attributes. This is called logical data independence.
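Both kinds of independence can be sketched with Python's standard `sqlite3` module. The `emp` table, its columns, and the index name are hypothetical, chosen only for illustration. The same query text runs unchanged after a physical change (adding an index) and after a logical change (adding a column the query never mentions).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (name TEXT, dept TEXT);
    INSERT INTO emp VALUES ('Ann', 'toys'), ('Bob', 'shoes'), ('Cy', 'toys');
""")

# The application's query; its text never changes below.
QUERY = "SELECT name FROM emp WHERE dept = 'toys' ORDER BY name"


def run():
    """Execute the fixed query and return the matching names."""
    return [row[0] for row in conn.execute(QUERY)]


before = run()

# Physical change: add an index. Only the access paths available to the
# database system change; the query is untouched.
conn.execute("CREATE INDEX emp_dept ON emp (dept)")
after_index = run()

# Logical change: extend the schema. The query does not name the new
# column, so it keeps working.
conn.execute("ALTER TABLE emp ADD COLUMN salary INTEGER")
after_column = run()
```

All three runs return the same names, which is the behavior the two reasons above describe. A program written against a record-at-a-time interface would typically have to be revisited after either change.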
The authors also talk about factors in the market and the technology landscape that affected the success of each proposal. The success of relational databases is attributed less to their technical superiority (which was debated at the time) and more to the popularity of VAX computers, which were not compatible with the competing CODASYL systems, and the decision of a marketplace giant (IBM) to sell a relational system for its mainframe operating system. Since the rise of relational systems, many other proposals have tried to expand or improve on the ideas of relational databases, such as the Entity-Relationship model and Object-Oriented Databases. However, because they did not improve on performance compared to relational systems, and because the benefits of using them were slight, they did not catch on.
At the end of the paper the authors see the debate between XML data modeling and relational data modeling as similar to the previous debate between relational modeling and CODASYL. The authors warn against returning to a model as complex as XML, because of the loss of logical data independence.
The paper takes readers through the journey of data model proposals in the history of database systems, which is categorized into nine different eras, from the Hierarchical era to the Semi-structured era. The authors demonstrate that there are only a few basic data modeling ideas up until now and that most of these have actually been around for a long time. The paper aims to teach readers (i.e., current DB researchers) that later data model proposals bear a strong resemblance to earlier ones, just like “history repeats itself”, and to help them avoid replaying such history in their research.
In the paper, the authors do not simply describe each data model proposal in the nine different eras, but also provide the historical context of each era along with the links between the eras. This way, the paper gives a clear explanation of why a certain data model proposal was made at the time and why it succeeded or failed. In addition, the paper provides numerous lessons that can be learned from the history of data model proposals, things like “physical and logical data independence are very important and desirable”, “complex models will not get adopted unless they have a clear advantage over simpler ones”, “technical debates are usually settled by the elephants of the marketplace”, etc. Even though some lessons are specific to data modeling and may not apply to other research areas, many of them can still be applied to your own research and give you something to think about.
One major downside of this paper is that it seems to promote the OR (Object Relational) model unnecessarily more than the others, even when it is discussing other eras. While this is understandable given the authors’ backgrounds and their contributions to the database community, I personally think that the paper would have been much better at persuading readers with historic lessons if it did not make so many comparisons between OR and other models. It is also interesting to see that while the paper is a bit skeptical about XML DBMSs, the discussion in another paper, “One Size Fits All: An Idea Whose Time Has Come and Gone”, by one of the authors (i.e., Michael Stonebraker) is actually more neutral about XML DBMSs.
This paper both summarizes the evolution of data models over 35 years and outlines helpful tips learned from each generation of data models. In doing so, it can help new researchers to quickly look over past modeling types and review what features of those models worked and what to avoid. However, XML, the data model in current use, is remarkably similar to a data model proposed 30 years ago, which failed mainly due to being too complex. It too may meet the same dismal fate, although the idea of it being simplified seems more likely given the current progression of data models.
The paper does a nice job of presenting the flux present in the evolution of data model complexity. Initially, a model starts out simple to make it easy to use or ensure certain properties. Then, it becomes more complex as researchers realize there are certain types of data or other attributes they would like the model to be able to handle or possess. Finally, the model is refined or remade to simplify it again.
An important lesson that seems to be missing from the paper is that the evolution or use of a model depends on the type of data currently used by the community, especially in industry. The paper remarks that the "elephants" champion a model, which pushes it into mainstream use, but elephants also push the research community to refine or redo a model as the requirements for data models expand.
This paper classifies the past 35 years of data models into 9 eras, and describes the benefits, disadvantages, and market impact of each era. Stonebraker and Hellerstein argue that history repeats itself and that the current debate between XML DBMSs, a complex data model, and relational database models, a simple data model, is reminiscent of the “Great Debate” in the 1970’s between CODASYL and relational database models. The important considerations for the success of a model are logical data independence and physical data independence. XML will cause history to repeat because customers will again have issues with logical data independence and complexity. The only new concepts in the last 20 years are the integration of code in the database, from the object-relational model, and the schema last concept, from the semi-structured data model.
The first two eras are IMS, a database model that is a collection of hierarchically organized instances of record types, and CODASYL, a database model that is a directed graph with record-at-a-time data manipulation. Whereas IMS needs to keep track of just the current position and the position of a single ancestor, the CODASYL model keeps track of the last record touched, the last record of each record type touched, and the last record of each set type touched. CODASYL accepts this extra complexity in exchange for the ability to represent non-hierarchical data. Whereas each IMS database can be bulk-loaded independently from an external data source, a CODASYL database is one large network that must be loaded all at once, so loads take a long time and crash recovery is more involved. Also, records in CODASYL that belong to many sets require many disk seeks to load.
The motivation for the relational era was that IMS programmers spent too much time maintaining IMS applications when physical or logical changes occurred. The relational data model stores data in tables, uses a high-level set-at-a-time data manipulation language, and has no need for a physical storage proposal. The simple data structure resulted in logical data independence, and the high-level language resulted in physical data independence.
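The difference between record-at-a-time and set-at-a-time access can be sketched with Python's standard `sqlite3` module. The `supply` table and its values are hypothetical, and the loop below only imitates the record-at-a-time style on top of a relational engine; it is not DL/1 or CODASYL code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE supply (sno INTEGER, pno INTEGER, qty INTEGER);
    INSERT INTO supply VALUES (1, 10, 100), (1, 11, 50), (2, 10, 70);
""")

# Record-at-a-time style: the program walks the records one by one and
# decides for itself which ones matter and how to combine them.
total_loop = 0
cursor = conn.execute("SELECT sno, pno, qty FROM supply")
while True:
    record = cursor.fetchone()
    if record is None:
        break
    if record[1] == 10:  # keep only supplies of part 10
        total_loop += record[2]

# Set-at-a-time style: the program states what it wants over the whole
# table and leaves the choice of access path to the database system.
total_set = conn.execute(
    "SELECT SUM(qty) FROM supply WHERE pno = 10"
).fetchone()[0]
```

Both approaches compute the same total here. In the second, the navigation logic lives in the database system rather than in the application, so a change in physical organization does not require rewriting the program.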
The Entity-Relationship model viewed databases as a collection of entities, which are independent of other entities and have attributes and relationships. It allowed for many to many relationships. The issues with the model were getting an initial set of tables and functional dependences. It did, however, become a popular database design tool.
R++ had little long term influence. It used set valued attributes, aggregation, involving foreign keys and traverse tables, that allowed for traversing between tables without explicit joins, and generalization, which lets specializations inherit from data attributes of ancestors. It was short-lived because it had little performance improvement from the relational model.
The Semantic Data Model was similar to the R++ model in concepts of aggregation, generalization, and sets notions, but also included generalizations of the aggregation construct—allowing attributes in one class to be a set of instances of records, multiple inheritances, and collection of records groups together. The problems were that it was also complex and easy to simulate in relational models.
Object-oriented DBMSs aimed to solve the mismatch between C++ and relational languages. The problem was that they required a compiler extension for each language. The object-relational era added user-defined data types, user-defined operators, user-defined functions, and user-defined access methods. The main issue was the absence of standards.
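As a rough illustration of what a user-defined function buys you, the sketch below registers a Python function with SQLite through `sqlite3.Connection.create_function` and then calls it from SQL. This is not Postgres and not the paper's example; the `within_box` function, the `parcel` table, and its rows are hypothetical.

```python
import sqlite3


def within_box(x, y, x0, y0, x1, y1):
    """Return 1 if the point (x, y) lies inside the box (x0, y0)-(x1, y1)."""
    return int(x0 <= x <= x1 and y0 <= y <= y1)


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL statements can call it by name.
conn.create_function("within_box", 6, within_box)

conn.execute("CREATE TABLE parcel (name TEXT, x REAL, y REAL)")
conn.executemany(
    "INSERT INTO parcel VALUES (?, ?, ?)",
    [("a", 1.0, 1.0), ("b", 5.0, 5.0), ("c", 2.0, 3.0)],
)

# The user-defined function runs inside the query, next to the data.
inside = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM parcel WHERE within_box(x, y, 0, 0, 3, 3) ORDER BY name"
    )
]
```

Here the query still scans every row. Making such a two-dimensional search fast is what a user-defined access method would be for, and this sketch does not attempt that part.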
This paper discusses 35 years of data model proposals for database management systems and groups them into nine eras. The author claims that most researchers in 2005 (when the paper was written) were not around for many of these eras and are making the mistake of going back to paradigms that previous researchers have determined are not ideal for future database systems. The eras the author defines are as follows:
Hierarchical (IMS): IMS had record types that obeyed a data description. Record types were arranged in a tree, and each record type could have zero or one parent record type. An IMS database was a collection of instances of these record types. IMS repeats information across records and requires each record to have an existing parent record, which prevents the model from handling certain corner cases. IMS had limited physical and logical data independence.
Directed Graph (CODASYL): CODASYL uses a graph structure rather than a tree structure, so records can have multiple parents. The graph structure makes record types more flexible but unfortunately more complex. Loading and recovering these graphs is also harder.
Relational: In this era tables were proposed for a relational database that was independent of physical storage. With a simple data structure, this model provided better logical and physical data independence. The relational model could also represent more diverse data than previous models.
Entity-Relationship: This model does indeed look like a cleaned-up version of the CODASYL model, as suggested by the author. The competing approach, normalizing an initial set of tables for better schema design, was too difficult for programmers to understand and did not catch on.
Extended Relational: There were many extensions to the relational model that were developed in this era. They did not catch on because they did not have large functionality or performance differences from previous methods.
Semantic: In this era, people suggested that the relational model was "semantically impoverished" and could not easily express data of interest. Most of the proposed models were very complex and were not implemented.
Object Oriented: In the 1980's object-oriented languages became more popular, so databases moved toward supporting these languages. Efforts to integrate C++ with databases, or to use it as a persistent language, were not successful in the market and seemed to resemble CODASYL. Users were reluctant to switch because the benefits were not strong enough to justify changing technologies.
Object Relational: In this era systems were developed to allow user-defined data types, operators, functions, and access methods. This allowed relational databases to be used efficiently for a wider variety of applications and gave users more ability to customize data processing.
Semi-structured (XML): The author talks about four different kinds of data here:
1. rigidly structured data
2. rigidly structured data with some text fields
3. semi-structured data
4. text data
He talks about how this appears to be similar to the complex graph-based models that were researched many years ago. To me it sounds like this is not the right approach to future database systems, but I'm not sure what the right approach should be. Knowledge representation seems difficult to keep flexible while also keeping it easy to use and understand.
It seems strange that big companies have so much influence over which technologies are developed and which research paths are taken in the future, but it makes a lot of sense. This is not a way I frequently view the technologies I use in my research. The author had some comments about what would happen in the future and how semi-structured databases would take at least a decade to catch on. I would be interested to know what has happened now that it is 10 years after the paper was published.
This paper summarizes the history of data model proposals and classifies them into nine eras.
1, Hierarchical (IMS): late 1960’s and 1970’s
Data is organized into record types arranged in a tree structure. Hierarchical sequence keys are used for fetching data.
Physical and logical data independence are desirable. The tree structure has many limitations and is hard to reorganize. A record-at-a-time interface leaves query optimization to the programmer.
2, Directed graph (CODASYL): 1970’s
Set types were added to the data model, turning the earlier tree structure into a directed graph. There is no physical data independence. Many-to-many relationships are easier to represent, but loading and recovery become more complicated.
3, Relational: 1970’s and early 1980’s
Ted Codd challenged the CODASYL system with relational tables. The relational model provides data independence and a set-at-a-time DML. “The great debate” was settled by IBM, the elephant in the marketplace at the time.
4, Entity-Relationship: 1970’s
Entities have attributes, and relationships connect entities. Although normalization theory never became popular, the schema design problem did not stop ER data models from thriving.
5, Extended Relational: 1980’s
Aggregation and generalization were added as new features to relational data models. Because they did not significantly improve transaction performance, the idea was not adopted by industry.
6, Semantic: late 1970’s and 1980’s
SDM generalizes the aggregation feature and introduces so-called multiple inheritance to the data model literature. However, semantic data models are sometimes too complicated to implement. SDM was not successful in industry for the same reasons as the R++ models.
7, Object-oriented: late 1980’s and early 1990’s
Persistent languages became fashionable but soon faded away, because the problem they solved was not a major one.
8, Object-relational: late 1980’s and early 1990’s
GIS applications exposed the limits of B-tree-based DBMSs, and the object-relational data model was born. It lets users customize the DBMS to their needs.
9, Semi-structured (XML): late 1990’s to the present
“Schema later” and complex graph-oriented data models emerged, and XML-based DBMSs are an active area in the field. The schema can evolve as the data is queried. XML borrows features from the IMS, CODASYL, and SDM models.
This is a great paper that gives readers a clear thread through the history of DBMSs. Some thoughts on how to make it even better are as follows.
1, Hierarchical (IMS): late 1960’s and 1970’s
First, the takeaway lessons for IMS could add that the sequential tracing method may lead to repeated information and useless data (as the existence of a leaf node depends on its parents). Second, more industrial usage could be discussed here, e.g. IBM could be mentioned as the inventor of IMS.
2, Directed graph (CODASYL): 1970’s
The links between eras should be clarified. It would be good to present a historical thread showing how DBMSs evolved. Here, for example, the paper could mention the person or company that pushed the IMS tree forward to the CODASYL graph.
3, Relational: 1970’s and early 1980’s
Competition between big companies is mentioned here, since some changes in the history of DBMSs were caused by it. It would be better if the paper explained why IBM moved from IMS to relational databases and SQL. It would also be good to mention why SQL was chosen.
SQL examples should also be provided here.
4, Entity-Relationship: 1970’s
It is good to draw an ER diagram; however, the paper does not provide a way to translate an ER diagram into the relational tables that are actually stored on disk.
5, Extended Relational: 1980’s
It is good to see SQL code here. It would be better to set it in a format that differs from normal paragraphs. Maybe a box is a good idea.
6, Semantic: late 1970’s and 1980’s
It can be inferred from the length of this section that this idea died before making waves in the literature. Maybe another takeaway lesson could be added here: complex models rarely make it beyond papers.
7, Object-oriented: late 1980’s and early 1990’s
It says “language like C++”; are there any other languages suitable for building a data model? If so, examples should be mentioned.
8, Object-relational: late 1980’s and early 1990’s
Again, the code could be put in a box.
9, Semi-structured (XML): late 1990’s to the present
More information about the elephants in the marketplace should be provided to help readers foresee the future of data models.
There is no example in the schema evolution section; an example or some reference links should be provided. This is an interesting idea and readers would want to know more about it.
The paper surveys different data models from a historical perspective and identifies characteristics that are useful in deciding whether a particular model is going to be successful. The main data models discussed include the hierarchical, CODASYL, relational, object-relational, and semi-structured (like XML) models. One characteristic of a successful data model is physical data independence, which allows the same database to be implemented on a different, more efficient physical medium without modifying its design. The hierarchical and CODASYL data models lack this characteristic. In addition, while the hierarchical model has a very restrictive tree-based structure, the CODASYL model lacks simplicity since it uses a complex directed graph. The relational model, with its simple table structure and set-at-a-time language instead of a record-at-a-time language, offers both logical and physical data independence. This allowed the relational model to be more successful.
In addition, the authors do not hesitate to mention the shortcomings of the otherwise successful relational data model. Its use of a one-dimensional access method like the B-tree was not efficient for geographic information system applications. This motivated the object-relational model, including Postgres. The other, relatively newer data model is the semi-structured data model, which includes various XML-based models. The paper argues that their complexity will keep them from flourishing. The authors consider the XML-based model to be CODASYL version 2 and argue that the historical debate between relational and CODASYL will be repeated, with the relational model eventually winning. Among new good ideas going forward, the paper identifies code in the database, which includes stored procedures, triggers, etc., because it avoids unnecessary communication between the application and the database.
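The point about code in the database can be illustrated with a trigger. The sketch below is only an assumption-laden example using Python's `sqlite3` module; the `supply` and `supply_audit` tables and the `log_qty_change` trigger are hypothetical names, not something taken from the paper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE supply (sno INTEGER, pno INTEGER, qty INTEGER);
    CREATE TABLE supply_audit (
        sno INTEGER, pno INTEGER, old_qty INTEGER, new_qty INTEGER
    );

    -- The audit logic lives in the database, not in the application.
    CREATE TRIGGER log_qty_change AFTER UPDATE OF qty ON supply
    BEGIN
        INSERT INTO supply_audit VALUES (OLD.sno, OLD.pno, OLD.qty, NEW.qty);
    END;

    INSERT INTO supply VALUES (16, 27, 100);
""")

# The application issues a single statement; the trigger writes the audit row.
conn.execute("UPDATE supply SET qty = 250 WHERE sno = 16 AND pno = 27")
audit_rows = conn.execute("SELECT * FROM supply_audit").fetchall()
```

Without the trigger, the application would have to read the old quantity, write the new one, and insert the audit row itself, which means more messages between the application and the database for the same work.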
In addition to identifying the aforementioned characteristics of good and bad ideas in data models, the paper mentions that market players, including VAX and IBM, can affect the success or failure of a particular model. While the success of the relational model can largely be attributed to the good characteristics mentioned above, the model also benefited from the then-new minicomputer revolution, led by the VAX, during its early phase. CODASYL systems were written in IBM assembler and were not portable to VAX minicomputers.
Although detailed and insightful, most of the points made are based on historical market success stories and are not supported by research findings. The paper does not consider other factors, including the exponential growth in hardware computing power, which allows us to process and store complex databases more efficiently. For example, NVRAM-based memory allows large data sets to be accessed more efficiently than on disk. As a result, the cost of data model complexity in terms of computing and processing power might not be as bad as in disk-based systems. The paper should have included more academic research, in addition to market-oriented success stories, to predict the future success of a particular model.
Overall, the paper gives a good summary of an exhaustive list of data models. It also explains the various models and their good and bad characteristics with a single, coherent supplier/part database example.
This paper reviews the major database model designs of the past 35 years. The author finds that the current XML model is similar to CODASYL, which failed because of its complexity and lack of support for physical data independence. It is worthwhile to review the history and the different eras of database model design so that later designs can avoid the mistakes made before.
There are nine different eras, and new eras arose for different reasons. Some models failed because they had downsides that were overcome by later, better designs. For example, CODASYL and IMS did poorly on simplicity and physical data independence and were replaced by the relational database model. Some failed because they did not capture the market even though they were good ideas, like the semantic data model. Some emerged to meet new or different market requirements, like object-oriented DBMSs, many of whose users work with engineering-specific software such as CAD.
Even though some designs were defeated by others that the market chose, they may well come back in the future. I agree with the author that code-in-database is promising in that it makes database coding easier and less error-prone. More effort will probably be put into this area in the future. Interfacing the database with other programming languages and abstracting the database away is a good idea.
As we review the history of database evolution, there are many successes and failures to reflect on, and sometimes we need to revisit failed ideas that might make a difference today.
Title of the paper: What Goes Around Comes Around
Author: Michael Stonebraker
The paper mainly discusses several data models in chronological order of their appearance. It covers nine different eras, from the IMS era to the latest one, semi-structured data (XML). It discusses the relational, ER, semantic, and object-oriented models as well. The content of the paper is important because it explains the evolution of data models and draws lessons for future researchers from existing models. Since there are only a few basic data modelling ideas, and later models have strong similarities to some previous ones, looking at the history of the models provides a better and more comprehensive understanding of them. In addition, the paper summarizes lessons for readers so that they can avoid repeating the mistakes made by previous data models and build data models with fewer drawbacks.
The main approach used in the paper is to list the improvements and shortcomings of each data model compared with the ones invented before it. For instance, the paper states that the relational model provides better logical data independence and a more flexible way of representing common situations than CODASYL.
The strengths and technical contributions of the paper include the following:
It discusses not only the pros but also the cons of each data model.
Data models are introduced in order of appearance, so the paper is well organized.
It presents lessons after the discussion of each model, which helps people avoid repeating past mistakes.
Illustrations provide a clear and easy way to understand the basic ideas of the data models.
It introduces terminology used in the models, such as semantic heterogeneity, schema now, and schema later.
The models are discussed not only academically but also practically. The paper indicates the strengths and weaknesses of some models in practice (i.e. in business).
The main drawback of the paper is that it gives a broad view of all data models instead of going deep and thoroughly explaining one specific popular data model, such as the relational model, the ER model, or XML.
This paper is about the data models proposed over the past 35 years. By summarizing past data models, the paper mainly addresses two issues. The first concerns researchers. Most current researchers did not participate in many of the previous eras and are proposing data models with the obvious flaws of models from those eras. By presenting lessons learned in the past, the authors hope to help future researchers avoid those problems. The second concerns the data models themselves. The paper argues that the current XML data model resembles the CODASYL proposal. By presenting past weaknesses of data models, such as complexity, the paper explains why the current XML data model is problematic and predicts that it will fail.
This paper gives a very good summary of all the past eras. For each era, models are presented and examined, then their flaws are discussed and the lessons learned in that era are summarized. The forces that drove transitions between eras, such as technical problems and elephants in the market, are also studied.
The best contribution of this paper is that it not only provides a set of rules to follow when designing future models but also gives the reasons and historical facts behind why these rules need to be followed.
Another good thing about the paper is that it makes clear to the reader that competition among data models is not only about technical issues; economic factors, such as performance and the influence of big companies, also play important roles. To design a good model for the future, one must come up with something that has either a big performance advantage or a big functionality advantage.
Also, after reviewing the concepts of the past several decades, the paper points out that the only novel and promising concept now is code in the database, which gives a direction for future research.
This paper is well written and fulfills the purpose it set out at the beginning.
This paper groups the data model proposals into nine eras. The first, IMS (the hierarchical data model), is based on the language DL/1 and arranges its record types in a tree. Every record has a hierarchical sequence key (HSK), which is used by the commands get_next and get_next_within_parent. The tree structure introduces two limitations: information is repeated, and a record's existence depends on its parent. It is also expensive for programmers to construct efficient queries in a record-at-a-time language. The next model, CODASYL, is more flexible but more complex. It introduces multiple parents, arcs, 1-to-N relationships, and so on. But it still does not provide physical data independence, and it even weakens the logical data independence available in the IMS era. The important step in the relational model is that it has both logical and physical data independence. It avoids information loss and makes a simple data model easy to construct. In this era the DML is a set-at-a-time language. Building on the complex model created in the CODASYL era, the Entity-Relationship era reduces relationships to boxes and arrows. The R++ era was driven by some industry application areas but was not noticeably faster than the old model. The semantic data model era also had little influence on database development. The OO era first dealt with the impedance mismatch but failed because of the state of programming languages at the time. OODBs focused on engineering databases. The object-relational era introduced user-defined types and functions, which made it more useful in business data processing. With the rapid development of OODBs, database languages moved into the semi-structured data era. Because the classes of information can be user-defined, queries face a big challenge without structure. This era introduces “schema later”, in contrast to rigidly structured data, and some “schema evolution”. XML inherits some features from earlier, failed eras, and it leads to structures such as JSON or B-trees.
The author distills 18 lessons to warn us not to repeat history. It is a great summary of database history and a good paper about past mistakes.
However, the paper as a whole says that databases develop according to the market and not just the technology. If the market decides where our technology is going, it would be better to draw the limitations at the technical level only, warning us not to repeat history with complex and unclear structures.
This paper describes the evolution of data models for databases over a span of 35 years. It is relevant in that it gives the current generation access to the issues and discussions of that period. It also explains how thinking changed about what is expected of, and what is achievable with, a given data model.
There are a few points the authors emphasize as important for a data model, and these still hold today:
1. Ease of navigation through the database
2. Complexity of language
3. Complexity of data access
4. Complexity of data storage and update
5. Importance of physical data independence and logical data independence
6. Importance of compliance with current major standards
It was interesting to see how two data models that seemed to have been created for specific purposes fared in different ways. OODBs, which were created mainly for engineering CAD, failed because they did not comply with relevant standards and did not address the issues that mattered to these programmers. Postgres (from the OR era), on the other hand, implemented mechanisms for UDTs, UDFs, and user-defined access methods. It improved data access performance, and it did so not by changing aggregation and generalization, as previous data models had, which gave it a huge edge.
The paper mentions that there are no languages that support DBMS functionality; however, nowadays we do have frameworks that attempt to, e.g. LINQ to SQL in .NET.
The paper also mentions the lack of easy schema evolution in existing database products, and it should be noted that, with the recent rise in focus on social networking websites and real-time data, non-relational databases (e.g. NoSQL) are being considered and used.
Big data work currently deals with semantic heterogeneity, and there seems to be a lot more hope of finding a way to make sense of diverse databases than the tone taken by the paper at the end suggests.
This paper presents a full history of database research over the past 35 years, divided into nine eras. Each era is introduced with its academic and commercial background, which is used as supporting evidence for the author’s comments. A short summary of the nine eras would be:
Hierarchical era, represented by IMS. The tree structure and record-at-a-time operation result in drawbacks like data duplication and existence dependence. Certain limitations also reduce physical data independence and force programmers to implement query optimization manually.
Directed graph era, represented by CODASYL. A more flexible structure means more complexity. It lost more data independence and increased the programming workload for data manipulation.
Relational era, inspired by Ted Codd’s relational model. His proposal of a simple logical data model, a high-level set-at-a-time access language, and complete physical data independence led to the ‘great debate’. Opponents had doubts about its efficiency and feasibility, which were soon answered by the success of relational systems on the VAX and the launch of DB/2 by IBM.
Entity-relationship era, which recasts the relational model as entities and their relationships. Instead of being an alternative, it serves only as a conceptual model that assists relational schema design.
R++ era, represented by Gem. It made various small improvements to the original relational model for certain corner cases, epitomized by the proposals of set-valued attributes, tuple references, and inheritance. But all of these turned out to be just an easier query interface rather than improvements in transaction performance or scalability.
Semantic data model era, represented by SDM, which emphasized classes and multiple inheritance; however, most semantic models are too complex. Since the feasible ones can easily be mapped back to the relational model, the market was small.
OODB era, which attempted to make C++ persistent. It provided in-process persistent objects, which were throwbacks to the earlier eras. It is easier to manipulate but lacks standards and cross-platform support.
Object-relational era, inspired by GIS, which treats code as data too and provides UDTs, UDFs, and user-defined access methods to improve performance. Support from industry giants helped it grow, but it still lacks standards.
Semi-structured data era, represented by XML and XQuery. The schema-last model has a small market and does not help solve the semantic heterogeneity problem. XML is a model that carries almost all the features of previous ones: it can be hierarchical and networked, and it supports set-valued attributes and inheritance. The future of XML is to fail, to be simplified, or to boom despite its known shortcomings and cause another ‘great debate’.
One thing that can be learned from the lessons the author summarizes for the DBMS industry is that, if a new data model is to have great market influence, it has to either challenge the relational model's current market or avoid the competition entirely. XML seems to be in the position of taking a new market, while object-relational models themselves fall within the category of relational models. So I would say that, in the future, NoSQL databases like MongoDB, DynamoDB, and DocumentDB will have great potential in the market, since the relational model is not their direct competitor. And the growing popularity of the map-reduce method should also help these database products conquer the market.
This paper provides a summary of the data model proposals of the last 35 years (grouped into 9 eras). It presents how data models developed and the advantages and disadvantages of each proposal. It is good to learn lessons from past data model proposals to avoid repeating the same disadvantages in future proposals.
It presents the data model proposals in nine historical epochs:
1) IMS: A hierarchical data model with a tree structure of record types. It has restrictions such as information duplication, poor physical data independence, and complexity.
2) CODASYL: Somewhat similar to current XML, it organizes record types into a directed graph, which is a collection of named record types and named set types. It removes some restrictions of IMS, such as existence dependence, but still has limitations such as greater complexity and bulk loading.
3) Relational: Ted Codd proposed the relational model, which stores data in tables and uses a high-level (set-at-a-time) language to access data without specifying how the data is stored. The query optimizer for the high-level language saved DB programmers from doing the optimization themselves.
4) Entity-Relationship: A database is a collection of entities with attributes, and there are relationships between different entities. But a DBA needs to know the functional dependencies before doing normalization.
5) R++: Introduced "set-valued attributes", "aggregation", and "inheritance", but it made no big impact on the commercial world.
6) Semantic: Focuses on classes. It has the same problems as R++ and had little long-term influence.
7) OO: Integrates DBMS functionality into an object-oriented programming language. This proposal failed in the market mainly because only C++ was supported.
8) Object-Relational: Systems like "Postgres" allow users to customize the DBMS (types, functions, access methods) to their needs and to put code, such as stored procedures, in the database.
9) XML: two basic points:
schema later: the schema of the data is not needed before loading records, or the schema is fluid. It is sometimes not appropriate outside the third class of applications (semi-structured data).
complex graph-oriented data model: has features such as hierarchy, inheritance, union types, etc.
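The "schema later" point can be made concrete with a small sketch: records are loaded first, and a schema is only worked out afterwards from whatever fields turn up. The sample records and the `infer_schema` helper below are hypothetical and are only meant to illustrate the idea.

```python
import json

# Records arrive without any schema declared in advance.
raw_records = [
    '{"name": "Joe Jones", "wages": 14.75}',
    '{"name": "Ann Lee", "salary": 55000, "dept": "shoe"}',
]
records = [json.loads(line) for line in raw_records]


def infer_schema(records):
    """Collect, for each field seen so far, the value types observed for it."""
    schema = {}
    for record in records:
        for field_name, value in record.items():
            schema.setdefault(field_name, set()).add(type(value).__name__)
    return schema


schema = infer_schema(records)
```

Loading works without any up-front design, but the inferred schema now has both a `wages` and a `salary` field, and nothing in the data says whether they mean the same thing. That is the semantic heterogeneity problem, which schema later by itself does not solve.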
In conclusion, the development of data models has come full circle, and the current proposal is a superset of previous proposals. The author notes that history is repeating itself, and it seems that a better proposal is one that trades off different aspects of performance. So I think there are two directions for data model development. One is to consider all the properties of a model proposal and aim for the best overall performance. The other is to develop a proposal that excels in some specific cases but may be very bad in others. (I am not sure about this, because it will run into problems with standards.)
This paper proposes the idea that we are repeating history by focusing on the XML schema as the data model for databases. Back in the 1970’s, “the great debate” between CODASYL and relational systems ended with relational systems coming out on top for several reasons:
1. the lack of physical data independence with CODASYL
2. query optimizers for relational systems can outperform most of the record-at-a-time application programmers.
3. the biggest companies in the industry, IBM at the time, decide which technology wins
After relational database systems, the field saw many new innovative ideas such as R++ and object-oriented databases. However, the advantages that these new models bring to the table did not outweigh the cost of moving from the traditional relational database system. Even though there were several breakthroughs, we learned that “unless there is a big performance or functionality advantage, new constructs will go nowhere”.
The paper did a great job of illustrating why each new data model succeeded or failed by describing the technical advantages of the new model as well as the industry’s needs at the time. It thoroughly explained why the relational model has been the main database model until now and related XMLSchema with CODASYL. There are many key concepts that make XMLSchema very similar to CODASYL such as hierarchy and the ability to have links to other records. The paper also highlights the new features of XMLSchema that may be too complex to implement. However I believe that there are a few loopholes with the argument:
1. The industry has evolved 40 years since “the great debate”. We have more computing power and resources compared to 40 years ago, so the setbacks that CODASYL faced may not be problems for XMLSchema. For example, one reason that C/C++ came about was because we needed the efficiency due to limited computation power. Today, languages like Java are more appealing to programmers because of the simplicity even though it is not as efficient.
2. The paper did not explain the differences between CODASYL and XMLSchema that may set XMLSchema apart and help XMLSchema overcome the challenges CODASYL faced.
This paper provides a concise but fairly complete summary of the development of data models and query languages throughout several eras. The paper emphasizes a need to learn from history and be aware of how database technology has developed in order to avoid repeating the mistakes of the past. The paper offers the current movement toward XMLSchema and XQuery as an example of why this topic is important, showing how these systems bear striking similarities to past systems such as CODASYL that have failed because of their complexity and inability to easily represent common data relationships.
The authors list several lessons that can be gleaned from each era, but they make a special point of enforcing a few crucial concepts throughout the body of the paper. These are:
1. Databases and query languages should be kept as simple as possible. While complex systems can be powerful, they rarely gain widespread acceptance, since programmers and DBAs are unwilling to trade moderate improvements in functionality for severe losses in usability.
2. Physical and logical independence need to be preserved. A database system should provide flexibility both in the way data is stored on disk and in the way records and data elements are related within the database. Business needs and optimal performance tuning are both subject to change, and a database should be able to accommodate those changes.
3. In order to gain acceptance and market share, new database technologies need to offer significant increases in efficiency, ease of data representation, or flexibility. Moving to a new system is both risky and costly, as the new system is unproven, and integration generally requires changing existing code and/or schemas.
This paper does an excellent job of describing the strengths and weaknesses of each database system, judging them primarily on the criteria listed above. What I found very interesting - and what many programmers and engineers often forget - was how heavily the business side of the industry drives the development and prevalence of these systems. The explanation of how programming language experts ignored or refused to work on integrating database functionality into common languages such as C++ was a telling example of how the cleanest and most efficient solution is not always the one that wins out.
One issue I had with the paper was its casual dismissal of the progress made in machine learning over the last few decades. As text mining, natural language processing, and other similar technologies continue to make significant strides, it seems as though a simplified version of XMLSchema and XQuery could be powerful tools for database management. The flexibility and widespread use of XML could expand the potential application of these systems far beyond the realm of semi-structured data.
This paper summarizes data models since the late 1960s, dividing 35 years into 9 different eras. The importance of this paper is that when we design data models, we can learn lessons from the proposals of each previous era. By learning from history, we can avoid repeating the same problems as before.|
In the following, I summarize the data models of the 9 eras in chronological order, including some advantages and disadvantages mentioned in the paper.
(1) IMS: It is tree-structured and has a record-at-a-time DML. Its main problems are repeated information and the dependence of a record's existence on its parent.
(2) CODASYL: It is a directed-graph data model with a record-at-a-time DML. It is limited in expressing three-way relationships, and its loading and recovery are more complex.
(3) Relational: It is table-structured and has a set-at-a-time DML. It is more flexible than IMS and CODASYL in expressing relationships.
(4) Entity-Relationship: It is popular as a database schema design tool. In contrast, functional dependencies are too difficult for real-world DBAs to understand.
(5) Extended Relational: It adds extensions to the relational model, such as set-valued attributes, aggregation, and generalization. However, it offered little performance improvement.
(6) Semantic: It focuses on classes and allows multiple inheritance. Most semantic data models were complex and were generally paper proposals.
(7) Object-Oriented: It integrates DBMS functionality more closely into a programming language, as in persistent C++. It does not need a declarative query language or transaction management.
(8) Object-relational: It adds user-defined data types, operators, functions, and access methods. It can also include stored procedures in the DBMS.
(9) Semi-structured (XML): The two features of XML are schema later and a complex graph-oriented data model. In a schema-later system, data instances must be self-describing, and the schema can evolve over time.
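To illustrate what "self-describing" means in item (9), here is a minimal Python sketch. The records, field names, and the helper `observed_schema` are hypothetical and are not taken from the paper; the point is only that each instance carries its own attribute names, so a schema can be derived after the data arrives.

```python
import json

# Hypothetical self-describing records: each instance names its own fields.
# The second record adds a "hobbies" field that no schema declared up front.
records = [
    json.loads('{"name": "Joe", "salary": 10000}'),
    json.loads('{"name": "Sam", "salary": 12000, "hobbies": ["chess"]}'),
]

def observed_schema(recs):
    """Derive a schema after the fact: field name -> set of observed type names."""
    schema = {}
    for rec in recs:
        for key, value in rec.items():
            schema.setdefault(key, set()).add(type(value).__name__)
    return schema

schema = observed_schema(records)
```

In a schema-first system, the new "hobbies" attribute would instead require changing the declared schema before the second record could be stored.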
The general lessons learned from history:
(1) Physical and logical data independence are desirable.
(2) A set-at-a-time language is preferable to a record-at-a-time language.
(3) New constructs need to bring a big performance or functionality advantage.
(4) Packages will not sell to users unless they are in major pain.
(5) Schema later is a niche market.
To sum up, this paper emphasizes the importance of learning from history. We can avoid repeating problems if we know more about the pros and cons of the data models in each era.
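To make lesson (2) concrete, here is a minimal sketch contrasting the two styles with Python's built-in `sqlite3` module. The tables, rows, and function names are hypothetical, loosely modelled on a supplier/parts scenario; the record-at-a-time function only imitates navigational access (it still uses SQL to fetch rows), so treat it as an illustration rather than a faithful IMS or CODASYL program.

```python
import sqlite3

# Hypothetical schema and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE supplier (sno INTEGER PRIMARY KEY, sname TEXT);
    CREATE TABLE supply   (sno INTEGER, pno INTEGER, qty INTEGER);
    INSERT INTO supplier VALUES (1, 'General Supply'), (2, 'Acme');
    INSERT INTO supply   VALUES (1, 10, 100), (1, 11, 5), (2, 10, 40);
""")

def record_at_a_time(conn, pno):
    """The program walks records itself and hard-codes the access path."""
    names = []
    suppliers = conn.execute("SELECT sno, sname FROM supplier").fetchall()
    for sno, sname in suppliers:              # visit each parent record
        children = conn.execute(
            "SELECT pno FROM supply WHERE sno = ?", (sno,)
        ).fetchall()
        for (supplied_pno,) in children:      # visit each child record
            if supplied_pno == pno:
                names.append(sname)
                break
    return sorted(names)

def set_at_a_time(conn, pno):
    """One declarative statement; the DBMS chooses how to evaluate it."""
    rows = conn.execute(
        """SELECT DISTINCT s.sname
           FROM supplier s JOIN supply y ON s.sno = y.sno
           WHERE y.pno = ?
           ORDER BY s.sname""",
        (pno,),
    )
    return [name for (name,) in rows]
```

Both functions return the same supplier names, but only the second leaves the choice of access path to the system, which is what allows a query optimizer and physical data independence.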
This paper summarizes the evolution of databases and discusses the strengths and downsides of each of the 9 major eras or movements. It also warns that we may have come full circle and are dangerously close to repeating the mistakes we made a quarter of a century ago, and it urges us to learn from the past so as not to repeat it.|
The first major era was the IMS Era, which had its successes but suffered from heavy data duplication that could lead to many problems, and its querying was record-at-a-time. An improvement on this was the CODASYL Era, which removed much of the data duplication that IMS required, but was very complex and provided essentially no physical data independence. This led to “The Great Debate” between the record-at-a-time, complex CODASYL style and the new Relational Era. The Great Debate brought out positive modifications to both models as each side tried to show why it was superior, and eventually the relational model won. The relational model used set-at-a-time languages, which were good and provided a great deal of data independence, as well as making it simpler to query and optimize; but really it won because elephants in the marketplace adopted it for reasons not really related to the technology itself.
Since the relational model won the great debate there have been many proposed add-ons to the relational model, but none have particularly gained traction. The Entity-Relationship Era didn’t gain traction because it was too complicated to understand for a majority of users, the R++ Era didn’t gain traction even though it was a good idea because the performance benefits were not great enough to learn the new technologies, and the Semantic Data Model Era and Object Oriented Era had the same flaws as the R++ Era.
Currently there is a Semi-Structured Data movement whose ideas seem good but are very complex and eerily similar to mistakes made in the CODASYL Era. This paper strongly warns that this Era is coming full circle, bringing us back to where we started half a century ago and reintroducing the complexities we wanted to avoid in the initial great debate.
It was a very informative paper and one that I definitely recommend all people interested in databases read, as it’s important to know about the history of the field and to avoid making the same mistakes.
Fun fact: There are two “lesson 14”’s in this paper (no one is immune to mistakes I guess).
This paper gave a broad overview of the evolution and cyclical nature of past proposals for data models. While providing no contributions that are necessarily technical or practical for the purpose of implementation, there is a wealth of insight regarding the advantages/drawbacks of proposed models in comparison to the models preceding them.|
One of the most appealing themes of this paper was a technical description of each model proposed, in order to highlight the differences in styles with specific examples. For example, the description of IMS as a tree-based hierarchy was shown to be clearly limited in its capacity for data independence compared to later models, but it was also clear that the most immediate successor, CODASYL, was much more complicated in its directed-graph structure. Nevertheless, the authors could have simply described the market-dominant players in data model proposals and addressed the evolution from IMS to relational models and the newer controversy between XML and relational data advocates. Instead, they went on to describe many of the other data model proposals (R++, semantic, persistent programming) that seemed to have value, and posited explanations regarding why they failed. Both technical and real-world influences (i.e. programming-level and market influences) were also discussed, giving a more well-rounded insight into exactly how much of a benefit certain models held over others, and in what areas such that they succeeded/failed to varying degrees.
Upon reading the end of the paper, however, it seems a bias begins to emerge and the tone shifts from objective to opinionated on the more recent evolution of data models (e.g. “we have no quarrel with these standards”, pp. 35). In the same vein, the argument that everything comes full circle and repeats itself seems forced. For example, while the debate between XML and relations is similar to the first “great debate,” there is quite a different context which the authors do not even attempt to address: the presence of every previous proposal and debates over each model. While the market forces may act in the same way and cause different standards to emerge, there is also the possibility that both standards are accepted among different groups of entities that might have a need for different types of models, and this may lead to increased improvement over multiple areas rather than improvement being considered over the broad field of “data models” in general (e.g. very efficient and simple xml- or schema-last- based models might emerge in one branch, and relational- or schema-first- emerging in the other).
A common theme held throughout the paper was the emphasis of market forces on the adoption of certain models over others. While there is a sea of academic research papers that constantly improve upon existing models or implement entirely new methodologies, the ability to “sell” a model seems most crucial overall. Moreover, large companies that have built their enterprise on legacy systems are less likely to adopt an entirely new system (even if it is leagues better) in order to avoid the cost and time associated with migrating data management systems (e.g. IBM’s Project Eagle, an attempt to move to a relational interface system). However, it is my belief that with the recent burgeoning of start-up culture and the plethora of new companies vying for many different technologies, there may be a shift in data model usage. For example, if a smaller company decides to adopt a specific data model early on, and it becomes even moderately successful, this would likely garner attention from the technical community and cause further discussion about why the standards are standards, and about the many different alternatives to widely accepted technologies.
In this paper, Stonebraker and Hellerstein try to summarize the types of data models that have been proposed in the last 35 years. Starting from the late 1960s, the paper divides the period into 9 eras, based on the characteristics of the proposed data models. It shows us how the limitations of the existing data model in each era inspired the birth of a new data model; some of them have lasted until now (with slight modifications or additional functionality from later eras) and some of them have not. One thing that caught my attention is that there seems to be a constant battle between programmers and data models, especially after the birth of SQL. Because SQL has its own language structure, programmers feel the need for a persistent language that enables them to modify data straight from their code. I like that the OR data model comes out as an answer by merging code and data at the same level.|
If anything, this paper has three main contributions. First, it serves as a guide to understanding the desirable attributes of a model and how to find the most suitable and practical data model while trying to even out the good and the bad. By showing the “evolution” of data models from the early days in the 1960’s until today (or around 2005), it shows there are always trade-offs between physical/logical independence on the one hand and flexibility and ease of transfer from the real-world model to the data model on the other. Second, this paper reminds us that market niche plays a significant role in determining whether a data model lasts or not. Third, this paper also tries to “warn” us about the current tendency to favor semi-structured data for its less restrictive data input format. True, the semi-structured data model is more flexible concerning data type and data exchange, but then we would have to sacrifice the logical data independence aspect, which could bring us back to the CODAS!
While reading this, I think the writers want to point out that the relational model is the simplest and most reliable. The R++ model, SDM, and even OR could be considered additional functionality for the relational model rather than independent models in their own right. However, I wonder if the relational model has reached its saturation point. Is it possible for XML, while it may not be as reliable when standing on its own, to be embedded in the relational model, considering its flexibility and ease of transfer?
The main goal of this paper is to walk the reader through the main data models that the database community has considered over the past 45 years or so. The authors state that this generation needs to learn from these historical data models, because the latest models are starting to repeat some of the (not so positive) characteristics of early models that are no longer widely used for those reasons.|
This problem is important because repeating history is the exact opposite of doing novel research. Most literature reviews today are conducted by searching for relevant terms or citations on Google Scholar, and not all of these historical works (although seminal in their subfields) are available online. I believe that this is a contributing factor to history repeating itself with the current generation of researchers. A survey paper such as this one provides a useful, but brief, background on the data models that we should know about as we learn about and investigate the most popular data models used today.
The authors step through the various models and point out both their advantages and drawbacks effectively and each section is ended with a presentation of “Lessons learned” from this database model (or unsuccessful attempt at an introduction of a new model). I think a main contribution of this paper is in these “Lessons” that are presented. Even separating these from the main paper, the reader is left with 18 concise lessons that the reader should keep in mind through future work with databases. These lessons unforgivingly point out the failures of previous models, even if those failures might be based on the behavior of “elephant companies” in industry or the popularity of specific hardware.
If I had to pick a couple drawbacks of this paper, my first quibble is with the formatting. The paper enumerates several lists within the sections, but there is no corresponding numbering or formatting difference to emphasize these enumerations. One other drawback is that the paper does present a fair number of points that are the opinions of the authors. I do, however, think this is inevitable in a history-based paper such as this one that attempts to reason about why specific models failed at the time that they did, and I believe that the authors do a respectable job of pointing out that they are speaking of their opinions in the places it is most discernible.
This paper summarizes 35 years of development of database management systems (1970 - 2005, based on its publication year) in 9 eras (models) and 18 lessons. It first introduces the hierarchical system IMS in the 70s, then continues with the graph system CODASYL, followed by the great debate between RDBMS and the previous two. After the great debate, the paper introduces the ER data model and some data design techniques, followed by a summary of R++ models, namely a variety of attempts at applying RDBMS to different areas. After that, it summarizes the semantic data model, OO models, and then the Object-Relational model. Finally, it comes to the early 2000s, summarizing semi-structured data and foretelling its future.|
Each data model is introduced through the following aspects: its background, its schema, some examples illustrating the advantages of the model over its predecessors, and its limitations. The models are presented in this way to show that DBMS research is a cycle and that lessons should be learned from history to prevent mistakes from happening again. Those lessons can be summarized as follows:
1. A successful DBMS should provide ease of use along with rich functionality for its clients. This can be achieved by
   1. Using high-level data abstraction (physical and logical independence) with a simple data structure (tables rather than trees or graphs)
   2. A rich and easy-to-learn data manipulation language (SQL) and the flexibility to add new operations or data types (OR)
   lessons: 1, 2, 3, 4, 7, 8, 11, 12, 14.1, 18 (barely)
2. A successful DBMS should have promising performance to fulfill the expectations of its clients, as well as to sweep away its predecessors. This can be achieved by
   1. A sophisticated query optimizer to optimize its planning
   2. Some methods to optimize its runtime (stored procedures or UDTs in OR)
3. A successful DBMS should aim at a large marketplace or be supported by a giant industrial company or tech community, to leverage its impact and gain support for its continued research and optimization
   lessons: 9, 13, 14.2, 15, 16
4. DBMS research and industry form a cycle; many of the techniques and ideas created later can find their ancestors in earlier work
   lessons: 5, 6, 17
1. This paper provides an extensive amount of detail on the historical development of DBMSs, as well as a rich amount of first-hand experience of the historical events, which makes the narration vivid and interesting.
2. This paper successfully makes its point by providing a rigorous illustration and comparison of the different types of DBMS, with a detailed analysis of the success or failure of each.
3. This paper uses a consistent example for the first three models, which helps a lot in understanding the old models.
4. This paper teaches several great lessons on how to prevent certain failures in the future development of DBMSs.
1. Though the paper successfully foresaw several future trends in DBMS research (especially the data lineage idea, which was later used by Spark), it fails to predict the great success of semi-structured data, especially the schema-later model, which is now used by Hive and SparkSQL and is the most important foundation of the big data era. However, this is understandable given the limitations of the time.
2. Though the paper is very rigorous on its illustration, there is one mistake (typo) I noticed in this paper: on page 26, last line:
“Where Xmax > X0 and Ymax > Y0 and Xmin < X1 and Ymax < Y1”
I believe it should be Ymin < Y1 for the last condition.
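A small Python sketch can show why the last condition matters. It assumes that (X0, Y0) and (X1, Y1) are the lower-left and upper-right corners of the query rectangle and that (Xmin, Ymin, Xmax, Ymax) is the bounding box of a stored object; this reading, the `Box` type, and the function names are my own assumptions for illustration.

```python
from typing import NamedTuple

class Box(NamedTuple):
    """Hypothetical bounding box of a stored object."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

def overlaps_as_printed(b, x0, y0, x1, y1):
    # Predicate exactly as quoted above, ending in "Ymax < Y1".
    return b.xmax > x0 and b.ymax > y0 and b.xmin < x1 and b.ymax < y1

def overlaps_corrected(b, x0, y0, x1, y1):
    # Same predicate with the suggested fix: "Ymin < Y1".
    return b.xmax > x0 and b.ymax > y0 and b.xmin < x1 and b.ymin < y1

# A tall box that crosses the query window from below to above it.
tall = Box(xmin=3, ymin=0, xmax=4, ymax=10)
query = (2, 2, 5, 5)  # x0, y0, x1, y1

printed_result = overlaps_as_printed(tall, *query)
corrected_result = overlaps_corrected(tall, *query)
```

Under these assumptions, the printed predicate rejects the tall box even though it clearly intersects the query window, while the corrected predicate accepts it.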
This paper reviews the history of data models from the late 1960’s to the early 2000’s. During this period a number of data models were proposed, namely Hierarchical (IMS), Directed graph, Relational, Entity-Relationship, Extended Relational, Semantic, Object-oriented, Object-relational, and Semi-structured (XML). Among those technologies, only some were widely adopted by industry and had long-term impact. The authors go over each one of them and present both the technical advantages and the context of the time when they were proposed, thus revealing the reasons behind their impact.|
Obviously, a reader of this paper gets an overview of the aforementioned data models and learns the key differences between them, as the authors present them in chronological order and carefully explain the motivation behind each wave of technology. There is another take-away from this paper, even more important in my opinion: the success of a technology can be determined by quite a few factors other than its technological advantage, such as its complexity, whether it is easy for people to understand, the high-level strategy of giant companies towards it, and the scale of its market. By learning the relation between these factors and the fate of the technologies, we can avoid making the same mistakes as those in history.
At the end of this paper, the authors group the innovations of these data models into two categories, code in the database and schema last. They are not optimistic about schema last.
This paper was written a decade ago, so it is quite interesting to evaluate the authors’ predictions. It looks like they did not foresee the incoming trend of NoSQL databases, but history is similar here, as the main draw of NoSQL is also performance, and the paper did cover this factor when presenting the past.