This paper aims to provide a guideline for researchers in the field of data modeling. It summarizes three decades of data model proposals, discusses the challenges and proposals of different eras, and presents the lessons learned from each. The paper covers proposals from the hierarchical (IMS) model that originated in the 1960's, through the directed graph (CODASYL) and relational models (and the debate between them) in the 1970's, to multiple extensions of the relational model (Object-Relational, for example), and finally to XML, which was still growing when the paper was published. The discussion is important because, as the authors point out, the history of data modeling tends to repeat itself (the recurring debate between simple and complex data models, and the reinvention of old ideas in some proposals). Learning what attributes made past data models successful or not guides future researchers and helps them avoid making similar mistakes. The strength of this paper is that it introduces proposals from different eras with a reasonable amount of detail and examples. It clearly illustrates the advantages and drawbacks of each proposal, from both a research perspective and a commercial perspective. The authors hold the viewpoint that simple data models like the relational model (and its extensions) are more desirable than complex models, and that new data models need a large performance gain and market adoption (recognition by large companies or industry) to succeed. The paper is well structured and easy to understand. One drawback is that, when considering the future of complex models like XML, the authors do not consider that increasing computing power may make these complex models more useful. |
This paper addresses the problem that most current researchers are not aware of the details of historical data model proposals from previous eras, which may lead them to the same failures as before. For instance, the current XML era is tending to repeat the failure of the CODASYL proposal of the 1970's because of its complexity. By grouping 35 years of data model proposals into 9 different eras and presenting them, this paper aims to help future researchers learn from history and avoid the same mistakes. Using the same Supplier-Parts-Supply example throughout, the paper shows the development and evolution of the data structures and algorithms in early database history, from hierarchical and directed graph models to the relational model and improvements on it. The intuition and outcome of each design are shown and analyzed thoroughly, and the comparison between designs from different eras is clear. The paper also studies the reasons for the market failure of some proposals such as OODBs, which is a good reference for the commercialization of future research products. The paper is structured well, with summarized lessons after each era. However, some wording in the lessons is ambiguous. For instance, Lesson 5 does not define what "complex" means or how it differs from the "complex" in Lesson 6. Some of the lessons make less sense technically and are outdated. Lesson 9 states that "Technical debates are usually settled by the elephants of the market…", which is not worth discussing in detail. Another drawback of the paper is that it says little about non-relational database models such as NoSQL, which are increasingly used in big data and real-time web applications. |
The paper summarizes all the data models and associated query languages over 35 years into 9 different eras, giving a basic description and the pros and cons of each model. The paper also presents analyses of why each model succeeded or failed during its time and the lessons learned from the exploration of these proposals. The intuition behind the paper is, first, that later proposals inevitably bore a strong resemblance to earlier proposals. What's more, as most current researchers are not familiar with many of the previous models and have a limited understanding of what was previously learned, they are more likely to repeat the same mistakes as their predecessors. Therefore, this paper aims to help researchers learn lessons from previous eras and avoid replaying history. The paper presents data model proposals in nine historical epochs:
1. Hierarchical (IMS): It had two undesired properties: (1) information was repeated, and (2) existence depended on parents. Programming was complex and commands were restricted.
2. Directed graph (CODASYL): It was more flexible but more complex than IMS, and still had limitations, with poorer logical and physical data independence than IMS.
3. Relational: This model offered a simple data structure, a high-level DML, and better data independence, so there was no need to specify a storage proposal, and it was flexible enough to represent almost anything.
4. Entity-Relationship: This model only succeeded in schema design (normalization theory) and became a popular database design tool.
5. Extended Relational: These models proposed additions to the relational model. The best, GEM, added (1) set-valued attributes, (2) aggregation (tuple-reference as a data type), and (3) generalization to the relational model (see the sketch below). But they failed because, while allowing easier query formulation, they offered very little performance improvement.
6. Semantic: These models focused on the notion of classes and supported and generalized aggregation. But they were too complex to be applied in industry.
7. Object-oriented: These models addressed the impedance mismatch and integrated DBMS functionality more closely into a programming language, but failed because they required the programming language compiler to be extended with DBMS-oriented functionality, which was not provided. They lacked a strong transaction and query system and ran in the same address space as the application, thereby offering no data protection.
8. Object-relational: This model made GIS queries simple and made it possible to customize a DBMS to a user's particular needs. It put code in the database (thereby blurring the distinction between code and data) and offered a general-purpose extension mechanism to respond quickly to market requirements.
9. Semi-structured (XML): This model can be used for data movement across a network, but its markets are limited, and it may not actually solve the semantic heterogeneity problem.
The paper analyzes the reasons behind the success or failure of each model and draws several lessons from them: (1) new constructs must bring a big performance or functionality advantage to become successful; (2) the current debate between XML and relational databases is similar to previous debates. The paper gives enough detail to convince readers but not so much that they get lost. The main drawback of this paper is that it does not give a clear solution to today's debate, i.e., if XML may fail, what we should do to solve the existing problems. |
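As a side illustration of the GEM-style extensions mentioned in item 5 above, here is a minimal sketch (my own, not from the paper; the table contents are invented, loosely following the Supplier-Parts-Supply example) of why a set-valued attribute is mostly a notational convenience: the plain relational model can express the same information with an extra relation and a join.

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # Plain relational encoding: the "set of parts a supplier supplies"
    # lives in a separate Supply relation, joined back in when needed.
    conn.executescript("""
        CREATE TABLE Supplier (sno INTEGER PRIMARY KEY, sname TEXT);
        CREATE TABLE Part     (pno INTEGER PRIMARY KEY, pname TEXT);
        CREATE TABLE Supply   (sno INTEGER, pno INTEGER, qty INTEGER);
    """)
    conn.executemany("INSERT INTO Supplier VALUES (?, ?)",
                     [(16, "General Supply"), (24, "Special Supply")])
    conn.executemany("INSERT INTO Part VALUES (?, ?)",
                     [(27, "Power Saw"), (42, "Hammer")])
    conn.executemany("INSERT INTO Supply VALUES (?, ?, ?)",
                     [(16, 27, 7), (16, 42, 3), (24, 27, 1)])

    # What GEM would expose as a set-valued attribute Supplier.parts is
    # recovered here by a join plus aggregation: wordier to write, but not
    # a fundamental gap in capability or performance.
    rows = conn.execute("""
        SELECT s.sname, GROUP_CONCAT(p.pname, ', ')
        FROM Supplier s
        JOIN Supply su ON su.sno = s.sno
        JOIN Part   p  ON p.pno  = su.pno
        GROUP BY s.sno
    """).fetchall()
    print(rows)   # e.g. [('General Supply', 'Power Saw, Hammer'), ('Special Supply', 'Power Saw')]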
In light of the rapid technological changes that have taken place in computing systems over the past few decades, it can be tempting to view progression as a steady march over time, from simple to more sophisticated and complex systems. The reality, however, is often much less pretty. For database systems in particular, new schemes are constantly proposed, some evolutionary and others more transformative. Ultimately, only a few become widely adopted while the rest fall into obscurity. In their review article "What Goes Around Comes Around," Michael Stonebraker and Joey Hellerstein analyze this phenomenon by detailing 9 major data modeling trends and proposals spanning 35 years. By doing so, they aim to shed light on the primary lessons to be gained from each system's successes and/or failures, in the hope that current designers do not repeat the errors of the past. The authors begin by describing the early data models (IMS and CODASYL) used in the 1960s and 1970s. Work with IMS established the importance of fundamental principles such as physical and logical data independence, and CODASYL used a directed graph approach which was more flexible than IMS in the types of representations it could handle. Unfortunately, these early systems suffered from a great deal of complexity, such as in altering relationships between entities, which made them relatively slow and costly to develop. These hierarchical and directed graph systems were widely replaced by the relational model (though not without a great deal of debate), due in large part to the latter's conceptual simplicity and speed. One important lesson that can be learned from this transition is that success is influenced far more by which market-leading companies decide to adopt a new technology than by the technical merits of competing systems. Much of the rest of the paper describes various attempts to shore up deficiencies in or add new functionality to the relational model, ranging from extended relational to object-oriented, among others. In the end, almost all of them failed to have a major commercial impact, mainly because they did not bring enough improvement to justify wholesale changes by major companies. The paper concludes by addressing the newest (at the time) trend of semi-structured XML models, essentially arguing that these designers are repeating the mistakes of the 60s and 70s by coming "full circle" back to complex data models. In the authors' eyes, the XML and relational advocates are rehashing many of the same arguments made during the original "Great Debate" between the CODASYL and relational models. They urge us to take heed of the lessons learned decades ago, rather than potentially painfully relearning them. The primary strength of this review article lies in its detailed analysis of the major trends from the field's beginnings to the time the paper was published. While it does not perform any original research, it still serves a valuable role by summarizing the major lessons (in the authors' eyes) learned from each system, many of which are still fundamental building blocks in database design today. Overall, the paper is well written and organized into 9 conceptual blocks, making it easier to grasp the considerations and challenges that designers of the past faced, such as the recurring tradeoff between complexity and functionality. 
One potential criticism that can be leveled at this paper is that perhaps the authors were a little too opinionated in addressing the merits of semi-structured data models, which is understandable given their goal of convincing readers that it is a bad idea. While the authors do present some convincing technical objections to the new trend, the inclusion of empirical data regarding efficiency, cost, etc. for systems actually in use would be much more convincing one way or the other. After all, as frequently noted throughout the paper, the market ultimately decides who wins. |
Knowing the history of the development of data models is key to an overview of this subject and is vital for learners diving into advanced database management systems. Thirty-five years of data model proposals can be grouped into 9 eras. Only a few basic proposals appeared in this period, and most proposals are similar to these basic ones. This paper is mainly a summary of the data model proposals in each era, the weaknesses in each era, and how they were corrected over time. IMS had a hierarchical data model and developed the notion of a record type. This first data model was based on the data manipulation language DL/1, with poor physical and logical data independence; also, data is redundant because each record instance can have only one parent. The CODASYL data model is made up of record types and set types based on a directed graph. A given record instance in this model can have multiple parents, which made it more flexible; however, the overall model is more complex than IMS, and the more complex objects had to be bulk-loaded all at once, with long load times. CODASYL had worse logical and physical data independence than IMS. The third model was the relational model, which stored data in tables. This model had more flexibility and data independence than previous models because of its simple data structure. SQL was created as a new set-at-a-time DML. The next model was the Entity-Relationship model, made up of entities and relationships. The E-R model was not accepted as an implementation model; instead, it was widely used for database design because it was easy for DBAs to understand and to convert into tables. After that, the relational model and query languages were further developed; the best of these extensions was called Gem. Gem's extensions include set-valued attributes, aggregation, and generalization; however, it offered no noticeable improvement in performance, so it failed to become popular. The next era was the semantic data model era. The flexibility of classes is the main characteristic of this model. Object-oriented databases appeared after that. They are able to convert from the programming language to database speak and back. In contrast, they require the compiler for the programming language to be extended with DBMS-oriented functionality. Also, there were no standards for this combination, so it failed to generalize. The next era was dominated by the object-relational model. This model allowed users to define data types, operators, functions, and access methods. In addition, the OR model was commercialized by a company that was good at cooperating with other companies to cover its drawbacks. Hence, OR became widespread in the market. Semi-structured data is more flexible in its schema; the data model is specified in XML Schema. A distinguishing characteristic of this paper is that it combines the evolution of the technology with its performance in the marketplace. It allows readers to learn about the development of the technology both from its functional advantages and from how it is used in practice. |
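To make the contrast drawn above between IMS's tree and CODASYL's directed graph a bit more concrete, here is a small Python sketch of my own (the record and set names are only illustrative, not the paper's notation): a Supply record is a member of two named sets, so it has both a Supplier and a Part as owners, and neither supplier nor part information needs to be repeated.

    class Record:
        """A toy CODASYL-style record: named fields plus owned set types."""
        def __init__(self, **fields):
            self.fields = fields
            self.sets = {}                      # set-type name -> member records

        def connect(self, set_name, member):
            # The owner side of a named set points at its member records.
            self.sets.setdefault(set_name, []).append(member)

    supplier = Record(sno=16, sname="General Supply")
    part     = Record(pno=27, pname="Power Saw")
    supply   = Record(qty=7, price=14.99)

    # The same Supply record has two owners, one per set type -- something a
    # strict tree model like IMS cannot express without repeating data.
    supplier.connect("Supplies", supply)
    part.connect("Supplied_by", supply)

    # Record-at-a-time navigation: enter at an owner and walk its set.
    for member in supplier.sets["Supplies"]:
        print(supplier.fields["sname"], "supplies qty", member.fields["qty"])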
This paper outlines the history of database models in order to show the reader the strengths and weaknesses of different representational models throughout history and the motivations behind them. The hope of the article is to educate readers so that they may build upon the findings of the past and not repeat the mistakes of their predecessors. The paper shows the patterns of different proposed database models that became widely accepted versus those that never had quite the same footprint, either due to a lack of support in the developer community or a lack of support from big businesses. The first model the paper looks into is the earliest, IMS. This is a tree-based representation of records, which are your data objects. In this model you repeat information if different instances of the same type of child have different parents. In the Supplier-Parts example from the paper, the same part at different suppliers leads to repeated part information in one schema, and the same supplier carrying more than one part leads to repeated supplier information in the other. We also face the corner case of not being able to have a child without a parent. The data manipulation language DL/1 uses a hierarchical sequence key (HSK) ordering for searching the database. What this means is that in the parent-child tree structure of the database, it searches depth first and then left to right at each "generation" of the tree (contrasted with set-at-a-time access in the sketch after this review). With the sequential storage of this model you must rebuild the stored list in order when inserting a record and swap in the new copy as the master, which makes modification of the data inefficient. This introduces the problem of a lack of data independence, because there is no guarantee that a program will still run if you modify the records in the database. Logical databases allow us to abstract the physical data away from the program so that it will run even if you change the structure of the records or add and remove records. This record-at-a-time interface also places the burden of query optimization on the programmer. The second type of database model we look into is CODASYL. This is a graph-based model, in some ways similar to its predecessor IMS. In this model, a record called an "owner" of the different relationships, known as "sets", can point to n different child record instances. IMS is a hierarchical space while CODASYL is a "multi-dimensional hyperspace", which follows from IMS being built on the tree data structure versus CODASYL being built on the directed graph data structure. CODASYL must keep track of multiple currency indicators, and these move around the hyperspace until you find some data of interest, as opposed to the IMS representation, which only keeps track of one current position and its ancestors (for a get-next program). The result is a more complex representation of non-hierarchical data and poorer physical and logical data independence than IMS. Next we look at the relational model, proposed by Ted Codd. It was superior to CODASYL because of improvements in data independence, ease of program optimization, and the flexibility to represent common situations like the marriage problem (unlike CODASYL). Conservatives argued that adoption would be difficult and that CODASYL could already represent tables. Set-at-a-time access is better for programming: it lets the programmer query based on relationships rather than navigating record by record from an owner. With the help of IBM, and with VAX computers starting to hit the market on a large scale, relational systems emerged victorious from this debate. 
The success of the VAX and the difficulty of porting CODASYL systems led to the adoption of the relational model on some of the more affordable computers on the market at the time. Next we looked at the Entity-Relationship model proposed by Peter Chen, which was overshadowed by the concurrently erupting relational model of the 1970s. We looked at the R++ models, which had minor impact: they made simple syntax possible for referencing sets of records with a certain attribute value, but that was not a big enough improvement. We looked at the Semantic Data Model (SDM). SDM did not find footing either, because its main improvement over the R++ models was that generalization was done using graphs instead of trees, which was not a significant enough advance over the relational model most prominent in the market at the time. Next we looked into the Object-Oriented database model, which aimed to support C++ as a data model. It needed performance competitive with native C++, which could not be achieved given the lack of support from the developer community. We looked into the Object-Relational model, which blurs the separation between code and the database; this saw some adoption as it was commercialized by Illustra. In summary, we see how today's database model thinking has seemingly retrograded to a complex model that faces faults similar to those of earlier database models in history. |
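The record-at-a-time versus set-at-a-time distinction discussed in this review (DL/1's HSK navigation versus a declarative query) can be sketched roughly as follows; the data values are invented, and sqlite3 merely stands in for a relational engine.

    import sqlite3

    supplies = [(16, 27, 7), (16, 42, 3), (24, 27, 1)]   # (sno, pno, qty)
    parts    = [(27, "Power Saw"), (42, "Hammer")]        # (pno, pname)

    # Record-at-a-time: the program itself encodes the access path (scan the
    # supplies, then look up each part), as a DL/1 or CODASYL programmer would.
    names = []
    for sno, pno, qty in supplies:
        if sno == 16:
            for p_pno, pname in parts:
                if p_pno == pno:
                    names.append(pname)
    print(names)                                          # ['Power Saw', 'Hammer']

    # Set-at-a-time: state *what* is wanted; the optimizer decides *how*.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Supply (sno INT, pno INT, qty INT);
        CREATE TABLE Part   (pno INT, pname TEXT);
    """)
    conn.executemany("INSERT INTO Supply VALUES (?, ?, ?)", supplies)
    conn.executemany("INSERT INTO Part VALUES (?, ?)", parts)
    print(conn.execute(
        "SELECT pname FROM Part JOIN Supply USING (pno) WHERE sno = 16"
    ).fetchall())                                         # e.g. [('Power Saw',), ('Hammer',)]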
This paper summarizes 30 years of DBMS evolution, covering both technical specifics and the market perspective, and gives a very clear view of each data model's role in this history. It starts with IMS, a hierarchical (tree-structured) data model shaped by its data manipulation language DL/1, which offered poor physical data independence, was restrictive and complex, and left manual query optimization to programmers. Next came a graph-structured model called CODASYL; it still provided no physical data independence, but it proposed the notion of a logical database. It was flexible but more complex than IMS. Ted Codd then proposed the relational database, which stores data in a simple data structure and accesses it through a high-level set-at-a-time DML. It achieved both physical and logical data independence by leveraging a high-level language and a simple data structure. The E-R model proposed that a database be thought of as a collection of instances of entities, and that entities have attributes. The authors also give reasons why normalization theory failed: it does not explain "How do I get an initial set of tables?", and it was based on the concept of functional dependencies, a construct real-world DBAs could not understand. R++, though it did not succeed in the market, did make contributions: 1) set-valued attributes (a data type added to the relational model to deal with sets of values); 2) aggregation (tuple-reference as a data type); 3) generalization, where each specialization inherits all of the data attributes of its ancestors. The Semantic Data Model focuses on the notion of classes: 1) it allows classes to have attributes that are records in other classes; 2) it enables multiple inheritance so that classes can generalize other classes; 3) classes can also be the union, intersection, or difference of other classes; 4) classes can have class variables, e.g., a Ships class can have a class variable that is the number of members of the class. The OO model aimed to make things object-oriented and easy: 1) no need for a declarative query language; 2) no need for fancy transaction management; 3) the run-time system had to be competitive with conventional C++ when operating on objects. The Object-Relational (Postgres) model appeared because of the need for efficient multi-dimensional searching (there is no way to execute such a query efficiently with a B-tree access method). The greatest idea of the OR model is that it encapsulates a chunk of code with a specific function and passes parameters to the encapsulated function, making the system more manageable and extensible. The semi-structured model has two basic points: 1) schema later (easy schema evolution); 2) a complex graph-oriented data model. The authors also introduce the XML data model, in which records 1) are hierarchical, 2) can have links, 3) can have set-based attributes, and 4) can inherit from other records in several ways. The authors also give summaries of XML, XML-Schema, and XQuery, and propose that XML will be a popular "on-the-wire" format for data movement across a network, noting that XML is sometimes marketed as the solution to the semantic heterogeneity problem. The strength of this paper is that it comprehensively analyzes the technical and market reasons behind each era of DBMSs, making the outcome of each era understandable. It is hard to evaluate the drawbacks of such a summary paper; N/A for this one. |
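The "encapsulate a chunk of code and pass it parameters" idea attributed to the OR model above can be illustrated with a short sketch. This is not Postgres; sqlite3's create_function hook is used only to show the flavor of a user-defined function running inside the database engine, and the pricing rule and data are invented.

    import sqlite3

    def discounted(price, qty):
        # Hypothetical business rule: 10% off for orders of 5 or more units.
        return price * qty * (0.9 if qty >= 5 else 1.0)

    conn = sqlite3.connect(":memory:")
    conn.create_function("discounted", 2, discounted)     # register the UDF

    conn.execute("CREATE TABLE Supply (sno INT, pno INT, qty INT, price REAL)")
    conn.executemany("INSERT INTO Supply VALUES (?, ?, ?, ?)",
                     [(16, 27, 7, 14.99), (24, 27, 1, 15.50)])

    # The user-defined function is evaluated where the data lives, instead of
    # shipping every row back to the application to compute the same thing.
    for row in conn.execute("SELECT sno, discounted(price, qty) FROM Supply"):
        print(row)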
This paper summarizes the database models proposed in the last 35 years. Since most researchers today did not start their research 35 years ago, this paper serves as good material for learning the history. The main purpose of reviewing all those models is so that researchers do not repeat history (and its mistakes). The paper divides the 35 years into 9 epochs. The authors introduce the database models one by one and also point out the lessons researchers should learn from them.
1. IMS has a hierarchical data model. The record types are stored in a tree, and each has a unique parent record type. This structure introduces a redundant-records problem, and each record depends on the existence of its parent. Data independence in IMS is complicated, and the tree-structured data model is restrictive.
2. CODASYL is a directed graph model that removes many restrictions of the previous hierarchical model, so each record type can have several parents. A named arc, called a set type, is used to connect a parent and a child. However, it has poorer data independence than IMS, and loading is more complicated.
3. The relational model was proposed by Ted Codd; it stores data in tables and uses a set-at-a-time DML. It is flexible enough to represent almost any relationship, and it achieves a good level of both physical and logical data independence.
4. Peter Chen proposed the Entity-Relationship model in the 1970s, in which the database is a collection of entities and entities have attributes. Certain attributes form keys, and there can be relationships between entities. The ER model succeeded in the schema design area.
5. In the R++ era, researchers added extensions to the relational model: set-valued attributes, aggregation, and inheritance of record types. However, those extensions did not bring much performance improvement.
6. Semantic Data Models appeared in the post-relational data model era. They focus on classes and support aggregation and generalization, even with multiple inheritance. However, they can easily be simulated by the relational model and did not improve performance much.
7. OODBs tried to move SQL from an ugly sublanguage embedding into a persistent programming language, which would be much cleaner; the main efforts were spent on C++. However, there was no standard, and it did not get support from the programming language community. Most importantly, it did not solve a major problem or provide a major improvement.
8. The Object-Relational model was born to handle the two-dimensional queries that arise in GIS. The OR model supports user-defined data types, operators, functions, and so on. Its contributions were a better mechanism for stored procedures and user-defined access methods; it also allows code to live in the database.
9. The last one is semi-structured data. One can do schema later, but that is only appropriate for semi-structured data (not a large market). The other idea is schema evolution, which allows people to change the schema over time, with the old schema represented as a view (see the sketch after this review). XML can support most of the operations mentioned above, but it is very complicated. The authors were not that optimistic about XML.
The authors summarize the development of the last three decades as going from complex to simple and back to complex. The strengths of this paper can be summarized as follows. 1. It covers every important development of the past 35 years, which makes it an invaluable resource for researchers. 2. The authors not only describe the facts, but also point out the key forces that drove the development. 3. The authors analyze each model in detail, pointing out its main contributions, strengths, and weaknesses. 4. The paper also provides a reasonable prediction of future development and gives corresponding suggestions. The weaknesses of this paper can be summarized as follows. 1. The paper is 43 pages, which is a little too long; the authors might consider spending fewer words on the stories behind the scenes. 2. The paper is organized along a timeline; the authors might consider adding a section comparing all the models, which would help readers gain a deeper understanding of them. |
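The remark in item 9 above, that under schema evolution "the old schema can be represented as a view", can be illustrated with a tiny sketch; the table and column names are hypothetical, and sqlite3 is only a convenient stand-in.

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # The physical schema has evolved: Supplier now carries address columns.
    conn.execute("CREATE TABLE Supplier_v2 (sno INT, sname TEXT, scity TEXT, sstate TEXT)")
    conn.execute("INSERT INTO Supplier_v2 VALUES (16, 'General Supply', 'Boston', 'MA')")

    # Old applications still expect the original two-column Supplier table;
    # a view re-creates that old logical schema on top of the new one.
    conn.execute("CREATE VIEW Supplier AS SELECT sno, sname FROM Supplier_v2")
    print(conn.execute("SELECT * FROM Supplier").fetchall())   # [(16, 'General Supply')]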
The paper addresses different data models and ultimately the cyclical nature in which data models have evolved over time. The bulk of the paper lays out the groundwork by describing each data model in detail and occasionally comparing it to its predecessors. The authors take the stance that the cycle is bad, as "to avoid repeating history, it is always wise to stand on the shoulders of those who went before, rather than on their feet" (40). It describes clearly what should have been learned from each model. Finally, it concludes with discussion of an emerging model (Semi Structured Data), and compares that to CODASYL, completing the circle. The paper discusses XML which was presumably fairly new at the time, but it remains fairly relevant today, as many newer databases are storing JSON data. I personally see this paper as more of a philosophical contribution to the field than a technical one. Most of the paper addresses the past, as opposed to new algorithms. The exception is a section about the potential future of XML, which feeds into the conclusion. Essentially, the paper urges readers to understand the past before designing new systems. It's an important message, but I would hesitate to call it a technical contribution. I think the big strength of the paper is that it serves as a reminder to not get caught up in "the next big thing" as a researcher. Acquiring a base knowledge is important, and that knowledge can prevent us from wasting resources. Ultimately I felt as though the conclusion of the paper was a bit weak, in that it led the reader to draw his/her own conclusions about why the cycle described is happening. The conclusion that we aren't learning from history because we are going back to previously proposed data models is a bit extreme. I believe that from what is laid out in the paper, many of the failures happened due to the context of a certain time. Obviously nobody can predict the future, but some assertions seemed quite odd with "2018 lenses" on - such as the idea that XQuery would be translated to use a SQL backend, when NewSQL is currently quite trendy in the database community. Overall, I thought that the lack of mentioning the fact that old ideas can succeed in a new context was disappointing. Neural nets have become highly important after many years of obscurity. I see no reason why an old data model (that is not inferior in all aspects to a relational model) couldn't become relevant, provided we have a changing market and evolving technology. |
This paper summarizes the data model proposals of the 35 years since the 1960's. It divides this period into 9 eras and introduces the concepts and applications of the data models proposed in each. It also discusses the models' commercial use and the lessons learned from their development and research. The purpose of the paper is to help a new generation of researchers learn the history of data models and be smart enough not to repeat it. A brief summary of the 9 eras:
1. IMS: hierarchical data model, tree structure, record-at-a-time. Drawbacks: lacks physical and logical data independence; information is repeated; the tree structure is restrictive, for example, existence depends on parents.
2. CODASYL: directed graph structure; set types with owner and child; record-at-a-time. Drawbacks: no physical data independence and only some logical data independence; a complex model compared to IMS.
3. Relational: stores data in tables (simple); set-at-a-time access; physical data independence; a more flexible model than CODASYL; SQL as a user-friendly language; a query optimizer replaces the manual optimization of IMS and CODASYL. There was a "great debate" between the relational model and CODASYL, and the relational model won thanks to support from the elephants of the marketplace.
4. Entity-Relationship model: the concepts of entities, attributes, and relationships, with the ER diagram as a popular representation; popular as a database design tool. Drawback: the idea of functional dependencies is too hard for mortals to understand.
5. R++ era: mostly features added to the relational model, for example set-valued attributes, aggregation, and generalization. Drawback: not a great success, since there was no big performance improvement or functionality advantage.
6. SDM era: "post-relational" models with classes and multiple inheritance. Drawback: usually very complex, can be implemented in the relational model, and offers no big performance improvement.
7. OO era: addresses the mismatch between the DBMS and programming languages like C++ (structs and data types); a persistent programming language is cleaner than SQL embedding. Drawbacks: companies were not willing to pay big money for this feature, as it was not their pain point, and it lacked support from the programming language community.
8. Object-Relational era: user-defined types, operators, functions, and access methods; the ability to customize the DB for particular needs (a general-purpose extension mechanism), for example in GIS systems; stored procedures ship the code to the DBMS side and save communication time. Drawback: lack of standards.
9. Semi-structured data: XML, schema later, and schema evolution (see the sketch after this review). Schema later is a niche market, and there are few examples where it is useful besides resumes and ads. Schema evolution is a weak point of current DBs. The XML data model vs. the relational model resembles the last "great debate": a complex model vs. a simple model. XML marketed itself as solving the semantic heterogeneity problem; the authors' opinion is that this problem is not widespread or can be solved in many other ways.
This paper shows us three to four decades of data model thinking and helps young researchers learn about much previous work in the area. It would be a better paper if the authors shared more thoughts on the current state of data model thinking and made predictions about the future based on previous experience. The later rise of JSON, Hadoop, and Spark after this paper shows the latest trends in our thinking about data models. |
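As a small illustration of the "schema later" point in item 9 above (my own sketch, loosely echoing the paper's resume example; the records and fields are invented), semi-structured data is self-describing and records need not share fields, so every query has to cope with attributes that may be absent.

    import json

    resumes = [
        {"name": "Alice", "skills": ["SQL", "C++"], "years_experience": 7},
        {"name": "Bob",   "skills": ["XML"],        "expected_salary": 90000},
        {"name": "Carol", "education": "PhD"},            # no skills field at all
    ]

    # Schema-later querying: every attribute access must tolerate absence,
    # which is exactly what a schema-first relational design rules out up front.
    matches = [r["name"] for r in resumes if "SQL" in r.get("skills", [])]
    print(matches)                   # ['Alice']

    # Self-describing records do travel well "on the wire", as the paper notes.
    print(json.dumps(resumes[1]))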
This paper summarizes the history of data model proposals since the late 1960's. For each data model proposed, it provides the motivation, briefly explains how the model works, and presents the lessons learned from the proposal. It's crucial for new researchers to understand the shortcomings of previous models, learn from them, and avoid replaying history. The paper divides the history of data models into 9 eras. Here's a very brief summary of each of them:
Era 1 – 2: Data models in both eras focused on record types and arranged them into either a tree (hierarchical) or a directed graph structure. Their problems were also quite similar: poor physical/logical data independence, repeated information, and record-at-a-time query languages where programmers need to make optimization tradeoffs.
Era 3: To solve the data independence problem, Codd proposed the relational model, in which data is stored in a simple data structure called a table and accessed through a high-level set-at-a-time language. This high-level language provides physical data independence, since programmers only need to tell the DBMS what they want to get and the query optimizer can figure out how to do it. Logical data independence can also be achieved easily due to the simple data model.
Era 4: The Entity-Relationship data model introduced the concepts of entity, attributes, and relationship, which differ from all three previous data models. Though it was never implemented as the underlying data model in a DBMS, it became a very successful schema design tool.
Era 5 – 6: In the extended relational era, many improvements to the existing relational model were proposed, including set-valued attributes, aggregation, and generalization (inheritance). In the semantic era, a very complex data model focused on the notion of "class", supporting aggregation, inheritance, etc. However, since these proposals offered little performance improvement, they had little impact.
Era 7: In the object-oriented era, people were trying to solve the "impedance mismatch" problem, and persistent programming languages were proposed. However, without the support of programming language experts, and given that only a small market was available, the effort soon failed.
Era 8: Object-Relational model. To meet the needs of many different markets, the user should be allowed to customize their DBMS. Therefore, user-defined data types, operators, functions, and access methods were proposed. UDFs also pushed code to the database side and offered a great performance enhancement.
Era 9: In the semi-structured data model, the user can choose to specify the schema later or not at all. However, this brings the "semantic heterogeneity" problem, which is hard to solve. The writers of the paper also think there is only a niche market for the schema-later approach. As for the data model itself, it is too complicated and includes features nobody ever seriously proposed in a data model. The authors of the paper deem this a repetition of the first two eras, and the current debate between XML and relational advocates is quite similar to the "Great Debate" between relational and CODASYL.
Overall, this paper is great for new researchers entering the database field. It helps them learn not only the history but also the lessons people drew from it, which is even more important. |
In the paper "What Goes Around Comes Around", Michael Stonebreaker and Joey Hellerstein trace 35 years of data model proposals and classify them into 9 different eras. Both the successes and failures of each era are highlighted and critically analyzed. They propose such an organization in order to prevent future researchers from repeating the same mistakes as their predecessors. Since newer researchers are ignorant of previous works, the cycle of history repeating itself seems inevitable. This cycle can only be broken when they are informed of the lessons learned through decades of "progress". Thus, more effort can be directed to move us forward rather than backwards. We start with the birth of research - early data models such as IMS and CODASYL which suffered from foundation that they were built upon. IMS had a hierarchical tree structure and was prone to information redundancy, existence dependencies, and the inability to represent common n-way relationships. CODASYL, its successor, solved these problems with a directed graph structure. However, even though most of these problems were solved, they were not done so elegantly. There were many trade-offs that consequently increased complexity and load times making crash recovery a nightmare. By the end of this era, researchers were confident that they wanted physical and logical data independence, but a much simpler means of containing data. Thus, we enter perhaps the most interesting era - the relational and entity-relationship era. These data models used tables to effectively eliminate the problems that CODASYL faced. Yet, this sparked a great debate between those that felt change was not necessary and those that differed. Ultimately, since the field was at a standstill, industry (specifically IBM) became a major factor to the success of the relational model. Furthermore, the entity-relationship model which appeared to be an extension of the relational model also gained traction due to its ability to represent relationships and normalization. This led many to believe that they could customize their data models to their needs. Extended relational and semantic data models were created and were essentially add-ons to previous models that assisted in applications to a specific domain. Under these circumstances, they were lacking a performance or functionality advantage and had little long term impact. Not long after, the craze of object oriented programming attempted to take over this field. This led to the creation of Postgres, a performance leap which was initially intended to assist with geographical information systems. We now reach XML - what Stonebreaker most likely would have referred to as "the cornerstone for new debate". This takes a complete 180 from the simple designed models to models that are much more complex than CODASYL. Stonebreaker doesn't recognize the XML approach to be useful and merely something that will come and go. He believed that niche approaches that only target a subset of users will have no long term impact. However, his restricted point of view neglects the fact that many companies today use non-relational databases - a field that he believed would be fixed by simply converting the problem type to a schema-first problem. Although not as popular, non-relational data models have secured a spot as the means to minimize latency and scale their backend at the cost of "less friendly" user data. |
This paper summarizes the various significant data models proposed throughout the history of DBMSs. Initially we have IMS and its hierarchical data model, which defines any number of record types that must be organized in a tree-like structure. This offers limited physical and logical data independence, has difficulty representing some relationships between records, and requires queries to be manually constructed and optimized. CODASYL was a next step, based on a directed graph model relating records to each other. This ended up having virtually no physical data independence and being more complex than IMS, but it allowed far more flexibility. Following this was the relational model, which divided everything into tables. This focused on being a higher-level interface, which gave better data independence, and was even more flexible than CODASYL, being able to represent nearly any kind of data. This also introduced SQL, which made for much easier queries. IBM ended up supporting relational databases in its systems. This, along with the introduction of query optimizers, ended up making the relational model the new standard. Afterwards, most data models offered smaller changes, or changes that ultimately weren't accepted. Entity-Relationship models didn't change DBMS design, but ER diagrams proved useful for constructing database schemas. R++ additions to the model tended to be small improvements with little impact and were mostly ignored. Semantic data models proposed "classes" as sets of record instances for a broader but more complex model. Ultimately, there wasn't enough interest, and this was also set aside. An object-oriented model tried to take advantage of the fact that SQL is usually embedded in code by proposing a data model more like a standard programming language, to make this embedding work more naturally. There wasn't enough demand for this model, however, and not much support from the programming community. The object-relational model was more successful and made some improvements based on standard programming languages. Some data usage, like GIS, requires non-standard data types, so this model allowed the user to define custom data types, along with custom operators and access methods for this data. Postgres is a DBMS that implemented this model, while also allowing the user to store code along with standard data. Finally, the paper describes semi-structured models, including models where the data schema changes or is not known ahead of time. This also includes XML, of which the authors of the paper seem highly skeptical, describing it as too complicated and likely to produce debates similar to those between CODASYL and the relational model, with similar results. The authors' final note is to be aware of the history of data models so that we don't needlessly have the same arguments and do the same things that have already been done. |
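The GIS motivation mentioned above can be made a little more concrete. A two-dimensional "points inside a rectangle" predicate is easy to state but cannot be answered efficiently through a single one-dimensional B-tree, which is why object-relational systems let users register their own types, operators, and access methods. The sketch below (invented data, no real spatial index) only shows the user-defined type and operator side of that idea.

    from dataclasses import dataclass

    @dataclass
    class Point:                     # the "user-defined type" in OR terms
        x: float
        y: float

    def inside(p: Point, x0: float, y0: float, x1: float, y1: float) -> bool:
        # The "user-defined operator": true if p falls inside the query rectangle.
        return x0 <= p.x <= x1 and y0 <= p.y <= y1

    points = [Point(1.0, 2.0), Point(5.5, 3.2), Point(9.0, 9.0)]

    # With no multi-dimensional access method this is a full scan; an OR system
    # would let the user plug in an R-tree-like index as a user-defined access method.
    print([p for p in points if inside(p, 0.0, 0.0, 6.0, 4.0)])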
This paper summarizes the development of data models over 35 years and comments on each model's design logic and the reasons for its success or failure. It is very important because it helps beginners get a big picture of the different data models and their development history. Some comments also reflect the authors' professional insight into and analysis of data models. The paper groups the summary into 9 different eras, discussing the proposals and drawing lessons for each. IMS and CODASYL are introduced first. IMS used a tree-based model to store data, which causes data redundancy, while CODASYL used a network-based model to solve this problem. Both models' query languages are record-at-a-time imperative languages, and both models lack physical and logical independence. This motivated the relational model, which structures data as relations, i.e., sets of tuples. Queries are set-at-a-time, and the relational model neatly solves the problem of physical and logical independence. There was a competition between CODASYL and the relational model. The relational model was criticized as too academic, with queries that are hard to understand, which motivated SQL. There were three unsuccessful modifications of the relational model, including entity-relationship, R++, and the semantic data model, which illustrates that new constructs will go nowhere unless there is a big performance or functionality advantage. Modifications should follow the KISS rule and solve a "major pain". The Object-Relational model was a successful modification. It was first introduced for GIS systems, and its key feature is allowing users to define their own data types, operators, functions, and access methods. OR is now nearly the most popular model; one barrier is the missing standard for UDFs. Semi-structured data is introduced last. XML is a popular data format for data movement. XML DBMSs are potentially popular but are doubted, mainly because schema-last serves only a limited market and the complex network model goes against the KISS rule. XQuery is pretty much OR SQL with a different syntax. One drawback of the article is that it does not mention newer data models like JSON. |
The paper What Goes Around Comes Around gives a summary of data model proposals over three decades by grouping them into 9 eras. The authors lay out the paper in chronological order, which allows readers to understand the background from which each data model emerged. For each era, the authors answer questions including why the data model was proposed, what the data model is, what advantages and disadvantages it has, whether it was accepted by the market, and what lessons were learned from the era. This structure helps clarify why and how data models progressed. Another great aspect of this paper is that the authors try to apply the same example to all data models, which helps readers understand the differences between them. However, the paper does have several limitations: 1) Many data model eras share similar basic modeling ideas and could be grouped together for discussion. For example, relational, entity-relationship, and R++ are similar in my opinion, and it would be clearer to readers if there were a section comparing their differences. 2) The paper provides figures to illustrate how the data models work, which is good for understanding; however, there is not enough explanation for each figure, which left me puzzled. For example, figure 8 in the CODASYL era has no explanation, and I still don't know why Supply would point to another Supply. 3) The authors use many abbreviations without explaining them, which makes the paper hard to follow for readers who do not have much background knowledge. The paper makes an interesting conclusion at the end: history is repeating itself in data modeling. The authors think that XML shares similar traits with IMS in its complexity. To avoid history replaying, one should understand history first. |
“What Goes Around Comes Around” (Stonebraker and Hellerstein, 2005) is a survey paper of major DBMS data model proposals from the late 1960s until 2005, ranging from hierarchical and directed graph models early on; to relational, entity-relationship, and semantic next; then object-oriented and object-relational; and more recently semi-structured. In particular, the paper highlights why particular data models and approaches were or were not successful, and also synthesizes lessons learned about data model design and adoption factors. The authors explain that they wish to share the history of DBMSs so that more junior DBMS designers 1) do not accidentally replicate prior approaches when trying to conduct original research, and 2) do not repeat mistakes encountered in prior approaches or evangelization attempts. While discussing how data model approaches have evolved over the years, some of the major topics the paper considers are the importance of data independence and what designs achieve it; flexibility versus complexity tradeoffs for hierarchical, tree, and directed graph data models; set-at-a-time languages being more usable for end-users and providing physical data independence in contrast to record-at-a-time languages; the phenomenon that new constructs and features that do not greatly improve performance or increase functionality will have low impact; object-relational models and benefits they provide in performance and extensibility via user-defined functions; and semi-structured data, the likely nicheness of the “schema later” model, and the over-complexity of the XML and related data models. In addition to technical lessons in design, the paper also discusses social factors that should be considered when designing and marketing a data model system or approach. Whether a new approach or system offers substantial benefit over existing options, whether it is compatible with existing infrastructure, and whether it is easy to use are important factors organizations consider when choosing, switching, or modifying their DBMS. Interestingly, the authors demonstrate that politics and marketing also play a role in adoption and in whether a class of approaches will continue to be explored. For example, IBM’s status in the marketplace and their eventual choice of relational systems over the existing hierarchical and directed graph approaches led to the demise of the latter. These lessons about effective DBMS design approaches and factors in adoption are valuable, especially as they are rooted in examples from the past. Ideally a reader could take those examples and consider what may or may not be promising research directions in the current day. As a stylistic observation, the fact that the paper enumerates single-sentence contributions also helps the reader understand the major takeaways. The rudimentary explanations of the hierarchical, directed graph, and relational data models, as well as the common example with Suppliers, Parts, and Supplies, were helpful to a novice reader. It could have been interesting to see how data model design history and adoption factors compare with those of other technologies and areas of computer science, for example parallel computing, architecture, and programming languages. Certainly the authors are DBMS researchers, so the paper is an overview for DBMSs. 
However, perhaps challenges and lessons learned from the design and adoption of other computer system technologies could be relevant, so that the reader and broader DBMS community preemptively consider what mistakes to avoid in the future. |
This paper is a summary of data model proposals over the past 35 years, from 2005's perspective. The authors separate the proposals into 9 different eras. For each era, they explain the proposals in detail, with their benefits and weaknesses, and use examples to illustrate the data model. From a higher point of view, the authors argue that there are only a few basic data modeling ideas, and the only new concepts to appear are "code in the database" and "schema last". This paper can help readers unfamiliar with DB history understand the development of this research area. The authors point out that the debate between XML advocates and the relational crowd resembles the debate between relational advocates and CODASYL from a quarter of a century ago. This is where the title "what goes around comes around" comes from. Besides the ideas of the different data model proposals, what I learned from the paper is to always research the history of the area before implementing a "new" idea; we need to understand what was previously learned and avoid replaying history. BTW, another interesting part of the paper is that it discusses why the R++, SDM, and OO era models failed in the market while the OR era proposals succeeded. It's an interesting and important point of view. Since it's a survey paper, it has no technical contributions. Considering I'm not familiar with the history of DB, I cannot judge whether it has some drawbacks. |
This paper describes the data models proposed since the 1960's in chronological order and summarizes the lessons people learned when designing them. This work is very important because, by learning from history, database researchers can avoid the mistakes made in previous data models. It can also help them generate ideas and build better models. The authors list 9 data models and compare them; in what follows, I summarize each of them and give my personal opinion.
1. IMS Era: IMS is a hierarchical data model that uses a tree structure for records. This results in data redundancy, which may cause anomalies. Also, some records depend on the existence of their parents, which restricts data insertion. IMS chose DL/1 as its DML; it is a record-at-a-time language, so programmers need to do manual query optimization. Although it supports some level of logical data independence, it limits the amount of physical data independence.
2. CODASYL Era: This is a network data model that is more flexible than IMS, since a record instance can have multiple parents. However, it offers poorer logical and physical data independence than IMS. Also, the CODASYL DML requires programmers to keep track of the last record visited, which makes data manipulation much harder.
3. Relational Era: In order to provide better data independence, Ted Codd proposed the famous relational model. In the relational model, data is stored in a simple data structure (tables) and a high-level set-at-a-time query language is applied. The rise of the relational model caused a competition between the relational model and CODASYL. For the relational model, people created SQL based on relational algebra, and automatic query optimization makes efficient querying much easier.
4. E-R Era: The entity-relationship (E-R) model uses a collection of instances of entities to model the records and defines different kinds of relationships between entities. The E-R model is successful: people can do schema design by drawing the E-R diagram at the beginning. Also, schema normalization theory guides schema decomposition, which helps avoid insert, update, and delete anomalies (see the sketch after this review).
5. R++ Era: In this era, several query language extensions were introduced, such as set-valued attributes, aggregation, and generalization. However, these features provided no noticeable improvement over the relational model, so R++ did not make a big impact on the commercial world.
6. Semantic Data Model Era: SDMs use ideas similar to R++ but focus on the notion of classes. In an SDM, class generalization can form a graph that enables multiple inheritance. However, due to their complexity, SDMs were not very successful in industry.
7. OO Era: In order to connect DBMSs with object-oriented languages, OODBs were proposed. They use a persistent programming language to do both jobs, which makes the code cleaner than a SQL embedding. In the mid 1980's, OODBs supporting persistent C++ were successfully implemented; however, they were not accepted by the market.
8. Object-Relational Era: OR allows users to customize the database to their needs with user-defined types, operators, functions, and access methods. The prototype was called Postgres; the UDTs and UDFs in Postgres make some tasks easy and efficient compared to plain relational systems, which earned Postgres some commercial success.
9. Semi-Structured Data Era: This model uses the schema-last idea, which requires data to be self-describing. Depending on the use case, we should choose schema first or schema last. The XML data model is a kind of semi-structured model; it is a very complicated model and may fail due to that complexity. A good feature of XML is that it supports data movement across the internet well, which is why XML became popular.
The technical contribution of this paper is that it gives a wonderful summary of how data models have evolved since the late 1960's. It gives people a clear view of why these models came out and why some faded away. Besides, it summarizes the lessons people learned in the development of DBMSs, enabling people to avoid replaying history. The authors give critical judgments about each model, including its background, pros and cons, and rich examples, which makes the paper easy to follow and understand. The main drawback of this paper is that it looks a little bit old-fashioned, since it was written more than 10 years ago and does not cover some recent data models and database technologies. With the coming of the internet era, data models have evolved very quickly: graph, document, column-family, key/value, and other models are not introduced in the paper. Also, newer systems like NoSQL and NewSQL DBMSs are not discussed. For semi-structured data, the authors predicted that the model would not become popular, but it did! It is widely used in current DBMSs. In some NoSQL databases, XML and JSON are widely used as the data model since they can be transferred much more easily over the internet. In a word, it's a great paper and I learned a lot by reading it! |
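To make the normalization point in item 4 above concrete, here is a short sketch with invented data: a single wide table that repeats supplier information invites update anomalies, whereas the decomposed design stores each supplier fact exactly once.

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # Unnormalized: the supplier's city is repeated for every part supplied, so
    # a change of city must touch many rows (update anomaly) or data goes stale.
    conn.execute("CREATE TABLE SupplyWide (sno INT, scity TEXT, pno INT, qty INT)")
    conn.executemany("INSERT INTO SupplyWide VALUES (?, ?, ?, ?)",
                     [(16, 'Boston', 27, 7), (16, 'Boston', 42, 3)])

    # Decomposed (normalized): the supplier's city is stored once.
    conn.executescript("""
        CREATE TABLE Supplier (sno INT PRIMARY KEY, scity TEXT);
        CREATE TABLE Supply   (sno INT, pno INT, qty INT);
        INSERT INTO Supplier VALUES (16, 'Boston');
        INSERT INTO Supply VALUES (16, 27, 7), (16, 42, 3);
    """)
    conn.execute("UPDATE Supplier SET scity = 'Cambridge' WHERE sno = 16")  # one row changed
    print(conn.execute(
        "SELECT DISTINCT scity FROM Supplier JOIN Supply USING (sno)"
    ).fetchall())   # [('Cambridge',)]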
At its core, “What Goes Around Comes Around” is a survey of the last 35 years of research and innovation with respect to data models. The authors divided this research into 9 eras, each of which was colored by a particular proposal or set of proposals. For each era, the authors also analyzed these proposals and identified the strengths and weaknesses of each approach, as well as the “lessons” learned from each era and the industry impact these proposals had. There were several themes the authors focused on in their analysis. In particular, the idea of complexity seemed central to their discussion throughout the paper: the identified problems and weaknesses of the earlier eras seemed to be largely a result of the complexity of the data models, both in concept and in practical use. The authors also identified a shift to simpler models in later eras, which they classified as a very good thing, and noted that the most recent era seems to be going in the opposite direction (more complex). The paper had 2 distinct strengths: 1) The “survey” part of the paper was of high quality: a large breadth of information was covered, spanning a large amount of time, and the analysis seemed fair with respect to each technology, with benefits and disadvantages named for each. Also, the authors provided information relevant to the research community, to industry, and to the interactions between the two. 2) In addition to the survey, the authors synthesized an opinion about the then-current era of data models, XML, particularly that it would fail due to being too complex; as evidence, the authors used the history of data model proposals they had just presented, drawing a parallel between XML and the introduction of CODASYL, which also failed. I believe the paper also had some minor weaknesses. For example, I think some of the “lessons” learned from each era were too strong a conclusion to draw from the examples given. For instance, lesson 9 is “Technical debates are usually settled by the elephants of the marketplace, and often for reasons that have little to do with the technology.” Even if this is intuitive, the only example they gave for it was IBM’s influence on the CODASYL vs. relational debate, which hardly seems like enough evidence for such a blanket conclusion. Also, I think that at times the tone of the paper was haughtier than necessary. |