Corporations often need to work with data from different sources, each with its own schema. Information aggregation combines data from various sources into a unified logical store, with just one schema. Unfortunately, it is difficult to derive semantic mappings over the schemas of the underlying data sources, which is needed before related data can be combined. Semantic heterogeneity refers to the problem where a single concept can be represented by many different schema hierarchies, or by many different names, making semantic mapping harder.
Alon Halevy presents a review of semantic mapping, with example use cases from industry to motivate the problem, and a discussion of current approaches. One use case presented in the article is everyclassified.com, a web site that used to aggregate classified ads from many sources into a searchable repository. The site was challenging to create, because it used data from thousands of other sites, each with its own schema, and mapped every source into the common schema of the aggregator. The site had to overcome semantic heterogeneity to produce accurate mappings into the common schema.
Schema matching software uses many features to suggest likely matches, such as the names of schema items, data types, instance data values, the topology of each schema, and constraints like foreign keys. If many of these features are similar between a pair of items, it is likely the items are related.
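As a rough illustration of how such features might be combined, here is a minimal Python sketch that scores a candidate pair of schema elements by name similarity and data-type agreement. The trigram measure, the feature set, and the weights are all invented for illustration, not taken from the article:

```python
def trigrams(s: str) -> set:
    """Lowercase character trigrams of a name, e.g. 'cust' -> {'cus', 'ust'}."""
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def name_similarity(a: str, b: str) -> float:
    """Jaccard overlap of trigram sets, in [0, 1]."""
    ga, gb = trigrams(a), trigrams(b)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

def match_score(elem_a: dict, elem_b: dict) -> float:
    """Weighted combination of feature similarities (weights are made up)."""
    score = 0.6 * name_similarity(elem_a["name"], elem_b["name"])
    score += 0.4 * (1.0 if elem_a["type"] == elem_b["type"] else 0.0)
    return score

# "CustomerName" and "cust_name" share trigrams and a type: high score.
print(match_score({"name": "CustomerName", "type": "string"},
                  {"name": "cust_name", "type": "string"}))
```

A real matcher would fold in many more signals (instance values, structure, constraints) and calibrate the weights, which is exactly where the learning approach discussed next comes in.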
The author proposes a machine learning approach to schema matching as future work. A schema matcher could be trained in a supervised way, on a data set of real-world schema pairs annotated with correct matches. A learner could also be trained to detect fundamental concepts of a domain, such as customers and orders, for better categorization of topics. The system could further be taught relations between concepts, such as that customers often have addresses. The system could also learn typical data types for each concept, such as strings for names. This would help the system establish better prior beliefs over the quality of a semantic mapping.
A future goal for schema matching is to support larger-scale problems with thousands of items in each schema. Tools for big schema matching might include new data visualization and user interface methods, to help a human analyst work effectively on large schemas.
What is the problem addressed?
Strictly speaking, none: rather than solving a specific problem, this paper calls out an important one, semantic heterogeneity, for us to work on.
The heterogeneity of data is an increasingly challenging problem to deal with. Because database schemas may be developed by independent parties, their differences produce semantic heterogeneity, and making multiple data systems cooperate with each other becomes a daunting challenge. This paper reviews several scenarios of semantic heterogeneity and some research and commercial progress toward addressing the problem.
1–2 main technical contributions? Describe.
Enterprise Information Integration: Enterprises are facing data management challenges that involve accessing data residing in multiple sources, such as database systems, legacy systems, ERP systems and XML files and feeds.
Querying and Indexing the Deep Web: the amount of information on the web is enormous; the challenge is accessing the deep web without the help of an index or any knowledge about the meaning of the fields in a form.
Merchant Catalog Mapping: Semantic heterogeneity occurs when aggregating product catalogs. Consider an online retailer that accepts feeds of products from thousands of merchants, each trying to sell their goods online. The problem is how to map each merchant's local form into a homogeneous form. The fundamental reason semantic heterogeneity is so hard is that the data sets were developed independently, and therefore varying structures were used to represent the same or overlapping concepts. As would be expected, people have tried building semi-automated schema matching systems by employing a variety of heuristics.
The author proposes two directions to work on: dealing with drastically larger schemas and dealing with vastly more complex data sharing environments. First, schema matching tools need to incorporate advanced information visualization methods and support finding a relevant schema in a large collection of schemas. Hence, instead of focusing on tools that solve the entire problem but are inherently brittle, we should build robust tools that bring users closer to their needs. Second, we need to manage a dataspace, rather than a database, to incorporate different data sources.
1–2 weaknesses or open questions? Describe and discuss
I think the concept of a dataspace is not very clear, and I wish the author had elaborated on it more. I find the problem quite insightful, since semantic heterogeneity is everywhere. An important problem in natural language processing is synonymy, which is a form of heterogeneity in language, and I am looking forward to interaction between these two fields.
This paper is not about new designs; it presents a discussion of semantic heterogeneity. Semantic heterogeneity arises when different schemas apply to the same domain, including multiple XML documents, web services, and ontologies. In many scenarios, resolving semantic heterogeneity is very important, including enterprise information integration (EII), querying and indexing the deep web, merchant catalog mapping, and semi-structured data. This paper first introduces some common scenarios where resolving semantic heterogeneity is crucial for building data sharing applications. Second, it discusses some of the difficulties of resolving semantic heterogeneity. Then it presents some research and commercial progress, as well as the state of the art, in addressing this problem. Finally, it provides some key open problems and future opportunities.
The problem here is that resolving semantic heterogeneity is very important but also difficult. For example, the issue arises when aggregating product catalogs. Consider a schema with a hierarchy of products and their associated properties. There are a large number of possible mappings, and the best one is the one that ultimately sells more products. Resolving the issue is very hard. First, the data sets were developed independently, which makes it likely that varying structures were used to represent the same concepts. Second, it is also time consuming, requiring both domain and technical expertise. There are many other factors that make it difficult.
The major contribution of the paper is that it provides a detailed discussion of semantic heterogeneity. It addresses the importance and difficulties of semantic heterogeneity in a very clear manner. It also summarizes the state of the art for resolving semantic heterogeneity, which we recap below:
1. schema element names (find schema matcher from element names such as table and attribute names)
2. data types (consider data types as heuristic for ruling out certain match candidates)
3. data instances (elements from two schemas that match each other have similar data values)
4. schema structure (if two elements match, their children are likely to match as well)
5. integrity constraints (on single attributes or across attributes)
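As a concrete toy example of heuristic 2 above, a matcher can prune the space of candidate pairs by keeping only type-compatible elements before applying more expensive heuristics. This sketch is illustrative only; the compatibility table and the schemas are invented:

```python
# Pairs of declared types considered compatible (an assumed, tiny table).
COMPATIBLE = {
    ("int", "int"), ("int", "float"), ("float", "int"),
    ("varchar", "text"), ("text", "varchar"),
    ("varchar", "varchar"), ("text", "text"), ("date", "date"),
}

def plausible_pairs(schema_a: dict, schema_b: dict) -> list:
    """Keep only (a, b) element pairs whose declared types are compatible."""
    return [(a, b)
            for a, ta in schema_a.items()
            for b, tb in schema_b.items()
            if (ta, tb) in COMPATIBLE]

# Rules out e.g. matching an int column against a date column.
pairs = plausible_pairs({"order_id": "int", "notes": "text"},
                        {"id": "int", "comment": "varchar", "created": "date"})
print(pairs)
```

Pruning like this is cheap and narrows the candidate set before name- and instance-based comparisons are run.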
One interesting observation: this paper is not about new designs but is more a summary of semantic heterogeneity, including problems, difficulties, and current solutions. I find the open problem of larger schemas and schema search very interesting. We have larger and larger schemas nowadays, and this issue will become even more crucial.
The authors in this paper describe the challenges posed by semantic heterogeneity when fetching information from different database resources. The paper begins by explaining the scenarios of enterprises, the deep web, and merchant-catalog mapping, where it is necessary to understand and resolve semantic heterogeneity.
CHALLENGES TO SOLVE SEMANTIC HETEROGENEITY
Though there are solutions like reconciling heterogeneity through semantic mapping, lots of effort is required. Also, the scale of the deep web makes it difficult for search engines to crawl. Datasets were developed independently, and the differing structures are a byproduct of human nature. The authors also state, based on past experience, that standard schemas don't work well and have not been successful. Similarly, semantic heterogeneity needs to be resolved at the step where a data provider exposes its data to its counterpart.
SCHEMA MATCHING HEURISTICS:
The following classes of heuristics have been used for schema matching, which in turn is used for building the corresponding schema mapping expressions. Each of them has its own challenges when used for schema matching.
- Schema element names: Challenges in resolving synonyms, identical names with different meanings, abbreviations, and concatenations
- Data types: Challenges in resolving underspecified datatypes
- Data instances: Unreliable when instance data is unavailable
- Schema structure: Challenge of finding the initial match that drives similarity between neighbors
- Integrity Constraints
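One plausible way to attack the abbreviation/concatenation challenge listed under schema element names is to expand known short forms before comparing names. The expansion table below is a made-up stand-in for a real domain dictionary:

```python
# Assumed domain dictionary mapping common abbreviations to full words.
ABBREVIATIONS = {"cust": "customer", "addr": "address", "qty": "quantity",
                 "num": "number", "amt": "amount"}

def normalize(name: str) -> str:
    """Expand each underscore-separated token via the abbreviation table."""
    tokens = name.lower().split("_")
    return "_".join(ABBREVIATIONS.get(t, t) for t in tokens)

# After normalization, the two names compare as equal.
print(normalize("cust_addr") == normalize("customer_address"))
```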
Another solution is to leverage past experience for schema matching. Machine learning is one way to use past matchings: the system learns domain concepts and their representational variations, relationships between concepts, and domain constraints. This approach faces challenges when dealing with drastically larger schemas and more complex environments. To address the issue, schema matching tools need to incorporate advanced information visualization methods, so that the designer's attention is focused on the right place to view a hypothetical mapping. The author also proposes a search engine that returns a ranked list of schema elements in the indexed schemas that are candidate matches, giving the Woogle web-service search engine as an example. The authors also discuss the challenges of managing dataspaces, due to the limited interfaces of some participants, their relationships, and evolving integrations.
STRENGTHS AND WEAKNESSES:
The paper discusses and highlights the major challenges posed by semantic heterogeneity. It provides many examples of real-world scenarios where matching and mapping across heterogeneous schemas becomes essential. It also describes various ways in which semantic heterogeneity can be resolved, and the challenges faced in the course of resolving it. The paper would have been stronger if it had provided evidence for the solutions it proposes for resolving semantic heterogeneity.
This paper addresses an interesting problem that we haven't yet touched on in this course. How do you get access to all of your data, stored in multiple formats and databases?
This problem can emerge for several reasons - corporate mergers and acquisitions can lead to some customers being stored in one database, and some in the other. Also, many systems were originally developed to fit a small business need, and as the business grew, their needs changed. Also, for legal reasons, some companies may have to keep many separate databases - for example, large financial institutions must keep a separate database for their credit card and banking customers - and if a customer uses the company for both, they will appear in both databases. While all of these schemas may model very similar data, the schemas will often be quite different.
The author of this paper attempts to explain why getting all of your data sources to interact is such a difficult problem. There are custom-coded solutions that work well, but they are expensive and time-consuming to create, as well as extremely brittle to schema changes. This leads the author to suggest that the real problem is finding semantic mappings between schemas. Creating these mappings is a manual, time-consuming, and error-prone process. The current tools don't do much to help the user: they let the user draw a line from an element in one schema to an element in the other. The author would like a tool that can propose such schema matches.
There are several techniques used in schema matching; however, none of them perform well in isolation. There is the simplest technique: just looking at the schema element names. This can obviously be improved by using a thesaurus or some other similarity score for words. Systems can also look at data types and data instances, to learn about the actual data stored under the schema. There are also the schema structure and integrity constraints present in the data, which can be used to assist semantic matching. Using many of these techniques together, along with some kind of machine learning, can dramatically improve performance. This way, systems can learn from schemas they've seen in the past and become better predictors of future schemas and schema matches.
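To make the "learn from past schemas" idea concrete, here is a tiny sketch that trains a perceptron on hand-labeled pairs of element names and then predicts whether an unseen pair matches. The features, the training data, and the choice of a perceptron are all invented for illustration; a real system would use richer features and a stronger learner:

```python
def features(a: str, b: str) -> list:
    """Toy feature vector for a pair of underscore-separated names."""
    ta, tb = set(a.lower().split("_")), set(b.lower().split("_"))
    return [1.0,                                   # bias term
            len(ta & tb) / len(ta | tb),           # token overlap (Jaccard)
            1.0 if a[0].lower() == b[0].lower() else 0.0]  # same first letter

# Past mappings labeled by a human analyst: (name_a, name_b, is_match).
history = [("cust_id", "customer_id", 1), ("cust_id", "ship_date", 0),
           ("order_total", "total", 1), ("order_total", "cust_id", 0)]

# Train a perceptron: nudge the weights whenever a prediction is wrong.
w = [0.0, 0.0, 0.0]
for _ in range(20):
    for a, b, y in history:
        x = features(a, b)
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
        if pred != y:
            w = [wi + (y - pred) * xi for wi, xi in zip(w, x)]

def predict_match(a: str, b: str) -> bool:
    return sum(wi * xi for wi, xi in zip(w, features(a, b))) > 0

print(predict_match("cust_name", "customer_name"))  # learned to say True
```

Even this toy version shows the shape of the approach: past human decisions become training data, and the learner generalizes the weighting of evidence that would otherwise be tuned by hand.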
This paper was short, sweet, and to the point. It had a very targeted goal: provide an overview of why information integration is hard, and show a few of the cutting-edge techniques being used. I can't say much bad about it, other than that I would have liked a bit more information about the learning-from-past-experiences portion.
Problem and Solution:
The paper describes the problem of semantic heterogeneity, which arises because data is structured in different ways by different databases and file types. Several scenarios involve enterprise data integration, deep-web querying, catalog mapping, and so on. When obtaining a single view of data or presenting an external view, enterprises must access multiple databases and sources. Many approaches have been created to solve this, and all of them require us to resolve semantic heterogeneity first. The cost of semantic mapping in data integration is high. For querying the deep web, web forms follow no standard that crawlers could use to query the deep web; this creates difficulty for both crawler designers and web designers, because content from good databases may remain invisible. Merchants and online retailers also require mappings between their product categories. Finally, the schema heterogeneity problem exists in semi-structured data as well.
But the problem is hard to solve, because the datasets are developed independently, with different structures and for different purposes, and solving it requires knowledge from different areas. People have created several semi-automated schema matching systems. These systems find correspondences between two schemas and create schema mapping expressions. But they do not work well, and commercial products still rely on manual mappings.
The contribution of the paper is that it points out the semantic heterogeneity problem and summarizes the benefits and weaknesses of proposed solutions. It helps other designers understand the problem better and learn from past experience.
Also, it proposes a new approach, leveraging past experience, to solve the problem. The method uses machine learning to predict mappings the system has not met before, based on previously provided mappings.
The paper is good at showing the problem and recounting past experience. However, I think it still has weaknesses. The leveraging-past-experience approach it proposes is not solid enough: it requires the construction of a very good corpus containing the useful concepts, variations, and relationships, and if the tables change, the corpus needs to be reconstructed.
This paper briefly introduces semantic heterogeneity, the difficulty of resolving it, and current research trends. Semantic matching is challenging because of the variety of schema structures used for the same domain of data. To solve this problem, all aspects of the schema, such as schema element names and data types, should be considered as heuristics. The paper points out that machine learning is a promising approach, which leverages past experience and will be able to give good suggestions for future matchings. In addition, the author mentions two future challenges: advanced information visualization for large schema matching, and a higher abstraction level to manage 'dataspaces'.
A big picture of schema matching is presented in this paper. It illustrates the motivation and challenges of resolving schema heterogeneity in detail. It is a good paper for anyone starting to look into this area.
It mentions only machine learning as the solution for schema matching. However, there might be other ways: machine learning is more of an instance-based matching solution, and some schema-based matching solutions might be possible as well. In addition, the author is optimistic about the machine learning method. Though machine learning might be a good way forward, it is clearly not easy to carry out. It would be much better if the current challenges to "learning from the past" were also included.
It is extremely common for databases to have differing schemas for their data, even if the data in both databases belongs to the same domain. If the databases did not agree on a schema when they were created, then it is almost certain the schemas will differ, simply because they were made by two independent parties. When one tries to access both data sources, one is met with the problem of data heterogeneity: trying to integrate data sources with different schemas can be quite a pain.
Typically, someone who wants to integrate this data will need to find a semantic mapping: a way to translate one data schema into the other. When there are a large number of data sources, as for a website that tries to integrate information about different products, these mappings may be too time consuming to produce by hand. However, they cannot be created in a fully automated way either: it is very difficult for a computer to recognize that diverse terms can actually mean the same thing. Some companies have tried using heuristic programs that assist a human DBA in creating semantic mappings. Such a program uses similarities in attribute names, data types, data values, etc., to suggest semantic mappings.
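The data-value similarity such a program might use can be as simple as set overlap between sampled column values. A minimal sketch, with fabricated sample data:

```python
def value_overlap(sample_a, sample_b) -> float:
    """Jaccard overlap between two samples of column values."""
    sa, sb = set(sample_a), set(sample_b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

# Fabricated samples from three hypothetical columns.
states_1 = ["CA", "NY", "TX", "WA"]
states_2 = ["NY", "TX", "CA", "FL"]
zip_codes = ["94105", "10001", "73301"]

print(value_overlap(states_1, states_2))   # 0.6: plausibly the same column
print(value_overlap(states_1, zip_codes))  # 0.0: rule the pair out
```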
The author suggests that these heuristics are not enough, and that programs should also try to learn from the past: given past mappings, programs may be able to be trained to recognize (or give better suggestions on) future mappings.
The problem of large tables with hundreds of columns still remains even if this machine-learning-based suggestion program were available: mapping hundreds of columns by hand is still a daunting task. The author concludes by saying that in the future, groups of heterogeneous databases should be managed together as a dataspace. At the most basic level, a dataspace should have an interface for retrieval of heterogeneous data. Users can then build on dataspaces by introducing their own semantic mappings, which facilitates more uniform retrieval of data.
This paper is short (y)
The idea of a dataspace is interesting, and may actually be needed in the future.
This paper reads more like an op-ed than a paper.
Are there ways/solutions such that we can prevent ourselves from getting to the place where our data is heterogeneous?
The paper discusses the issue of semantic heterogeneity between multiple database schemas. It goes over various scenarios of semantic heterogeneity and talks about why resolving it is difficult. The paper ends with a discussion of recent approaches to resolving semantic heterogeneity and highlights key open problems in the area.
The paper is not a technical paper and it is more of an introduction to the topic of semantic heterogeneity for people who may not be aware of this problem. The paper does a good job of describing what semantic heterogeneity is with various scenarios and also engages the readers well to focus on the issue of the difficulty of resolving semantic heterogeneity. The author suggests that a possible solution to resolving semantic heterogeneity is to leverage past experience.
While it sounds like an attractive solution to the problem, the paper is only able to give an idea of how it might work, without giving much information about a possible implementation. It is understandable that the paper does not go over the proposed solution in detail, considering its scope is merely to introduce the topic; nonetheless, I am not convinced the proposed solution is feasible in the current era of big data, where not only is the volume of data huge, but the schema of data also evolves constantly and very fast. Having seen something like BigTable, which provides only a simple structure that can store any type of data and leaves applications to interpret the data as they see fit, I am not sure how the proposed solution will fit into today's data systems.
In conclusion, the discussion of semantic heterogeneity in the paper is good and easy to grasp, but we should not pay too much attention to its possible solution or open problems in the paper as they seem a bit outdated for today’s data systems.
This paper discusses the issues with sharing data. When creating data stores, businesses and other entities design their data to suit their own business needs. The problem comes when data needs to be shared between different parties: where one party might have named a field customer_id, another party may have used c_id. Another example is if one party created a hierarchical relationship in the database and the other party did no such thing. Initial solutions to this problem involved creating a user-friendly interface that allowed users to manually create the mappings between two different schemas. To automate the mappings, some solutions attempted to use information such as similar naming schemes or data values, but these solutions ignored previous mapping experience. Furthermore, both of these solutions only work when the schema is relatively small. As time progresses, schemas are increasing in complexity, so more robust solutions need to be developed. The author points out one example of an automated tool that uses machine learning to leverage experience.
I think a simple solution to this problem would be to standardize schemas for different business needs. Standardization is a common solution to problems that involve multiple different forms. Rome, for example, standardized wheel separation so all roads could be of one form. Computer security protocols are another example of standardization, in which certain actions are always taken between two entities when sharing information. I think the difficulty would come from legacy schemas. Getting new users to follow a standard practice for making schemas would be easy, but legacy systems would be very expensive to migrate to a new schema. This was not a problem identified in the paper, but I think it might be an interesting alternate approach to explore.
This paper introduces the problem of "semantic heterogeneity" and several solutions to it. "Semantic heterogeneity" is caused when data comes from multiple sources with different database schemas. In order to present customers with a unified external view of the data, some integration work needs to be done. The solutions to this problem are often called Enterprise Information Integration (EII). For example, queries from users might be translated into several customized queries for the different data sources, and the partial results are then combined to form the final results for the users.
The reconciliation of the semantic differences between data sources is called "semantic mapping," which is a key issue of this problem. This is hard to achieve efficiently because it usually needs to be done manually and often requires some domain knowledge. As a result, we can imagine that it is much harder to solve this problem using programs without human judgement involved. Moreover, semi-structured data can also exist. That is, we might have data-level heterogeneity in addition to semantic-level heterogeneity.
This paper also gives some solutions. For example: perform some natural language processing (stemming, considering synonyms, hypernyms, ...) on the table and attribute names of the database; considering data types; considering the occurrence of similar data values; considering the schema structures and schema constraints. These heuristics are often combined to achieve better performance.
However, it is argued that these heuristics still consider only current evidence and neglect "past experience." Since the tasks are often repetitive, it would be a good idea to leverage past experience in similar domains. We can use previous examples as training data and apply machine learning techniques to help the schema matcher make matching decisions.
The main contributions of this paper are:
1. It introduces the semantic heterogeneity problem and discusses the hardness of this problem and where the hardness comes from.
2. It gives some proposed solutions, including the state of the art.
The drawback of this paper is the lack of examples and experiments showing the effectiveness of the approaches. It would be better to see how they work on real-life heterogeneous data.
Motivation of the paper:
Semantic Heterogeneity refers to the differences in database schemas developed by independent parties for the same domain. This is an important topic because data systems must understand each other’s schema in order to cooperate with each other. This article covers several common scenarios for which resolving semantic heterogeneity is important for building data sharing applications. The paper then covers why it is difficult to resolve semantic heterogeneity, and gives an overview of solutions in research and commercially. The paper ends with the main open problems and opportunities in this area.
Details on Semantic Heterogeneity:
One main scenario of semantic heterogeneity is Enterprise Information Integration (EII). An example is that obtaining a single view of a customer requires access to multiple sources. The earlier solutions, data warehousing and building custom solutions, suffered from stale data and high cost, respectively. Later solutions involved querying multiple data sources in real time, or peer-to-peer architectures for sharing data with rich structure and semantics. The key is to reconcile semantic heterogeneity, which is typically done through semantic mappings: expressions that specify how to translate data from one source to another while preserving the semantics of the data. In most data integration scenarios, over half the work comes from creating the mappings.

When querying and indexing the deep web, which is web content that resides in databases and is accessible behind forms, the content is not usually indexed by search engines because the crawlers employed by the engines cannot go past the forms. The main challenge here is the scale of the problem.

Merchant catalog mappings, or aggregating catalogs, are frequent examples of semantic heterogeneity. The problem is creating mappings between thousands of merchants and a growing number of recognized online retailers. Heterogeneity occurs in the schema and in the actual data values; full integration of data from multiple sources requires handling both semantic-level and data-level heterogeneity.

Schema heterogeneity issues are magnified with semi-structured data, because applications with semi-structured data are usually the ones that involve sharing data, semi-structured schemas are more flexible, and attributes can be added to the data at will. However, when integrating semi-structured data, it is often enough to reconcile just the attributes that are used for equating data across multiple sources.
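To make "semantic mapping" concrete: the article describes mappings as expressions that translate data between schemas while preserving meaning. A minimal executable sketch, with invented field names on both sides:

```python
def map_record(src: dict) -> dict:
    """Translate a record from an invented merchant schema to an
    invented aggregator schema, preserving the record's meaning."""
    first, _, last = src["contact_name"].partition(" ")
    return {
        "customer_first_name": first,          # split one field into two
        "customer_last_name": last,
        "zip": src["postal_code"],             # synonymous field names
        "price_usd": src["price_cents"] / 100, # unit conversion
    }

print(map_record({"contact_name": "Ada Lovelace",
                  "postal_code": "02139", "price_cents": 1999}))
```

In a real EII product the mapping would be expressed declaratively (for example as a query or view) rather than as ad hoc code, but the translation and unit-conversion work is the same.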
Common instances of issues resolving schema heterogeneity are that one schema may be a new version of another, two schemas may be evolutions of the same original schema, many sources are modeling the same aspects of the underlying domain, and there is a set of sources that cover different domains but overlap at the seams.
Schema matching cannot be done completely automatically; the goal is to reduce the time it takes humans to construct the mapping. One class of heuristics used for schema matching is schema element names, in which element names carry information about the semantics; challenges stem from homonyms and abbreviations. Another heuristic is data types, in which elements that map to each other have compatible data types; an issue here is that many schemas have underspecified data types. Data instances are another class of heuristics, for which elements from two schemas that match each other have similar data values. Schema structure matches elements in a schema that are related to other related schema elements; a challenge with this is finding the initial match that drives the similarity of its neighbors. Integrity constraints, on single attributes or across attributes, are useful for generating matches. Research focuses on solutions that combine multiple heuristics.
Strengths of the paper:
I enjoyed reading the paper because it provides a concise overview of semantic heterogeneity. It does a good job of presenting various real-world examples of where semantic heterogeneity is an issue. I also appreciated that it gives a thorough discussion of heuristics used for schema matching, and found the integration of linguistics with this problem an interesting cross-field connection.
Limitations of the paper:
I would've liked to see the paper provide quantitative experimental results for the different schema matching heuristics, and for various combinations of the heuristics. I would've also liked to see the paper name specific research and commercial methods that are leading solutions for semantic matching.
This paper, titled "Why Your Data Won't Mix: Semantic Heterogeneity," is an interesting insight into the state of semantic heterogeneity, with examples and concerns from 2005, ten years ago. Different applications that use the same type of data will inevitably have different ways of representing the data and the relations among it. When data needs change, or data is shared across applications or businesses, the schemas need to be matched with semantic mappings. This paper answers four important questions. When is this a problem? Why is it hard? What do people do now? What will people have to do in the future?
This is a higher-level paper. The author doesn't present any new algorithm or experiments; it is a paper about the field of semantic heterogeneity, what is done now and what will have to happen in the future. The clear descriptions of the present and future, and the key insights into the advancement of the field, are among the paper's strengths. This is an interesting problem because it manifests not just in the field of databases but in any system that shares data. This is a strong paper that is hard to criticize, because it is short and is more of a review of other work. The biggest downside I would state is that it is a little too high-level and does not include descriptions or comparisons of specific systems.
The paper describes the extreme case of having a fully automated way to match schemas, an incredibly difficult problem. The author describes the machine learning approach of learning from matched schemas in order to automatically map new ones. The key ideas the author contributes near the end, another strength of the paper, are that although we cannot yet accomplish this very difficult task, we could take a step in the right direction by providing users with more useful data visualization in schema matching programs, and that we have to think about a 'dataspace', that is, data in a larger context. The review of the problem, the state of the art, and the forward-looking issues in semantic heterogeneity make this an important paper.
Part 1: Overview
This paper presents the hardness of resolving semantic heterogeneity and points out some crucial facts about solving this problem. For example, Enterprise Information Integration (EII) needs to merge data from different databases in order to provide a "single view of customer". Two leading approaches are used for EII problems: data warehousing and building custom solutions. Reconciling semantic heterogeneity, specifically via semantic mapping, is key for a variety of data sharing architectures. In practice, however, specifying a semantic mapping is itself a key issue, and modern EII products provide tools for it. Another application is querying and indexing the deep web: some data on the web hides behind databases and can only be discovered by querying through forms. As with EII products, there are many ways to design the underlying data models.
The difficulty comes from both the domain of the business model and the technical structure. The same schema element may have different names in different databases, for example "IssueDate" and "OrderIssueDate"; a machine would not know they are the same. One schema may be a new version of another. Two schemas may be evolutions of the same original schema. We may also have a set of sources that cover different domains but overlap at the seams.
Part 2: Contribution
This paper identifies the hardness of resolving semantic heterogeneity and shows the reader the key reasons behind the problem. The complexity of business models is one crucial source of semantic heterogeneity, and it is inevitable because real-world problems are indeed complex.
The paper decomposes the problem of reconciling heterogeneity across thousands of web sources. It provides some useful heuristics, such as schema element names, data types, data instances, schema structure, and integrity constraints.
Part 3: Drawbacks
The paper states that it is impossible to have a universal standard, and at the end of the conclusion it argues that in the future semantics will become more and more complex, so that manual matching will no longer be feasible. I am not so convinced that a universal standard is truly impossible.
The paper discusses resolving semantic heterogeneity for data sharing applications. In a data sharing architecture, reconciling semantic heterogeneity is key. This problem is usually addressed by semantic mappings, which are expressions that specify how to translate data from one data source into another in a way that preserves the semantics of the data, or alternatively reformulate a query posed on one source into a query on another source. However, the key issue is the amount of effort it takes to specify a semantic mapping. While many tools are available to specify these mappings, they require an expert to specify the exact mapping between the two schemas.
The paper discusses the reasons why semantic heterogeneity is difficult to deal with. Reconciling semantic heterogeneity is so hard because the data sets were developed independently, and therefore varying structures were used to represent the same or overlapping concepts. In many cases, the data systems were designed for different business needs.
There are different kinds of problems that need to be understood in order to solve semantic heterogeneity, including the fact that one schema may be a new version of another, or that two schemas may be evolutions of the same original schema. Understanding these aspects makes it possible to find solutions tailored to a specific instance of semantic heterogeneity.
In addition, the paper mentions different solutions. Solving semantic heterogeneity is a heuristic, human-assisted process unless there are strong constraints in the schemas of the different data systems. For example, one heuristic is using schema element names: by looking at the names, possibly stemming the words first, a schema matcher can obtain clues about likely correspondences. The paper also surveys historical research addressing these issues.
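The element-name heuristic described above can be sketched minimally as follows; here, simple CamelCase tokenization stands in for true stemming, and the function names and thresholds are illustrative, not from the paper:

```python
import re
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Score how alike two schema element names are, in [0, 1]."""
    def tokens(name):
        # split CamelCase / underscored names into lowercase word tokens
        return {t.lower() for t in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", name)}
    ta, tb = tokens(a), tokens(b)
    jaccard = len(ta & tb) / len(ta | tb)                        # token-set overlap
    chars = SequenceMatcher(None, a.lower(), b.lower()).ratio()  # raw string similarity
    return max(jaccard, chars)

# "IssueDate" and "OrderIssueDate" share the tokens {issue, date},
# so the pair scores far above an unrelated pair.
related = name_similarity("IssueDate", "OrderIssueDate")
unrelated = name_similarity("IssueDate", "CustomerName")
```

A real matcher would also consult a synonym dictionary, since unrelated names can still denote the same concept.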
The main strength of the paper is that it discusses in depth why solving semantic heterogeneity is difficult. In addition, it exhaustively discusses the different research in these areas.
The main weakness is that there is no research component or analysis of the different works; most of the discussion consists of high-level comments on them.
This paper provides an overview of the topic of semantic heterogeneity. It answers three questions. First, why is it so important to handle semantic heterogeneity; stated another way, why are there so many scenarios where semantic heterogeneity arises, and why do we have to handle it by mapping different schemas? Second, why is it so difficult to do such mapping? Is it possible to do it automatically, or do we have to do it manually? Third, what are the solutions and opportunities here? Solving semantic heterogeneity is crucial for building data sharing applications.
First, there are many scenarios with semantic heterogeneity. For example, in order for an enterprise to obtain a "single view of customer", it must tap into multiple databases. Cases like this are inevitable because data comes from different sources. Especially for search engines that aggregate information from many sources, understanding thousands or millions of different schemas is unavoidable and yet impossible to do by hand.
One might ask why having a universal standard is not possible. It is hard to reach universal agreement on one fixed, rigid standard, and making the standard flexible is essentially the same as having different schemas.
Currently, we have two leading approaches: data warehousing and building custom solutions. The first still doesn't go beyond enterprise boundaries, and the second is expensive and hard to maintain. We have to do semantic mappings. The author points out that having computers assist humans in doing the mappings effectively and correctly is the way out. We can use good user interfaces and machine learning techniques to, first, let people focus on the hard parts of the mapping and, second, let people leverage past experience through suggested mappings. A machine-learned schema matcher that leverages previous experience is helpful.
At the end, the author also mentioned that, in the future, the schemas to match will be larger and more complex, which makes it impossible to do the visual matching manually. We have to come up with some novel ways of handling this problem of greater complexity.
This paper provides a complete overview of the semantic heterogeneity problem. It analyzes why this problem arises and why solving it is critical. I like the examples provided in this paper, which support its ideas very well.
The paper mentions that it is impossible to have a universal standard, but I think a few more examples of how such standards have failed would be desirable. Actually, I think a novel kind of standard is the future. Maybe we can have different levels of standards, so that even for the same thing, people have the option to use a lightweight or a complex schema. Transforming between schemas would then be easier.
This paper provides an overview of the problem of semantic heterogeneity across different databases.
The problem of semantic heterogeneity is inevitable because people face slightly different situations and think differently. Some interesting and important examples this paper gives are: 1) enterprise data; 2) deep web data, meaning data that is accessible only by filling in forms; and 3) merchant catalog mapping.
There are several levels or forms of heterogeneity:
1) Schema heterogeneity: different databases might use different schema names for the same set of concepts.
2) Data heterogeneity: the same object is referred to by different names, for example, IBM and International Business Machines.
3) Semi-structured data: this is essentially schema heterogeneity, but the fact that new tags can be added at will makes it different, because variations are more likely to appear.
Solving heterogeneity is hard because the mapping between one schema and another might not be one-to-one. An element in one schema might map to none, one, or a set of elements in the other schema, which makes the problem combinatorially difficult.
Currently, people develop tools to aid this mapping. These tools provide suggestions to a domain expert to assist the mapping process. Heuristics are built using knowledge from: 1) schema element names; 2) data types; 3) data instances; 4) schema structure; 5) integrity constraints. I think any property that an element has could theoretically be used.
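To make the heuristics concrete, here is a toy sketch of how several clue scores could be blended into one suggestion score; the column-descriptor format, weights, and sample data are all invented for illustration:

```python
from difflib import SequenceMatcher

def clue_scores(col_a, col_b):
    """Compute three of the clues as scores in [0, 1].
    Columns are dicts with 'name', 'dtype', and 'samples' keys (illustrative)."""
    name = SequenceMatcher(None, col_a["name"].lower(), col_b["name"].lower()).ratio()
    dtype = 1.0 if col_a["dtype"] == col_b["dtype"] else 0.0
    union = set(col_a["samples"]) | set(col_b["samples"])
    shared = set(col_a["samples"]) & set(col_b["samples"])
    instances = len(shared) / len(union) if union else 0.0
    return {"name": name, "dtype": dtype, "instances": instances}

def match_score(col_a, col_b, weights={"name": 2, "dtype": 1, "instances": 2}):
    """Blend the clue scores into a single weighted match score."""
    s = clue_scores(col_a, col_b)
    return sum(weights[k] * s[k] for k in weights) / sum(weights.values())

issue = {"name": "IssueDate", "dtype": "date", "samples": ["2005-01-03", "2005-02-17"]}
order_issue = {"name": "OrderIssueDate", "dtype": "date", "samples": ["2005-02-17", "2005-03-01"]}
customer = {"name": "CustomerName", "dtype": "varchar", "samples": ["IBM", "Acme Corp"]}

# several weak clues agreeing beats any single clue on its own
score_good = match_score(issue, order_issue)
score_bad = match_score(issue, customer)
```

The design point is exactly the one the review makes: each clue alone is unreliable, but agreement among them is a much stronger signal.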
The paper also introduces a new trend in this research area, machine learning: the mapping process can be seen as a learning problem when you try to merge a large set of similar schemas.
Finally, the author shares his predictions for future challenges: larger schemas and more complex data sharing environments.
Written in 2005, this paper makes a very good introduction to this research area. It covers the source of the problem, why it is hard, how people currently solve it, and future challenges.
The author thinks that one future challenge will be larger schemas, containing thousands of elements. Ten years have passed since, but this has not come true. Semi-structured schemas also did not become everyone's choice. Other than that, I think the paper serves its introductory purpose well.
This paper addresses the problem of semantic heterogeneity – the differences between database schemas developed for the same domain by independent parties – and some possible solutions to it.
When trying to merge two data sources belonging to the same domain, the sources need to be reconciled by semantic mappings, i.e., fields with different naming conventions but referring to the same object are mapped to each other. Semi-structured data gives the independence of adding attributes at will, making the data structure suited to the current purpose; even though using a fixed schema throughout would solve this issue, the flexibility of semi-structured data is hard to overlook. A few of the heuristics used for schema matching are schema element names, data types, data instances, and schema structure. Taken together, these heuristics yield a relatively robust solution, whereas individually each one may not be good enough.
One of the important thoughts of this paper is that only the attributes that will actually be used across systems need be mapped at the beginning, thereby reducing the size of the problem significantly given the heuristics described above. Another proposed idea is to use all the previously available schema definitions and semantic mappings to train a machine learning algorithm, partially automating the mapping procedure and leaving only the most ambiguous and difficult mappings to the expert. This paper addresses many of the issues around semi-structured data and schema differences and draws conclusions that are worth considering.
One thing this paper really lacked was results from the www.everyclassified.com site, or examples of the machine learning algorithm identifying similarities between two different schemas. There is also the consideration that if the schema matching algorithm does not work well even for a few data fields, it might increase the expert's workload, which makes you wonder whether manual semantic mapping is the way to go until a good working solution is found.
Theoretically, the idea seems worth exploring, but as the author himself puts it, the problem is close to intractable in a completely automated scenario.
Why Your Data Won't Mix: Semantic Heterogeneity paper review
In this paper, the author introduces the problem of semantic heterogeneity, which refers to the situation where database schemas or datasets for the same domain are developed by independent parties, resulting in differences in the meaning and interpretation of data values. There are five major sources of semantic heterogeneity:
1) Enterprise Information Integration (EII): because today's enterprises store data in multiple sources, one needs to merge data from many DBMSs and legacy systems to create a single view.
2) Querying and indexing the deep web: deep web content is often not visible on the surface web, and a crawler needs to understand the different schemas of the forms in order to access the data behind them.
3) Merchant catalog mapping: each supplier carries its own catalog for the items it supplies, but the buyer often needs to transform all of its suppliers' catalogs into its own catalog format.
4) Schema versus data heterogeneity: not only do schemas vary, but across systems the values for the same attribute can differ, so one also needs to understand variations of the same value.
5) Schema heterogeneity and semi-structured data: sometimes the source data themselves are not structured, so the challenge becomes retrieving and reconciling the needed attributes from loosely organized data sources.
The author states that it is hard to reconcile schema heterogeneity because, most importantly, the data sets were developed independently, and hence varying structures are used to represent the same concepts. When trying to solve the problem, it is important to know that 1) one schema may be a new version of the other, 2) the two schemas may be evolutions of the same original schema, 3) we may have many sources modeling the same aspects of the underlying domain (horizontal integration), and 4) we may have a set of sources that cover different domains but overlap at the seams (vertical integration).
As another major contribution, the author explains that the state-of-the-art methods for resolving schema heterogeneity are based on heuristics. In the schema matching process, it is very important to use schema element names, data types, data instances, schema structure, and integrity constraints together to predict possible pairing candidates. Unfortunately, even with those heuristics, the reconciling results are not ideal, and today's industrial players mostly provide human-assisted matching tools with visualization to help finish the job. Finally, the author states that it is very important to learn from past schemas to help predict the meaning of future ones and hence produce better matchings.
To sum up, this paper serves as a very good introduction to the current and future trends in solving semantic heterogeneity, and in many ways the author shows why the current methods are limited. However, in my humble opinion, there is still a fundamental drawback: even for this kind of introductory scientific paper, it is not reasonable to rely on reasoning alone to convince the reader of the limitations of an engineering method. To better explain why current technology for semantic heterogeneity reconciliation doesn't work well, the author should have conducted tests measuring the accuracy and cost of finishing the reconciliation job.
This paper discusses in depth several aspects of resolving semantic heterogeneity: its motivation, its difficulties, and current strategies for solving it.
Semantic heterogeneity refers to different schemas, developed by independent parties, that describe the same domain of information. In many scenarios, such as enterprise information integration (EII), where an enterprise needs a single view of the customer across multiple databases, semantic matching is required to resolve the heterogeneity.
The difficulty of the problem is that the structure of schemas varies even when they represent the same concept. In addition, it is challenging to make a machine understand the meaning of a schema, which makes matching difficult and laborious.
The author points out that entirely autonomous schema matching is infeasible, and the goal should be reducing the time a human expert spends matching a schema. Several heuristics help, such as schema element names and schema structure. Current commercial semantic mapping tools offer only a visual interface for the user to specify matched structures, without any recommendations. In the research area, researchers have been trying to leverage past experience; machine learning is widely studied here to group concepts and give matching suggestions based on previous matching results.
This paper gives a blueprint. It summarizes the problems of solving semantic heterogeneity and points out several possible directions for future research. The paper shows the essential difficulties in developing matching tools and the current trends in the research area. In addition, the author points out challenges that might be encountered in future research: searching larger schemas and managing dataspaces.
Though the paper explains well the motivation for applying machine learning, it fails to mention current work on this method or any statistical evaluation or comparison. Machine learning seems to be the only solution the author brings up, and he offers only a general intuition, without describing how the method is actually implemented or what phase the current research is in.
This paper discusses the primary problem in mixing multiple data sources: resolving schema differences. Data schemas differ for a multitude of reasons, and the authors identify five scenarios where heterogeneous data must be combined:
1. Data from multiple databases within a business need to be integrated.
2. Retrieving data behind web forms for indexing.
3. Aggregating catalog data from multiple merchants (similar to 1 but data sources are external).
4. Data heterogeneity i.e. different representations for the same concepts.
5. Semi-structured data sources allow for greater flexibility than relational database schemas and so they also contain large amounts of variability.
The authors summarize existing techniques for creating schema mappings and suggest that the problem instead be approached by using past experience to develop a set of concepts (with variations of their representations), relationships, and soft constraints. To create mappings for large schemas, rather than visualizing the problem as lines between two lists of fields, the authors suggest presenting the interface as a search engine which ranks elements in the target schema by how similar they are to the current element, allowing the user to select the correct one.
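The search-engine-style interface described above could rank candidates along these lines; this is a minimal sketch in which plain name similarity stands in for the full set of clues, and all field names are invented:

```python
from difflib import SequenceMatcher

def suggest_matches(source_field, target_fields, top_k=3):
    """Rank target-schema fields by similarity to one source field, so the
    user picks from a short suggestion list instead of scanning thousands
    of lines connecting two full schemas."""
    return sorted(
        target_fields,
        key=lambda t: SequenceMatcher(None, source_field.lower(), t.lower()).ratio(),
        reverse=True,
    )[:top_k]

targets = ["OrderIssueDate", "ShipDate", "CustomerName", "OrderTotal", "InvoiceNumber"]
suggestions = suggest_matches("IssueDate", targets)
```

For a schema with thousands of fields, showing a short ranked list per source field is the difference between a tractable review task and an unreadable tangle of lines.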
This paper provides uninformed readers an easy-to-understand overview of the problem of creating mappings between data schemas, current approaches, and possible future solutions. This paper was able to justify why the problem is hard and why solving it is necessary. Unfortunately, while the solution the authors present of using past experience is attractive, the authors do not justify their proposed attempt at modeling the problem; considering how difficult the problem appears to be, without an implementation that has demonstrated strong performance it is difficult to believe that this approach will work. Lastly, towards the end of the paper, the authors move into the tangential topic of managing dataspaces which felt out of place.
This paper describes when we will see semantic heterogeneity in database applications, why this is a difficult problem to solve, and what some of the current approaches are to solving this problem. Semantic heterogeneity can come from many different sources. One of these is integrating many different database systems for an enterprise. Customer information may be spread out across multiple databases and companies may acquire more data sources as a result of mergers and acquisitions. Another source is the deep web because many search engines do not crawl these databases and it is difficult to aggregate all of these data sources as there is no standard schema. Finally, there is catalog mapping where a website has to map products to the merchants that sell them, but the merchants may provide different schemas to the website.
One of the reasons that schema heterogeneity is difficult is that you need to understand both the business meaning of a schema and how to transform one schema into another. The problem is even worse for programs, because the computer cannot tell the intent behind the schemas; for example, it will not know that "IssueDate" and "OrderIssueDate" refer to the same column. Finally, we need to take into account vertical integration, where schemas overlap but one contains information that another doesn't.
The following are the positives of this paper:
1. The author collects all of the state of the art projects in this area and presents it in one of the sections. It also contains all of the heuristics that are being used by these systems.
2. I like how the author also looks at what we can learn from approaches used in the past so that we do not repeat the same mistakes and that we can take the positives from the past as well.
Overall, this is a great paper that merges all of the difficulties of schema heterogeneity together, but I still have the following concern about the paper:
1. The author does give all of the state of the art ways to resolve schemas, but what are the top companies like Google, Facebook and Microsoft using to solve this problem? Are they just committing engineers to fixing these schema heterogeneity problems or do they have proprietary software that does it for them?
This paper discusses the difficulties that semantic heterogeneity introduces when trying to integrate data from multiple sources or applications. Companies today often have data spread between multiple database systems, legacy applications, XML files, etc. In trying to present a unified view of this data, it becomes necessary to find a common interpretation of multiple schemas. Merging these schemas is very difficult, as fields that may represent the same data often have different names, different input formats, and different locations within forms. Therefore, an automated solution to this problem requires some degree of natural language processing and interpretation of what the schema designers intended the data in each field to represent.
Improving our ability to automatically resolve semantic heterogeneity is a key area of research for several reasons. First, it is expensive and time-consuming to have employees compare different schemas and create custom mappings for each set. Programs that suggest mappings between schemas are critical resources, especially since many schemas are far too large to compare manually, often spanning thousands of columns. Second, the ability for machines to understand heterogeneous data is critical to indexing the deep web. Many sites hide the vast majority of their data behind forms, providing access to content through a simple lookup interface, rather than creating a multitude of sorted or hierarchical pages to display everything. Because the content of these sites is dynamically generated by lookups, it is inaccessible to web crawlers that cannot understand the form. This prevents much of the content from being indexed and can make discovery of these sites very difficult for users.
The current state of the art for mapping schemas takes advantage of several clues relating fields in one schema to corresponding fields in another. Fields are typically matched based on a combination of similar element names, similar data types, similar data ranges, and similar integrity constraints. While these clues are typically not enough to give an exact match for every piece of a schema, they are useful in generating suggestions that can be accepted or rejected manually. Some principles of machine learning are also being leveraged in an attempt to allow schema matching programs to learn from past data and refine their suggestions based on a body of evidence.
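One way the "similar data ranges" clue mentioned above could work for numeric columns is sketched below; the helper name and sample values are invented for illustration:

```python
def range_overlap(values_a, values_b):
    """Fraction of the combined numeric range that two columns' sample
    values share; disjoint ranges score 0 (a coarse instance-data clue)."""
    lo = max(min(values_a), min(values_b))
    hi = min(max(values_a), max(values_b))
    span = max(max(values_a), max(values_b)) - min(min(values_a), min(values_b))
    return max(0.0, hi - lo) / span if span else 1.0

prices_a = [9.99, 149.50, 250.00]   # looks like a price column
prices_b = [15.00, 89.99, 300.00]   # overlapping range: plausible match
years = [1998, 2001, 2005]          # disjoint range: likely a year column

overlap_good = range_overlap(prices_a, prices_b)
overlap_none = range_overlap(prices_a, years)
```

Like the other clues, this one is only suggestive on its own (two unrelated columns can share a range), which is why the tools combine it with names, types, and constraints.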
My chief concern with this paper is that it offers few suggestions about how the issue of mapping heterogeneous schemas can be improved in the future. While they do a good job of summarizing the current state of the art and listing a few ideas that are in development, they don’t really contribute anything of their own. Additionally, I don’t think they accurately represented how difficult matching these schemas can be. They state that schemas can be large and that only the designer really knows what each field was intended for, but I felt as if they could have provided more examples of how ambiguous the meaning of different fields can be and how much of a struggle it really is to resolve these issues by hand.
This paper talks about the problem of semantic heterogeneity. This problem appears in the presence of multiple XML documents, web services and so on. Whenever there is more than one way to structure a body of data, we would need to deal with semantic heterogeneity. This paper reviews several scenarios in which semantic heterogeneity is crucial, and explains why this problem is difficult. Then, the paper reviews some recent research and points out the key opportunity in this area.
First, the paper describes some scenarios in which semantic heterogeneity is important, including enterprise information integration, querying and indexing the deep web, merchant catalog mapping, and semi-structured data. In enterprise information integration, in order to obtain a specific view, the enterprise has to access multiple sources. Queries are translated on the fly to appropriate queries over the individual data sources. In deep web, the web content resides in databases and is accessible behind forms. Therefore, the paper first lists some scenarios to emphasize the importance of semantic heterogeneity.
Second, the paper discusses the reasons why semantic heterogeneity is hard and covers some recent research. The fundamental reason semantic heterogeneity is so hard is that the data sets were developed independently, so varying structures were used to represent the same or overlapping concepts. The problem requires both domain and technical expertise. One main approach to dealing with semantic heterogeneity is leveraging past experience: we can learn domain concepts and their representational variations, relationships between concepts, and domain constraints. In addition, the paper discusses future directions, including larger schemas, schema search, and managing dataspaces.
The strength of this paper is that it provides a complete introduction to the problem of semantic heterogeneity, covering motivations, scenarios, and recent research. Most importantly, it describes several real-life scenarios to emphasize the importance of this problem, which helps illustrate its ideas.
The weakness of this paper is that it does not propose any new methods for dealing with semantic heterogeneity; it is more a summary of this important problem. It would be better if it proposed some new approaches to the difficulties it mentions.
To sum up, the paper talks about the motivation for the problem of semantic heterogeneity, and explains why this problem is difficult. In addition, the paper reviews some recent research and points out some opportunities in this area.
This paper is an introduction to semantic heterogeneity, why it is an important issue, why it's hard, and what current techniques are. Semantic heterogeneity is basically different schemas or ways to represent the same data. This is a problem when different independent parties build something over the same data and those two parties need to interact with each other using that data.
The paper lists some common times this happens, some of them are:
1. Enterprise Information Integration: an enterprise wants to build one single view of a customer based on many different databases, some of which cover the same material. This can also arise when a company wants to present one external form of its data to others, which requires merging multiple databases.
2. Querying and Indexing the Deep Web: “The deep web refers to web content that resides in databases and is accessible behind forms”. Much of this data is not crawled because crawlers don’t understand the fields and get confused. Also website designers can design things in many different ways.
3. Merchant Catalog Mapping: When a merchant tries to aggregate many different product catalogs into one larger one. These catalogs can be stored using differing formats.
4. Semi-structured data: this is the most common form of the problem, because most forms of semi-structured data are built for the purpose of sharing between parties.
What is difficult about this issue:
1. It has traditionally been done entirely by humans, which is slow and costly.
2. The same schema element can have different names (making it hard for a computer to match).
3. The same data can be represented in many different (yet still similar) ways with different schemas (again making it hard for a computer to match).
4. One schema may cover aspects of the data that another schema may not cover, so it's not a one-to-one match. Nor is one schema necessarily a subset of the other; each could have data in common plus things the other doesn't have (think of a Venn diagram between two schemas).
Current strategies for mapping two schemas to each other:
1. See if the element names match each other directly (but this is hard because use of synonyms is common).
2. See if the elements have the same data types and values. Likely if they are covering the same data they will have the same underlying data type and will share common values if there are any.
3. Integrity constraints and object hierarchy are also good clues. If two elements have the same (or related) parent element in different schemas, they are likely the same element. Also, if something is an integrity constraint in one schema, the corresponding integrity constraint of another schema over the same data will likely correspond to that element.
The paper proposes a solution that is essentially machine learning: build a system that learns from past mappings done by humans in order to predict future mappings with better accuracy. It can learn from the kinds of clues listed in the section directly above.
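The learning idea can be sketched as a toy nearest-neighbour matcher over past human decisions; the feature set, data format, and examples are all invented here, and real systems would use far richer features and models:

```python
from difflib import SequenceMatcher

def features(a, b):
    """Feature vector for a candidate pair: name similarity and type match."""
    return (SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio(),
            1.0 if a["dtype"] == b["dtype"] else 0.0)

def predict(pair, history):
    """Label a new candidate pair with the decision of the most similar
    past pair (1-nearest neighbour in feature space)."""
    fx = features(*pair)
    def dist(rec):
        fa = features(rec[0], rec[1])
        return sum((x - y) ** 2 for x, y in zip(fx, fa))
    return min(history, key=dist)[2]

history = [
    ({"name": "IssueDate", "dtype": "date"},
     {"name": "OrderIssueDate", "dtype": "date"}, True),   # human confirmed
    ({"name": "IssueDate", "dtype": "date"},
     {"name": "CustomerName", "dtype": "text"}, False),    # human rejected
]
new_pair = ({"name": "ShipDate", "dtype": "date"},
            {"name": "OrderShipDate", "dtype": "date"})
decision = predict(new_pair, history)
```

The point is only the shape of the idea: each human-verified mapping becomes a labeled training example, so the matcher's suggestions improve as more mappings are completed.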
Overall I think this was a great paper introducing semantic heterogeneity and its problems and potential solutions. I think it is somewhat weak in the solutions it lays out, as they are very high-level ideas that haven't been fleshed out at all, but for an introduction paper I think it is sufficient. This was a great paper, definitely worth the read!
The purpose of this paper is to review the state of semantic heterogeneity — the case where data is architected differently within the same domain. In particular, this paper reviews why resolving these differences is important in creating fluid cross-platform applications. However, this resolution is difficult, and there is an active area of research dedicated to resolving contentious semantic heterogeneity, which the paper also goes over.
Enterprise level systems typically have a wide variety of data sources grown over time, built either because needs that were once singular or disjoint called for such databases or because data between merging companies has to be combined through time. Because of this, many different sources must be queried to achieve what should be one modular “chunk” of data. XML platforms were typically adopted in enterprise systems based on their flexibility and extensibility. However, because of this, much manual work needs to go into specifying a “semantic mapping” which details exactly how enterprise-specific data is structured. In addition, the specific problem of indexing and querying the deep web is two-fold: first, actually accessing web data content is difficult because, if data is behind a form, it is typically not accessed by a search engine. Second, the wide variety of databases that can exist on the web changes the order of magnitude of the semantic heterogeneity problem in itself. They also mention the problem in product merchandising (how does one map items in a catalog if they can possible belong to multiple categories to a merchant but different categories to a retailer?), data entropy (the same data being represented by different types, or even logically being represented differently e.g. IBM vs International Business Machines), and semi-structured data (i.e. how should attributed be interpreted if they are only loosely specified).
There are many reasons that contribute to the vastly different structures of databases, and hence to semantic heterogeneity. The first, and most intuitive, is that it is a byproduct of human nature; different people will invariably think about the same problem in at least slightly different ways. One needs to be familiar with business-level needs in order to speak fluently in the language of semantic heterogeneity, and to have the technical skill to know what is feasible (not to mention to translate this knowledge into an automated system). In addition, there is a myriad of other reasons that data might be heterogeneous: schemas may be snapshots taken at different times, schemas may have diverged from an original structure, schemas may be multiple different ways of representing the same latent information, or schemas may represent different underlying information that is not entirely disjoint (i.e., they overlap).
Resolving semantic heterogeneity is, as stated, an inherently heuristic task, and this section of the paper describes several features that may be used for heuristic matching of heterogeneous data: schema element names, data types, data values, integrity constraints, and (this seems to be the weakest) schema structure.
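As a rough illustration of how such features might be combined, here is a minimal matcher that scores a pair of schema elements by name similarity and data-type agreement. The 0.7/0.3 weights and the dict representation of an element are my own assumptions, not from the paper:

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """String similarity between normalized element names."""
    norm = lambda s: s.lower().replace("_", " ").strip()
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

def match_score(elem_a: dict, elem_b: dict) -> float:
    """Combine name and data-type evidence into one heuristic score.

    The weights are arbitrary placeholders; a real matcher would tune
    (or learn) them and fold in instance data, structure, and constraints.
    """
    score = 0.7 * name_similarity(elem_a["name"], elem_b["name"])
    if elem_a["type"] == elem_b["type"]:
        score += 0.3
    return score
```

Even this toy version shows why no single feature suffices: two unrelated columns can share a type, and related columns can have wildly different names.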
I found most of this to be very interesting, and it read very much like a Stonebraker paper in that it was structured as a review of the state of things, with an educated take on how these ideas play out in industry. However, I am a little concerned about the last part, which discusses "learning from past experiences with machine learning." This part describes very generally how machine learning finds patterns in information and how that general concept could be applied, but there is not much discussion of how involved and complicated this training is, or of relevant work in any of these areas. I personally think this section should either take a more informed view of machine learning applied to semantic heterogeneity, or should not be here at all (the "schema search engine" proposal really feels like when my mom says "you know what would be a good app?"). Otherwise, the review of open problems in semantic heterogeneity is very thorough.
The paper talks about the causes and effects of semantic heterogeneity, the existing solutions, and possible future solutions. The motivation is that there is a growing number of data sources of multiple types, because applications in the same domain are developed by different, independent parties.|
The paper begins by explaining situations where semantic heterogeneity can arise: enterprise information integration (due to the number of specialized applications and to data from mergers/acquisitions), querying and indexing the deep web (due to website designs that affect the placement of data), merchant catalog mapping (due to data exchange between a merchant and a distributor), schema vs. data heterogeneity (due to differences in data values), and lastly, schema heterogeneity in semi-structured data (due to the nature of semi-structured data). Next, the paper explains why resolving semantic heterogeneity can be so difficult (it is not as simple as enforcing a "standard schema"): one schema may be a new version of the other, the two schemas may be evolutions of the same original schema, or they may result from horizontal or vertical integration. It then elaborates a little on existing solutions by explaining the heuristics used to resolve heterogeneity: schema element names, data types, data instances, schema structure, and integrity constraints. These heuristics cannot be used in isolation. From there, it discusses an emerging solution: leveraging past experience. Last, the paper has us consider possible future solutions, given the growing size of data and the growing variety of data sources. It mentions briefly Woogle, a web-service search engine. It also mentions things that can be learned from the past: domain concepts and their representational variations, relationships between concepts, and domain constraints. Two major challenges for future development are dealing with larger schemas and dealing with vastly more complex data-sharing environments. Facing these, the author emphasizes two possible solutions: a schema search engine and dataspace management.
One contribution of this paper is that it highlights the problem of semantic heterogeneity in a holistic manner. It leads us to think about semantic heterogeneity as a general problem, not an isolated one exclusive to a few applications. It summarizes the state of the art of the heuristics used to resolve the problem pretty well and, more importantly, it guides us in anticipating where and how the problem of multiple data sources could develop and what a better solution might look like by then. Overall, this paper serves as a good starting point for trying to resolve the semantic heterogeneity problem as a whole.
However, I think this paper reads more like speculation than a real solution. While it does point out many things that can be learned from "the past" and mentions one example (Woogle), in the end there is no concrete solution that we can derive from it ourselves (since Woogle is for web services, not schemas). It would be more helpful if the paper also pointed out the possibility of building such a solution using the existing heuristics (i.e., which combination of heuristics to use and how). It also does not consider the hassle of collecting schemas; unlike web services, schemas are not something that can simply be crawled over the internet.
The purpose of this paper is to provide a survey of schema matching and the semantic heterogeneity problem. It finishes by offering some suggested future paths to explore to better solve this problem. |
Though this is more of a survey paper than anything else, it does a good job of presenting the problem of schema matching. It lays out the various heuristics that existing methods use (often in combination) to perform schema matching between disparate data stores. It also motivates and establishes the problem by considering some real-world cases where it comes up. Finally, the paper ends with the author's thoughts on how this problem could be solved better. He mentions several more expansive areas to consider, as well as a dataspace concept to allow for better data management and the capturing of more complex relationships, not just between data, but between the different participants in the related data stores.
As far as strengths go, I think this paper does a good job of establishing that a problem exists and of covering the methodologies that currently exist to (partially, and perhaps not very well) solve it. The author then states the main takeaways from currently implemented heuristics before presenting his ideas for the future of this field, which I think are insightful.
However, I think this paper has several weaknesses. The author's ideas for change should have been a more major component of the paper; they are tacked on at the end, and the paper lacks a conclusion section. The reader is left completely hanging, without even a paragraph to wrap up the paper. There is no summary of the flow of ideas leading to the author's contributions, and without such a summary, I think the potentially powerful and certainly insightful ideas the author presents for the future of this field fall quite flat.
Paper Review: Why Your Data Won’t Mix: Semantic Heterogeneity|
This is a review paper focusing on the issue of semantic heterogeneity: when data structures have multiple possible representations, different parties of a database system can end up with different schemas, with consequences that follow. The paper starts by reviewing some common scenarios where the issue can be severe, followed by some recent developments in solutions to the problem. At the end, the paper offers some opinions on the open problems and opportunities in this field.
My understanding from this paper is that there are two types of semantic heterogeneity: extrinsic and intrinsic. Extrinsic semantic heterogeneity comes from the design and implementation of database schemas, while intrinsic semantic heterogeneity comes from the data itself; for example, data from different sources could refer to the same objects in different ways.
The difficulty of solving the issue comes from the fact that data sets are developed independently, which results in varying structures being used to represent the same or overlapping concepts.
A possible weak point is that the paper itself doesn't offer any novel ideas, methods, or models. It focuses on providing an overview of the issue and a review of what has been done elsewhere.
In Section 5, the author claims that one emerging solution is to leverage past experience. The description of this direction sounds much like the domain adaptation problem in the machine learning literature. Recently developed machine learning algorithms have achieved significant success in learning highly abstract representations of data from varying domains and matching them in that abstract space. Working this way helps a machine learning model ignore irrelevant properties and extract the relevant ones to work out the matching. I think this may be a promising direction for research on solving semantic heterogeneity.
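As a toy stand-in for such learned representations (my own sketch, assuming nothing from the paper), character n-gram profiles already capture the core idea: map names from different schemas into a shared vector space and compare them there, so that surface differences like separators are largely ignored:

```python
from collections import Counter
from math import sqrt

def ngram_profile(name: str, n: int = 3) -> Counter:
    """Character trigram profile of a schema element name; a crude
    hand-crafted stand-in for a learned abstract representation."""
    padded = f"  {name.lower()}  "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c * v[k] for k, c in u.items())
    norm = lambda w: sqrt(sum(c * c for c in w.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0
```

A learned model would replace the hand-built trigram profile with embeddings trained across domains, which is where the domain-adaptation analogy comes in.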
This paper discusses several aspects of data heterogeneity. It explains why the problem is important both in industry and in research, since incorporating an increasing number of heterogeneous data sources is becoming common in large systems at companies, organizations, etc. Moreover, the problem is very difficult, since different schemas are developed for data in the same domain. The paper divides semantic heterogeneity into five major categories: enterprise information integration, querying and indexing the deep web, merchant catalog mapping, data heterogeneity itself, and semi-structured data with no coherent fixed schema.|
This paper also introduces the two steps of the reconciliation process: (1) find the corresponding pairs of elements that refer to the same objects, and (2) create schema mapping expressions. The schema features exploited are element names, data types, data instances, schema structure, and integrity constraints. Moreover, machine learning techniques can be applied to learn schema similarity from data instances.
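The instance-based heuristic mentioned above can be sketched as inferring a coarse value pattern for a column from its data; the small pattern set below is an assumption of mine for illustration, not the paper's method:

```python
import re

def infer_value_pattern(values: list[str]) -> str:
    """Guess a coarse category for a column from its instance values.

    Two columns whose values share a pattern (e.g. both look like
    emails) are more likely to match, even if their names differ.
    """
    if values and all(re.fullmatch(r"-?\d+", v) for v in values):
        return "integer"
    if values and all(re.fullmatch(r"[\w.+-]+@[\w-]+\.[\w.-]+", v) for v in values):
        return "email"
    return "text"
```

A learned matcher would generalize this by training a classifier over many such value features instead of hand-coding the patterns.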
1. This paper gives a detailed and solid taxonomy of different directions in the research community that have already tried to solve the heterogeneous data source problem.
2. This paper uses several vivid examples to illustrate its ideas; for example, it uses the fact that the students in the author's class designed the same schema differently to show how people think differently and why schema matching is an important area.
1. This paper only introduces the problem of heterogeneous data sources and the current methods for solving it, but does not present any experimental comparison of the existing methods or any new solution to the problem.
This paper discusses semantic heterogeneity. It is common for different parties to independently develop database schemas for the same domain. Such schemas can differ greatly from each other, owing to the human tendency to express the same idea in various ways. Those differences are called semantic heterogeneity. The paper first introduces a few scenarios where semantic heterogeneity arises. One such scenario is merchant catalog mapping. For example, Amazon may prescribe a common product data schema for the merchants that list items on it; the merchants usually manage their product catalogs in their own formats, and thus need a tool to map their internal product data schemas to Amazon's. Such use cases are common among large web merchandise retailers. Despite the strong need for commercial schema-heterogeneity reconciliation tools, this is not an easy task. The fundamental reason is that the data sets were usually developed independently, and the same or overlapping concepts are represented by varying structures. The same attribute can be expressed with different words, or the same word may represent different semantics. Current solutions always involve a large amount of human interaction.|
The state-of-the-art approach to schema heterogeneity is a heuristic, human-assisted process. The goal is to reduce the time it takes a human expert to map a pair of schemas, eliminating unnecessary information to help them focus on the most ambiguous parts of the mapping. Reconciling semantic heterogeneity usually involves two steps, namely schema matching and schema mapping. Schema matching aims to find the corresponding pairs of elements of the two schemas. Common solutions take the element names, data types, data instances, schema structure, and integrity constraints as heuristics. Schema matching yields a smaller problem space so that a human can work on it in less time. Schema mapping is usually human-assisted. All these current solutions share the limitation that they only exploit evidence present in the schemas being matched. The author also points out that past experience can be leveraged to help future schema matchings: previously, manually constructed schema mappings can provide insight into future tasks. In detail, a collection of previously studied schemas and mappings is treated as a corpus, and the goal of analyzing this corpus is to provide hints about deeper domain concepts at a finer granularity. Further problems of semantic heterogeneity involve dealing with drastically larger schemas and working with vastly more complex data-sharing environments.
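"Leveraging past experience" could, at its simplest, mean treating previously confirmed matches as a corpus and using their frequencies as priors for new matching tasks. The corpus entries and function below are invented purely for illustration:

```python
from collections import Counter

# Hypothetical corpus of previously confirmed (source, target) matches.
past_matches = [
    ("cust_name", "customer_name"),
    ("cust_name", "customer_name"),
    ("cust_name", "client"),
]

confirmed = Counter(past_matches)

def match_prior(source: str, target: str) -> float:
    """Fraction of past confirmations of `source` that mapped to `target`."""
    total = sum(c for (s, _), c in confirmed.items() if s == source)
    return confirmed[(source, target)] / total if total else 0.0
```

A corpus-based matcher would combine such priors with the per-pair heuristics rather than replace them, which matches the paper's point that no single source of evidence suffices.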
The main strength of this paper is that it provides a high-level overview of semantic heterogeneity. The author uses several concrete use cases to clearly illustrate the importance of dealing with semantic heterogeneity in data integration.
One drawback of this paper is that the proposed solutions to semantic heterogeneity are mostly heuristics, and the author didn't show any evaluation of those heuristics. It's hard to see how effective they are.