Review for Paper: 36-Why Your Data Won't Mix: Semantic Heterogeneity

Review 1

This paper introduces the problem of semantic heterogeneity in databases and surveys existing and emerging solutions to it.

The paper first reviews common scenarios in which resolving semantic heterogeneity is crucial. The core issue is that people need to access and analyze data from multiple sources, and building a customized solution for each case would be too tedious. Another issue is that accessing the deep web is a challenge for both content providers and search engines.

Then the paper discusses why it is hard to propose a solution. The fundamental reason is that data sets are usually developed independently, and therefore varying structures are used to represent the same or overlapping concepts.

A state-of-the-art solution, according to the paper, should include two parts: schema matching (finding correspondences between pairs of elements) and creating the actual schema mapping. There is also a trend toward building solutions that leverage past experience with other schemas; machine learning could be of great use in learning from that experience.
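
To make the two parts concrete, here is a minimal Python sketch; the schemas, field names, and the cents-to-dollars conversion are hypothetical, not from the paper. Schema matching produces element correspondences, while schema mapping turns those correspondences into an executable transformation.

```python
# Hypothetical schemas for the same "book" domain.
SCHEMA_A = ["title", "author", "list_price_usd"]
SCHEMA_B = ["BookTitle", "Writer", "PriceCents"]

# Part 1 -- schema matching: correspondences between element pairs,
# typically proposed by a tool and confirmed by a human expert.
MATCHES = [("title", "BookTitle"), ("author", "Writer"),
           ("list_price_usd", "PriceCents")]

# Part 2 -- schema mapping: an executable expression that translates a
# record from schema B into schema A while preserving its semantics.
def b_to_a(record_b):
    return {
        "title": record_b["BookTitle"],
        "author": record_b["Writer"],
        "list_price_usd": record_b["PriceCents"] / 100,  # cents -> dollars
    }

print(b_to_a({"BookTitle": "Dune", "Writer": "Herbert", "PriceCents": 999}))
```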

I like this paper because it uses good examples to show why semantic heterogeneity is an important issue for database content providers and consumers, why resolving it is hard, and what the trending solutions are. For example, Figure 1 shows a manual annotation reconciling two disparate schemas; it helped me understand the potential issues when mapping two schemas and made the paper much easier to understand.


Review 2

Independent companies, colleges, and research groups use their own database schemas and semantics, which usually differ from each other. This phenomenon is called semantic heterogeneity. It occurs in many different settings, including XML documents, web services, and ontologies. Semi-structured data is more vulnerable to semantic heterogeneity because it is more flexible. The paper reviews semantic heterogeneity, including its importance, its difficulty, and current research progress.

Some of the strengths and contributions of this paper are:
1. The logical flow of the paper is clear. It shows the importance and difficulty of semantic heterogeneity, then presents solutions and states future work.
2. The paper summarizes most of the scenarios in which semantic heterogeneity may appear, like querying the deep web, merchant catalog mapping, and semi-structured data, which helps people become aware of the vulnerability in those fields.

Some of the drawbacks of this paper are:
1. The paper does not propose a specific system or tool to solve the problem it raises.
2. There are no experiments or supporting data for the problem, which makes it hard to tell how important the problem is and how the solutions can improve performance or reduce overhead.
3. The paper does not provide specific examples of semantic heterogeneity in real-world database schemas.



Review 3

Semantic heterogeneity refers to differences between database schemas for the same domain, which appear whenever there is more than one way to structure a body of data. Resolving semantic heterogeneity is crucial, and this paper reviews several common scenarios where it arises.

1). Enterprise Information Integration (EII)
Data in enterprises resides in multiple sources, which is problematic when it comes to sharing data between different parts of the organization. In addition, enterprises acquire many data sources through mergers and acquisitions. The two leading approaches have been data warehousing and building custom solutions. The former accesses stale data and cannot work across enterprise boundaries, while the latter is expensive, hard to maintain, and typically not extensible.
2). Querying and Indexing the Deep Web
Website designers model aspects of a given domain in widely varying ways, so crawlers cannot assume standard form-field names and structures as they crawl.
3). Merchant Catalog Mapping
In aggregating product catalogs, it is hard to figure out how to create mappings between merchants and retailers, and since there are subtle differences between product categories, there are multiple mappings that make sense.
4). Schema versus Data Heterogeneity
Heterogeneity occurs both in the schema and in the actual data values.
5). Schema Heterogeneity and Semi-structured Data
In many applications involving semi-structured data, the problem of semantic heterogeneity is not as serious as one might think; it is enough to reconcile only a specific set of attributes.

The paper then explains why this problem is hard: the data sets were developed independently, and therefore varying structures were used to represent the same or overlapping concepts. From a practical perspective, another reason is that reconciling schemas requires both domain and technical expertise. Next, the paper reviews recent progress in addressing the problem. The goal is to reduce the time it takes a human expert to create a mapping between a pair of schemas, and to let the expert focus on the hardest and most ambiguous parts of the mapping. The classes of heuristics that have been used for schema matching are: 1) schema element names, 2) data types, 3) data instances, 4) schema structure, and 5) integrity constraints.
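
As an illustration of the first heuristic class, a name-based matcher might combine a synonym table with string similarity. This is a minimal sketch; the synonym pairs below are hypothetical, and real matchers use richer linguistic resources.

```python
from difflib import SequenceMatcher

# Hypothetical domain synonyms a matcher might consult.
SYNONYMS = {("price", "cost"), ("author", "writer")}

def name_similarity(a: str, b: str) -> float:
    """Score two schema element names in [0, 1]."""
    a, b = a.lower(), b.lower()
    if a == b or (a, b) in SYNONYMS or (b, a) in SYNONYMS:
        return 1.0
    return SequenceMatcher(None, a, b).ratio()

print(name_similarity("ListPrice", "list_price"))  # high: nearly identical strings
print(name_similarity("cost", "price"))            # 1.0: known synonyms
```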

Finally, the paper points out two challenge areas, dealing with drastically larger schemas and dealing with vastly more complex data sharing environments, and summarizes the key opportunities: 1) larger schemas and schema search, and 2) managing dataspaces.

The main contribution of this paper is that it not only emphasizes the significance of semantic heterogeneity in industry and summarizes state-of-the-art approaches, but, more importantly, presents the lessons learned from past experience with schema matching and points out research gaps and future directions in resolving semantic heterogeneity.

The main advantage of this paper is that it is well structured and the author gives clear, systematic explanations, so readers can form a basic picture of the existing knowledge on semantic heterogeneity.

The main disadvantage is that the paper focuses only on schema heterogeneity, while data value heterogeneity remains important for web crawling and data consolidation.


Review 4

Problem & Motivation
Nowadays, many applications have their own databases and schemas. For multiple data systems to cooperate, they must understand each other's schemas. This is a common problem in many scenarios. For example:
1. EII: a company may want to integrate its many data sources.
2. Deep web: a keyword query must be translated into the form fields that match it within deep-web sources (e.g., a recipe database vs. a shallow recipe website).
3. Merchant catalog: different merchants may tag the same product with different names, and the same name may stand for different things.

However, recent research has not improved solutions to semantic heterogeneity very much. The problem is hard because data sets are developed independently and controlled by individual users without a universal standard. This causes several difficulties:
1. The same schema element is given different names in two schemas.
2. Attributes are grouped into different table structures in each schema.
3. One schema may cover aspects of the domain that are not covered by the other.
Providing standard schemas has had limited success, given that the incentives for data providers are weak and compliance is an additional cost for them.

Contributions:
The author points out that the goal in addressing semantic heterogeneity is to reduce the time it takes a human expert to create the mapping. The author summarizes the state-of-the-art techniques along several dimensions (a sketch combining these signals follows the list):
1. Schema element names: element names carry information, and synonymous names may refer to the same entity. The catch is the existence of homonyms.
2. Data types: matching schema elements should have compatible data types. However, data types are underspecified in many schemas.
3. Data instances: elements whose values share the same format or overlap heavily often match each other.
4. Schema structure: if two classes match each other, their children are likely to match as well.
5. Integrity constraints: if two attributes are known to be keys in their respective schemas, that provides additional evidence for their similarity.
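
Since no single signal is reliable on its own, matchers typically combine several. Below is a minimal sketch: the per-heuristic scorers are deliberately simplified stand-ins, and the weights are hand-picked, hypothetical values rather than anything from the paper.

```python
from difflib import SequenceMatcher

# Per-heuristic scorers in [0, 1]; simplified stand-ins for real ones.
def name_score(a, b):
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

def type_score(a, b):
    return 1.0 if a["type"] == b["type"] else 0.0

def instance_score(a, b):
    va, vb = set(a["values"]), set(b["values"])
    return len(va & vb) / len(va | vb) if va | vb else 0.0  # value overlap

# Hand-picked weights -- real systems tune or learn these.
WEIGHTS = [(name_score, 0.5), (type_score, 0.2), (instance_score, 0.3)]

def match_score(elem_a, elem_b):
    """Weighted combination of heuristic evidence for one correspondence."""
    return sum(w * f(elem_a, elem_b) for f, w in WEIGHTS)

a = {"name": "Publisher", "type": "str", "values": ["Knopf", "Penguin"]}
b = {"name": "publisher_name", "type": "str", "values": ["Penguin", "Tor"]}
print(round(match_score(a, b), 2))
```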

The author then points to the emerging trend of leveraging past experience via machine learning. He also names two big challenges for the near future: dealing with drastically larger schemas and dealing with vastly more complex data sharing environments.

Drawbacks:
Mainly structural issues.
Many parts are not well organized. For example, the Section 2 subsections "Schema versus Data Heterogeneity" and "Schema Heterogeneity and Semi-structured Data" do not seem to be scenarios of semantic heterogeneity. Many ideas are also repeated over and over, like the idea of the same schema element having different names.



Review 5

For better or worse, data storage needs are often handled by a variety of competing companies and organizations, many of which handle the same challenges in different, incompatible ways. In this case, database schemas in the same domain will usually differ, which is termed semantic heterogeneity, and it is exacerbated when systems are designed to handle semi-structured data, since they are designed to be more flexible to begin with. In order to have these systems work with each other, they must be able to understand each other’s schema, that is, this issue of semantic heterogeneity must be resolved. As this paper outlines, such a task can be quite difficult, but it does mention various efforts being undertaken in the field as well as the progress being made on this front.

It starts out by mentioning some areas where we might expect to face semantic heterogeneity. For example, large-scale enterprises often have data residing in many different sources, from database systems to legacy systems to ERP systems to XML files and feeds, and many of their daily tasks must gather data from multiple systems to piece together the desired output. This is a consequence of historical realities, such as independent business divisions devising their own solutions in a time when data sharing was not yet mandatory, as well as of mergers and acquisitions of other companies. Some solutions were implemented to deal with this challenge, like data warehousing and building custom solutions. Data warehousing often resulted in accessing stale data and could not work across enterprise boundaries, while custom solutions were expensive, hard to maintain, and hard to extend. There were also other solutions like EII, which queried multiple data sources in real time, but in any case, a key challenge is reconciling semantic heterogeneity. Enterprise data management is just one example of a situation where semantic heterogeneity is encountered; others include querying and indexing the deep web, and merchant catalog mapping, as in the case of Amazon's e-commerce operations. Additionally, heterogeneity can be found in the data values themselves as well as the schema, such as the different ways of naming the same company (IBM vs. International Business Machines). When data is semi-structured, semantic heterogeneity is more complicated: schemas for semi-structured data are much more flexible (i.e., more likely to vary), variations in optional attributes increase, and understanding them is critical to correctness.

What makes semantic heterogeneity so difficult to resolve is that the structures developed to represent data sets can vary wildly, even for the most seemingly straightforward domains. It can be hard enough for humans to resolve these differences, let alone programs. Understanding schemas often requires both technical and domain expertise, the latter of which is harder to represent symbolically in a standardized fashion since businesses almost always have different requirements. One proposed solution is to use standard schemas, which fails in all but the industries where there are strong incentives to create a common standard. Some state-of-the-art techniques use heuristics to perform schema matching, including the following:

- Schema element names
- Data types
- Data instances
- Schema structure
- Integrity constraints

These techniques still rely heavily on manual specification, but they provide value by saving human experts time when performing these mappings. One promising approach is to use previous knowledge to learn domain concepts and their representational variations, relationships between concepts, and domain constraints, among other valuable things that can help resolve semantic heterogeneity.
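
A minimal sketch of that "previous knowledge" idea: store the name variants seen in previously confirmed mappings per domain concept, then score new schema elements against those variants. The concepts and variants below are hypothetical; real learning-based matchers use far richer features than name similarity.

```python
from difflib import SequenceMatcher

# Name variants harvested from previously confirmed mappings (hypothetical).
PAST_MAPPINGS = {
    "price":  ["price", "list_price", "cost", "amount_usd"],
    "author": ["author", "writer", "created_by"],
}

def sim(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def classify(element_name, past=PAST_MAPPINGS):
    """Guess which known domain concept a new schema element corresponds to."""
    scored = {c: max(sim(element_name, v) for v in vs) for c, vs in past.items()}
    best = max(scored, key=scored.get)
    return best, round(scored[best], 2)

print(classify("BookPrice"))   # leans toward "price"
print(classify("AuthorName"))  # leans toward "author"
```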

The main strength of this paper is that it concisely summarizes the current situation, challenges, and state of the art approaches for semantic heterogeneity. While it does not provide any new research findings, it does serve a valuable role as a review article for people to quickly familiarize themselves in this area. Additionally, the paper proposes some future directions for resolving semantic heterogeneity that should be explored further, like working with larger schemas, developing a schema search engine, as well as working with dataspaces as opposed to just databases.

One weakness of this paper is that it is relatively brief in its treatment of the possible approaches to the problem, with more space dedicated to explaining the problem and its challenges. In particular, it just mentions the use of various heuristics. It would be much more valuable to highlight some specific systems or research efforts and review more comprehensively what they do, along with the merits and disadvantages of their approaches. Given that a lot of work has been done in the field, as the paper mentions, it should be doable to include at least a few specific examples.


Review 6

This paper talks about semantic heterogeneity, which is the disagreement of schemas on the same domain of data from different data sources. The paper begins with examples of where semantic heterogeneity comes up in real life. The first scenario is Enterprise Information Integration (EII), where systems must tap into multiple databases to get a full view of data objects. Typically, in large companies, the data sources were developed independently for targeted needs and must later be shared across the organization and integrated into one system. EII solutions might query multiple data sources in real time, where queries are translated into the terms of each data source and the partial results from each source are then combined. The paper introduces the idea of semantic mappings to reconcile the differences in schemas from different data sources. Semantic mappings are simply expressions that either translate data from one source to another while preserving semantics, or translate a query over one source into a query over another.
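
A minimal sketch of the query-translation flavor of semantic mappings, assuming a hypothetical attribute-rename dictionary and a trivially flat SELECT shape (real mappings must handle joins, nesting, and value conversions):

```python
# Hypothetical correspondence from a mediated schema to source S1.
ATTR_MAP_S1 = {"books": "Inventory", "title": "BookTitle", "author": "Writer"}

def translate(query_attrs, table, attr_map):
    """Rewrite a flat SELECT over the mediated schema into source terms."""
    cols = ", ".join(attr_map[a] for a in query_attrs)
    return f"SELECT {cols} FROM {attr_map[table]}"

# Mediated query: SELECT title, author FROM books
print(translate(["title", "author"], "books", ATTR_MAP_S1))
# -> SELECT BookTitle, Writer FROM Inventory
```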

The next scenario they introduce is querying and indexing the deep web, where the deep web is web data residing in databases. This data is much larger than the surface web and again challenges content providers to integrate data that is semantically the same but represented differently across sources. Next they talk about merchant catalogue mapping, where data from many different online merchants might be converted to a common schema that is the "best" for selling products. The paper then distinguishes schema versus data heterogeneity, the latter being when two of the same products or data items can be referred to in different ways, like "IBM" and "International Business Machines". They then discuss schema heterogeneity and semi-structured data, claiming that semi-structured data worsens schema heterogeneity because of the inherent flexibility of the schema, which makes it easier for representations to diverge as data is added and removed.

The author then explains why this challenge is so hard: the schemas come from different sources, and it is a hard problem even for humans to reconcile these differences. For a computer program the challenge is worse, because all it is given is two different data sources, with few insights into the semantics of either. They then list four categories of heterogeneous schemas: one schema is a new version of another, two schemas are later versions of a common original, many sources model the same aspects of a domain, and sources cover different domains but overlap in some way. The state of the art for reconciling schema heterogeneity is actually a human-assisted process based on a few heuristics that the article mentions, namely similar element names (synonyms/homonyms), compatible data types, similar data instances (values/patterns), similar schema structure, and known integrity constraints.
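
The structure heuristic can be sketched as a bonus applied to child elements once their parents already match. Everything below is hypothetical: the trees, the element names, and the flat 0.15 bonus are illustrative choices, not the paper's method.

```python
from difflib import SequenceMatcher

def sim(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical element trees for two schemas (parent -> children).
TREE_A = {"Book": ["title", "author", "list_price"]}
TREE_B = {"Publication": ["title_text", "writer", "price"]}

def child_scores(parent_a, parent_b, boost=0.15):
    """Parents matched => add a structural bonus to child name similarity."""
    return {(ca, cb): min(1.0, sim(ca, cb) + boost)
            for ca in TREE_A[parent_a] for cb in TREE_B[parent_b]}

best = max(child_scores("Book", "Publication").items(), key=lambda kv: kv[1])
print(best)  # the most plausible child correspondence under the bonus
```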

I did not like how at times this paper seemed to lack depth and almost seemed to repeat itself, especially around the different scenarios of data coming from different sources and needing to be reconciled. I did, however, like how the article was upfront about the state of the art requiring human assistance, which I have seen before: humans help notice the best way of translating data into a common domain.



Review 7

Why Your Data Won’t Mix: Semantic Heterogeneity

This paper surveys solutions for merging databases whose data is semantically heterogeneous, a situation that arises frequently in data aggregation. A bit of background on semantic heterogeneity: when database schemas for the same domain are developed by independent parties, they will almost always be quite different from each other. These differences are referred to as semantic heterogeneity.

This article begins by reviewing a few very common scenarios in which resolving semantic heterogeneity is important for building data sharing applications. First, Enterprise Information Integration (EII): in order for an enterprise to obtain a "single view of customer", it must tap into multiple databases; similarly, to present a unified external view of its data, either to cooperate with a third party or to create an external-facing web site, it must access multiple sources. Second, querying and indexing the deep web: this is necessary because deep web content is typically not indexed by search engines, whose crawlers cannot go past the forms. Third, merchant catalog mapping: we face the problem of creating mappings between thousands of merchants and a growing number of recognized online retailers. Fourth, schema versus data heterogeneity. Fifth, schema heterogeneity and semi-structured data.

The paper then explains why dealing with semantic heterogeneity is difficult. The fundamental reason it is so hard is that the data sets were developed independently, and therefore varying structures were used to represent the same or overlapping concepts. From a practical perspective, one reason schema heterogeneity is difficult and time consuming to resolve is that it requires both domain and technical expertise.

The paper also reviews recent research and commercial progress in addressing the problem, which uses the following information about schemas: schema element names, data types, data instances, schema structure, and integrity constraints. We can also use past experience, but this method is premature; one of the difficult parts is finding the relevant history. The information includes domain concepts and their representational variations, relationships between concepts, and domain constraints. Finally, the paper points out the key open problems and opportunities in this area: larger schemas and schema search, and managing dataspaces.

The contribution of this paper is that it summarizes several common scenarios in which resolving semantic heterogeneity is crucial for building data sharing applications and gives concrete explanations of the concepts.

The advantage of this paper is that it clearly explains why resolving semantic heterogeneity is hard, reviews recent research and commercial progress in tackling the problem, and points out the key open problems and opportunities in the area.

The disadvantage of this paper is that it focuses mostly on schema heterogeneity, while in truth heterogeneity occurs not only in the schema but also in the actual data values themselves.


Review 8

In this paper, the author mainly talks about semantic heterogeneity, which means the differences in database schemas for the same domain. The presence of semi-structured data exacerbates semantic heterogeneity, because semi-structured schemas are much more flexible to start with.

The paper first reviews some scenarios in which semantic heterogeneity can arise. Enterprise Information Integration: enterprises today increasingly face data management challenges that involve accessing and analyzing data residing in multiple sources; many data systems were developed independently for targeted business needs, but when business needs changed, data had to be shared between different parts of the organization. Querying and indexing the deep web: the deep web refers to web content that resides in databases and is accessible behind forms. Merchant catalog mapping: a retailer accepts feeds of products from thousands of merchants, each trying to sell their goods online.

The paper then discusses why resolving data heterogeneity is so hard. The main reason is that data sets were developed independently, and therefore varying structures were used to represent the same or overlapping concepts. A practical reason resolving heterogeneity is difficult and time consuming is that it requires both domain and technical expertise.


Hence, semantic heterogeneity needs to be resolved at the step where the data provider exposes its data to its counterparts. The paper then talks about state-of-the-art solutions. The goal is to reduce the time it takes a human expert to create a mapping between a pair of schemas, and to let the expert focus on the hardest and most ambiguous parts of the mapping. Another idea has been explored over the past few years in several academic research settings: using machine learning as a mechanism for enabling a schema matcher to leverage previous experience.

This paper mainly talks about why the problem of data heterogeneity is hard and how we might address it with ML techniques. It also gives some good examples of the problem and of scenarios people encounter.


Review 9

This paper discusses semantic heterogeneity, the problem of database schemas for the same domain being different because they are created by different parties. We briefly discussed this problem early in the semester when we were talking about XML. The main focus of the paper is on heterogeneity in schemas, although it is possible to find it in data values as well. The primary way this problem is solved is semantic mappings, i.e., mappings from one schema to another, which are traditionally provided by a human with significant domain expertise.

The paper discusses some application areas where semantic heterogeneity needs to be resolved, for instance online catalogs and exploring the deep web. Heuristics have been explored, such as schema element names, data types, data instances, schema structure, and integrity constraints, and the best systems have combined these together. However, most real-life applications still rely on manual schema mappings. One idea introduced is to leverage “past experience” (previous schema mappings) to create new schema mappings - this idea has recently achieved some success. Finally, future directions are introduced.

Probably the most interesting idea comes at the end of the paper, with the idea that we will face data management challenges for a “dataspace” not a database. In the world of big data and machine learning, the data itself is of increasing importance, and most of it is unstructured. Being able to make this data understandable in some way is very important, and recognizing this in 2005 was valuable.

This paper is short and does not directly introduce any new innovations. It primarily states the problem, shows what has been done in the past, and gives some ideas for the future. I think that it was presented well, but I am surprised that it has received so many citations, as nowadays I think something like this would be better suited to a blog post than a journal paper - at least if it hadn’t had a well known author.




Review 10

This paper examines a common but hard-to-solve problem in the database field: semantic heterogeneity. Semantic heterogeneity arises whenever there is more than one way to structure a body of data; the differences between these representations are referred to as semantic heterogeneity. The paper reviews common situations where semantic heterogeneity occurs, explains why the problem is difficult to solve, and provides several suggestions.

One of the most common scenarios is enterprise information integration, where an enterprise wants to access multiple data sources. The data systems are usually developed independently for targeted business needs, so it is hard to query all of them in a consistent way. Neither data warehousing nor custom-code solutions are satisfying. One good solution is translating queries on the fly into appropriate queries over the individual data sources. This solution needs semantic mappings, and in reality this process is time consuming, labor intensive, and error prone. The problem is exacerbated when a company needs to deal with semi-structured data. Reconciling schema heterogeneity is hard since it requires both domain and technical expertise; if we want a computer to solve the problem for us, it becomes even harder, as the computer lacks prior knowledge.

There are two current trends for solving this problem. The first tries to reduce the time it takes a human expert to create a mapping between a pair of schemas and to let the expert focus on the hardest and most ambiguous parts of the mapping. It utilizes attribute names, types, data instances, schema structure, etc. to find some matchings and leaves the hard ones to human operators. It also offers visual interfaces that let designers draw lines between elements of disparate schemas while hiding the details of the mapping. The other idea is utilizing past experience: machine learning techniques learn from past matching experience and predict a new schema mapping.
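
The "leave the hard ones to human operators" workflow can be sketched as a confidence triage over tool-proposed matches. A minimal sketch, assuming hypothetical candidate pairs, scores, and hand-picked thresholds:

```python
# Hypothetical (pair, confidence) output from an automatic matcher.
candidates = [
    (("title", "BookTitle"), 0.96),
    (("price", "Cost"), 0.81),
    (("author", "Publisher"), 0.33),
]

ACCEPT = 0.9   # hand-picked thresholds; real tools expose these as knobs
REVIEW = 0.4

auto_accepted = [pair for pair, s in candidates if s >= ACCEPT]
needs_review  = [pair for pair, s in candidates if REVIEW <= s < ACCEPT]

print("accepted automatically:", auto_accepted)
print("shown to the expert:", needs_review)  # the hard, ambiguous middle
```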

At the end of the paper, the author makes two proposals: a schema search engine and dataspaces. The schema search engine takes schema elements or fragments as input and returns a ranked list of schema elements that are candidate matches. A dataspace is a set of individual data sources and the relationships between them; it should be able to model any kind of relationship between sources and support new semantic mappings as necessary.
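
The schema-search proposal reduces, in its simplest form, to ranked retrieval over a corpus of indexed schema elements. A minimal sketch with a hypothetical corpus and a name-similarity ranking (a real engine would index structure and instances too):

```python
from difflib import SequenceMatcher

# Hypothetical corpus of (schema, element) pairs the engine has indexed.
CORPUS = [("crm", "CustomerName"), ("billing", "cust_name"),
          ("billing", "invoice_total"), ("hr", "employee_name")]

def search(fragment, corpus=CORPUS, k=3):
    """Return the top-k indexed elements ranked by name similarity."""
    def score(entry):
        return SequenceMatcher(None, fragment.lower(), entry[1].lower()).ratio()
    return sorted(corpus, key=score, reverse=True)[:k]

print(search("customer_name"))  # ranked candidate matches across schemas
```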

In my opinion, this paper does a great job of explaining why the semantic heterogeneity problem occurs. However, regarding the author's proposals, I don't see how they help solve the problem: they are just descriptions of and requirements for a system, not a way to actually build one.




Review 11

In the paper "Why Your Data Won't Mix: Semantic Heterogeneity", Alon Halevy discusses the concept of Semantic heterogeneity. Semantic heterogeneity refers to the differences encountered when database schemes for the same domain are developed by independent parties. Almost always, these schemes will be quite different from one another. This is prevalent in situations where data is unstructured such as XML documents, web services, etc. Thus, database schemes grow even more disjoint and it is getting harder for different databases to communicate with one another. This paper is a review of common scenarios where resolving semantic heterogeneity is vital to the function of applications. Furthermore, one could infer that this is quite a difficult problem - there are still no clear cut solutions which result in many open problems. Thus, this is both an interesting and important field to discuss.

The paper is divided into several sections:
1) Semantic heterogeneity scenarios: One common scenario is Enterprise Information Integration. To obtain a single view of a customer, it is necessary to access multiple data sources to get the full picture, and these sources may not be semantically consistent with one another due to business mergers or changing business needs. Data warehousing and custom code are possible solutions, but the former is prone to accessing stale data and the latter is expensive. EII was offered as a solution that computes queries on the fly and reconciles the different semantics with semantic mappings, expressions that translate between data sources while preserving their semantics. This leads into the second scenario: a programmer wants to scrape the deep web, but sources on the deep web do not adhere to standard web development semantics, so the problem becomes difficult, since the number of semantic mappings needed can exceed 500. The last scenario is a very common occurrence at Amazon: mapping search queries to the names of products. One can imagine that there isn't a single correct mapping that always works; usually products with the "best" name get noticed and sell.
2) Why it is a hard problem: There are many overlapping concepts with no unified model that all cases fall under. It is quite difficult even for humans to reconcile two schemas, because we all think in different ways, and it becomes orders of magnitude harder for programs to make sense of these differences. Thus, it becomes clear that semantic heterogeneity needs to be resolved at the step where data providers expose data to their counterparts.
3) The state of the art: A completely automated solution is not possible. The goal is to reduce the time it takes an expert to create a semantic mapping. The process is separated into two phases: schema matching and schema mapping. To further assist experts, we look at helpful information: schema element names, data types, data instances, schema structure, and integrity constraints.
4) An emerging solution: Schema mapping tasks are often very similar to ones done in the past, so leveraging past experience may boost the speed at which we resolve semantic heterogeneity. This is the appeal of machine learning. With machine learning, there are three kinds of knowledge we care about: domain concepts and their representational variations, relationships between these concepts, and domain constraints. However, this direction is still in its infancy and has not seen much work.

Much like other papers, this paper has some drawbacks, most of which stem from Section 5, Looking Forward. One drawback I noticed is the claim that current techniques are not useful for the large schemas present in businesses. I disagree with this, especially since machine learning is a great tool for dealing with large database schemas. Companies like Amazon have figured out ways to deal with semantic inconsistencies, even if it means not being fair to all their products. The second drawback I see is Halevy's call for a schema search engine. While I feel it is a good thing for a program trying to resolve differences between databases, it is not as good for a human expert, who may be given a biased view of the attributes. Lastly, this isn't really a paper that discovered something new; it is simply a retrospective paper (it was a boring read).


Review 12

This paper describes the problems related to semantic heterogeneity among data sets. Semantic heterogeneity refers to the difference between schemas among different data sets that refer to similar data. This can make merging datasets difficult. A possible way to deal with this is with a semantic mapping, which maps each individual dataset to a common structure. However, this requires a mapping to be manually defined for each dataset.

One example of this issue occurring is with the deep web. Much of the data available is unsearchable, since it doesn't have a common structure. Each database uses different schemas for its data, so a search engine can't easily find data matching a given set of input parameters.

A separate but related issue is data heterogeneity. This refers to the same data being represented in different forms; for example, an entity can be referred to by its full name or a shortened version, and the two might not be easily recognized as the same data.
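
A minimal sketch of one way to handle this, using the IBM vs. "International Business Machines" example cited in the paper. The alias table below is a hypothetical hand-curated lookup; real systems use fuzzy matching and entity resolution on top of such tables.

```python
# Hypothetical alias table for reconciling heterogeneous data values.
CANONICAL = {
    "ibm": "IBM",
    "international business machines": "IBM",
    "intl. business machines": "IBM",
}

def normalize(value: str) -> str:
    """Map known spelling variants onto one canonical value."""
    return CANONICAL.get(value.strip().lower(), value)

assert normalize("International Business Machines") == "IBM"
assert normalize(" IBM ") == "IBM"
print(normalize("Acme Corp"))  # unknown values pass through unchanged
```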

The issue of heterogeneity tends to be exacerbated by semi-structured data, since such data has a less defined schema, making it harder to find a common representation for different data types.

Differing schemas are hard to resolve effectively, since doing so requires knowledge of both the different schemas and what each schema actually means. This makes it especially difficult to do programmatically, as programs don't have knowledge of the true meanings of different data fields. As such, most schema resolution is done by a human with the help of programmatic heuristics. The heuristics often work by matching field names, data types, overall structure, or constraints. For greatest effectiveness, several of these should be used in tandem.

A useful way of matching schemas looks at which parts of schemas have been matched together in the past. This involves looking at the major concepts present in the schema's domain, as well as the relationships between these concepts. This, too, is more naturally done by a human, but can be machine assisted.

The benefits of this paper are:
It’s easy to understand the problem.
It’s easy to understand why this problem isn’t easy to solve.
It makes a good case for human machine interaction, especially since most algorithms are entirely mechanical.

The downsides of this paper are:
There’s not as much work put into solo-machine schema unification.
There’s not very much detail on concrete implementations of schema unification.



Review 13

Database schemas are usually developed by independent parties, and they differ from each other. This difference is called semantic heterogeneity. Generally, whenever there is more than one way to structure a body of data, there will be semantic heterogeneity. This paper reviews common scenarios in which resolving semantic heterogeneity is crucial for building data sharing applications. It also explains why resolving semantic heterogeneity is difficult and reviews recent research addressing the problem.
The first scenario in which resolving semantic heterogeneity is crucial is enterprise information integration. Enterprises need to access and analyze data residing in multiple sources; for example, in order to obtain a single view of a customer, they must tap into multiple databases. Data from different sources is developed independently, and therefore semantic heterogeneity arises, and reconciling it is key. Typically, semantic heterogeneity is reconciled by semantic mappings, which specify how to translate data from one data source into another in a way that preserves the semantics of the data. However, specifying a semantic mapping takes effort.
Another scenario is querying and indexing the deep web. Deep web content is typically not indexed by search engines, which means the data comes with no accompanying explanation, and the sheer magnitude of deep web content makes the problem more pressing. The challenge stems from the very wide variety in the ways website designers model aspects of a given domain, so designers of web crawlers cannot assume standard form field names and structures as they crawl. Beyond that, simply accessing the deep web is a challenge.
The third scenario is merchant catalog mapping. One example where semantic heterogeneity causes problems is aggregating product catalogs. On the back end, a merchant's data is stored in a local schema that is highly likely to differ from the one prescribed by the retailer. The problem then becomes creating mappings between thousands of merchants and online retailers. Heterogeneity occurs not only in the schema but also in the actual data values themselves; for example, there may be multiple ways of referring to the same product.
So why is it so hard to reconcile schema heterogeneity? One reason it is difficult and time consuming is that it requires both domain and technical expertise. When reconciling heterogeneity across thousands of web forms, there are additional sources of heterogeneity. Some argue that standard schemas will solve the problem; however, experience shows that standards have had limited success.
To solve the problem, people have tried building semiautomated schema matching systems by employing a variety of heuristics. The heuristics include schema element names, data types, data instances, schema structure, and integrity constraints.
The advantage of this paper is that it not only introduces where the problem occurs but also covers some approaches to solving it and their limitations. However, it would be better to present concrete solutions; the paper seems to ask a lot of questions without answers.




Review 14

“Why Your Data Won’t Mix: Semantic Heterogeneity” by Alon Y. Halevy gives an overview of the semantic heterogeneity problem, why it is challenging, state of the art heuristics for enabling some automation of schema matching, newer learning approaches (leveraging past/partial schema matchings), and major challenges going forward. Semantic heterogeneity is the issue of data existing at multiple sources and represented through differing schemas and/or values. Applications often want to use data from different sources, but first must understand how data from different sources align; they need to either convert all data to a common schema, or identify an appropriate query for each data source to retrieve semantically related data. Examples of discrepancies between schemas include: a) different element names for the same concept, b) different table/nesting structures, and c) one schema containing data that another does not. The paper notes semi-structured data as a particularly challenging form of data, where semantically related data is virtually guaranteed to be in different forms. Reconciling schema heterogeneity is hard because 1) only humans understand the true semantics of data, but 2) it can be time-consuming and technically challenging for them to reconcile themselves. Kinds of heuristics that have been explored for schema matching include: matching schema element names, matching element data types, matching element values, matching schema structure, and considering integrity constraints. An approach that has been explored more recently is leveraging AI/machine learning by using some manually created schema matchings to inform likely/potential matchings for other schemas/parts of schemas. Looking forward, the author discusses 2 major challenges they see in the space of reconciling semantic heterogeneity: 1) supporting huge schemas (and solution searches over them), and 2) supporting dataspaces (complex data sharing environments with numerous data sources and their relationships).

It was great to see an overview of the schema matching problem and researched approaches. Unrelated, at several points in the paper, the author acknowledges that reconciling semantic heterogeneity really cannot be done completely automatically by a machine. It is only humans (in particular, domain experts) who know the semantics of the data, how different schemas align, and what would be appropriate mappings to a reconciled schema. This is something I strongly believe. Machines can help humans by automatically identifying potential schema matchings (therefore speeding up the matching process), but only humans know which matchings are right.

I think it could have been nice to see an overview table of existing approaches and systems (both research and commercial), what kind of data (and how much) they support schema matching for, and what specific work they do and how they involve the user. This would give me a better sense of what problems/domains are well-addressed in this area, and which need more work.


Review 15

This paper mainly talks about how data from multiple data sources is difficult to mix together because of semantic heterogeneity. Semantic heterogeneity happens wherever there is more than one way to structure a body of data: people have different views on how to structure their data, even when dealing with the same domain. So in order for multiple data systems to cooperate, without turning them into a tower of Babel, we need to confront the problem of semantic heterogeneity.

The paper gives several scenarios in which semantic heterogeneity becomes an issue. The first is Enterprise Information Integration: companies often deal with data from multiple sources, while data systems are developed independently for targeted business needs. Several approaches exist in this scenario, including data warehousing, custom-code solutions, and querying multiple data sources in real time. The second scenario is querying and indexing the deep web. The deep web contains content that resides in databases and is accessible behind forms; it is estimated to hold one to two orders of magnitude more content than the surface web. Since there is no standard web form design, it is not easy to design a web crawler for it. The third scenario is merchant catalog mapping. Large online retailers often ask merchants to upload their merchandise information in a specific schema, but in local storage the merchants likely use a different schema. The fourth is data heterogeneity: data values can differ even when the schema is the same, e.g., abbreviations and aliases.

The paper also discusses why it is difficult to deal with semantic heterogeneity. The first reason is that data systems are developed independently. The second is that understanding the data and handling the problem usually require domain and technical expertise. The third is that it is challenging for programmers to write the code. State-of-the-art solutions are then introduced: the paper lists classes of heuristics to show how schema matching works. Finally, leveraging past experience is discussed.

I think the paper is actually a good one. It gives detailed scenarios in which semantic heterogeneity becomes an issue, which is really helpful for readers who haven't been aware of the problem to understand how it happens and what we can do about it. I learned a lot from the paper and found it really interesting.

However, the paper's shortcomings are also clear to me. Its structure is not well organized; for example, "schema heterogeneity and semi-structured data" is not a scenario, based on my understanding.




Review 16

In this paper, the author provides an introduction to semantic heterogeneity. He reviews several scenarios where resolving semantic heterogeneity is important, explains why the problem is hard, introduces several state-of-the-art techniques for solving it, and points out key problems and opportunities in the area. Semantic heterogeneity refers to the differences in database schemas developed by independent parties for the same domain. The goal of this paper is to review existing resolution methods and to identify key problems and potential opportunities for improvement. It is definitely an important issue in the DBMS field, because DBMSs must understand each other's schemas in order to cooperate. Next, I will summarize the key points of the paper as I understand them.

The problem of semantic heterogeneity is inevitable because of the slightly different situations people face and the fact that different people think differently. Examples include enterprise data, deep web data, and merchant catalogs. There are several levels of heterogeneity. First, there is schema heterogeneity, meaning different databases might use different schema names for the same set of concepts. There is also data heterogeneity, meaning the same object is referred to by different names, for example IBM and International Business Machines. Semi-structured data is essentially another form of schema heterogeneity, but the fact that arbitrary new tags can be added makes variations more likely to appear. Next, the paper explains why the problem is hard: the mapping between one schema and another might not be one-to-one, as an element in one schema might map to none, one, or a set of elements in the other (see the sketch below). Currently, people develop tools to aid the mapping process; these tools suggest matches to a domain expert. Heuristics are built using knowledge from 1) schema element names, 2) data types, 3) data instances, 4) schema structure, and 5) integrity constraints. The paper also introduces a new trend in this research area: machine learning, since the mapping process can be seen as a learning problem when merging a large set of similar schemas. Finally, the author shares his predictions for future challenges: larger schemas and more complex data sharing environments.
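
A minimal sketch of a one-to-many correspondence, using a hypothetical name-splitting example (not from the paper): a single element in schema A maps to a set of elements in schema B, so the mapping needs code in both directions rather than a simple rename.

```python
# One-to-many correspondence: schema A's "name" maps to two elements of B.
def a_to_b(record_a):
    first, _, last = record_a["name"].partition(" ")  # naive split
    return {"first_name": first, "last_name": last}

def b_to_a(record_b):
    return {"name": f"{record_b['first_name']} {record_b['last_name']}"}

print(a_to_b({"name": "Alon Halevy"}))
print(b_to_a({"first_name": "Alon", "last_name": "Halevy"}))
```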

Although this is a review-like paper, it still has many strengths, and I learned a lot from it. I think its main contribution is that it provides several great insights into the schema resolution field: it summarizes the critical points of semantic heterogeneity and points out opportunities in the area, making it an important guide for researchers. Besides, the structure of this paper is very clear; each section gives a clear description and several examples, which makes it very easy to understand. Some of the insights have come true nowadays, like using machine learning techniques for schema resolution: people can use representation learning to generate feature vectors for entities and match among them.

Since this paper is short and just a review of semantic heterogeneity, there is little to blame. I would have liked to see more detail about the state-of-the-art techniques for resolving semantic heterogeneity; the paper gives only a high-level idea of the solutions and does not go deeper into their details. Besides, the paper doesn't provide any novel techniques or components; it is just a review of existing work, and although the author shares some insights, it doesn't provide any concrete solution for them. I want to see a more concrete treatment of the potential opportunities. I also think some of the predictions in this paper have proven false, like the one about large schemas.



Review 17

This paper is a survey that has 2 goals:

1) Make clear that semantic heterogeneity is an issue, and a difficult one to solve at that. Semantic heterogeneity describes the dilemma faced by people comparing different database schemas (or even data itself) that try to describe the same thing: due to different semantics, it is almost impossible to prove that any two fields in a schema (or data points) **really** mean the same thing, which matters if you're trying to develop a system that compares fields in a schema to determine whether they contain the same class of information. The paper gives multiple examples of scenarios where this can occur; for example, a company may have information stored in multiple databases that were each developed independently, or you may be dealing with semi-structured data or data that can be represented in a huge variety of ways (such as web forms). I think the sentence "Differing structures are a byproduct of human nature – people think differently from one another even when faced with the same modeling goal" describes it perfectly: because this is a result of something we can only attribute to "human nature", it is extremely difficult to truly model correctly with a program.

2) Introduce a number of approaches people have taken to deal with semantic heterogeneity. For example, one way to deal with enterprise information integration is to maintain a semantic mapping that explicitly maps one field name to its matching field name in another database. This is extremely time-consuming and scales very poorly. Another solution is to enforce standard schemas and data representations; this is simply unrealistic and really only possible in smaller, highly controlled environments. Another solution presented is a semi-automated schema matching system that uses heuristics such as name and structure similarity to highlight possible matches, but this is more of an amelioration than a solution. A final observation is that machine learning is applicable to this problem, leveraging past mappings to create better future mappings.

The paper’s main contribution is its highlighting of a very real problem (semantic heterogeneity). I also found the temporal / ML aspect suggestion to be interesting. The main weakness of the paper was that I felt that this is a fairly obvious & intuitive issue and for most of the paper, I was thinking “right I already had assumed all of this was true”. This might be unfair of me as not every paper has to be about something I’ve never known before, I guess.