In the schema matching problem, given two data schemas, one must find matching pairs of similar items (one from each schema). Schema matching is an important part of data integration, where data from multiple sources is combined into a single unit that can be accessed all at once. Schema matching is difficult to automate, because there are many ways of naming any schema item, and many structures in which any schema could be organized. Fully automatic schema matching is not currently feasible in general, but it is possible to program better computational assistants for domain experts.
The Cupid system is a schema matcher which uses many types of features in combination to perform “mapping discovery.” In the mapping discovery problem, given two schemas, the system must return a set of pairs of related items, one in each schema. The mapping is successful if it is rated as good by an expert user. Cupid uses linguistic features of schema item names, values of instance data, and tree structure comparisons to evaluate items’ similarity to each other. Cupid aims to beat the performance of prior works like DIKE and MOMIS by aggregating many features together through similarity scoring functions, and by using its novel tree-matching algorithm to judge structural similarities.
The authors’ main contributions are a new mapping discovery system called Cupid, a description of its innovative design elements such as the bottom-up tree structure matcher, and experiments showing how Cupid outperforms DIKE and MOMIS on both toy problems and real-world schemas. Cupid works by first applying a category tagger to each item, after which only items within the same category will be compared (a form of pruning). A linguistic matcher assigns a similarity score to each pair of items in each category, after tokenizing and abbreviation expansion on item names. Finally, a structure matcher compares items on the similarity of their descendants in the schema tree, with an emphasis on leaves. Pairs of items with high overall similarity scores are considered matches.
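The pipeline summarized above (category pruning, then linguistic and structural scoring) can be sketched roughly as follows. The scoring functions, equal weighting, and threshold are illustrative assumptions, not Cupid's exact formulas:

```python
def name_similarity(n1, n2):
    # Placeholder linguistic matcher: token overlap on lowercased names.
    t1, t2 = set(n1.lower().split("_")), set(n2.lower().split("_"))
    return len(t1 & t2) / len(t1 | t2) if t1 | t2 else 0.0

def subtree_similarity(a, b):
    # Placeholder structural matcher: overlap of child-name sets.
    c1, c2 = set(a.get("children", [])), set(b.get("children", []))
    if not c1 and not c2:
        return 1.0  # two leaves are structurally identical
    return len(c1 & c2) / len(c1 | c2) if c1 | c2 else 0.0

def match_schemas(schema_a, schema_b, threshold=0.5):
    """Toy sketch of a Cupid-style pipeline: prune by category, then
    combine linguistic and structural similarity (illustrative only)."""
    matches = []
    for a in schema_a:
        for b in schema_b:
            if a["category"] != b["category"]:   # category pruning
                continue
            wsim = 0.5 * name_similarity(a["name"], b["name"]) \
                 + 0.5 * subtree_similarity(a, b)
            if wsim >= threshold:
                matches.append((a["name"], b["name"], round(wsim, 2)))
    return matches
```

A pair whose names share tokens and whose children overlap scores high on both components and survives the threshold, mirroring the "high overall similarity" criterion described above.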
One weakness of the Cupid paper is that it is difficult to tell whether Cupid performs better than DIKE and MOMIS in general, or only on the toy problems selected by the authors. There does not seem to be a standard benchmark data set or evaluation metric for schema matching, so objective comparisons in performance are difficult.
What is the problem addressed?
Match is a schema manipulation operation that takes two schemas as input and returns a mapping that identifies corresponding elements in the two schemas. Match is such a pervasive, important and difficult problem that it should be studied independently.
1–2 main technical contributions? Describe.
They make a key observation: the problem of schema matching is so hard, and the useful approaches so diverse, that only by combining many approaches can we hope to produce truly robust functionality. Therefore, they propose a new schema matching algorithm, called Cupid, which combines a number of techniques: linguistic matching, structure-based matching, constraint-based matching, and context-based matching. The first phase, called linguistic matching, matches individual schema elements based on their names, data types, and domains. Linguistic matching proceeds in three steps: normalization, categorization, and comparison. The second phase is the structural matching of schema elements based on the similarity of their contexts or vicinities. In the third phase (mapping generation), a mapping is created by choosing pairs of schema elements with maximal weighted similarity.
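The weighted similarity used in the final phase can be sketched as a linear blend of the linguistic and structural coefficients. The default weight below is an illustrative assumption, not the paper's tuned value:

```python
def weighted_similarity(lsim, ssim, w_struct=0.5):
    """Blend linguistic (lsim) and structural (ssim) similarity,
    both assumed to lie in [0, 1]. w_struct is a tunable weight."""
    return w_struct * ssim + (1.0 - w_struct) * lsim

# Raising w_struct favors pairs whose subtrees match even when
# their names differ, e.g. weighted_similarity(0.2, 0.9, w_struct=0.7)
# gives 0.69 rather than the equal-weight 0.55.
```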
1–2 weaknesses or open questions? Describe and discuss.
A main drawback of this paper is the validation. Partly because it is nearly impossible to formalize the input of schema matching, the authors do not provide any theoretical guarantees for their algorithm. I also think the main contribution of the paper is not the algorithm itself but the insight of isolating schema matching as a problem in its own right.
This paper presents a new algorithm called Cupid, which is designed for generic schema matching. In short, Cupid discovers mappings between schema elements based on their names, data types, constraints, and schema structure. The algorithm builds on previous solutions but exploits some new techniques, including the integrated use of linguistic and structural matching, context-dependent matching of shared types, and a bias toward leaf structure. The paper first defines the schema matching problem. Second, it presents some past solutions and a taxonomy of schema matching techniques. Then it introduces the new match algorithm, Cupid, along with its design details. Finally, it presents experiments and future work.
The problem here is that schema matching is very critical in applications (XML message mapping, data warehouse loading, schema integration), however, there is no good general solution. There are some challenges for schema matching. First, schemas have different structures and naming even for identical concepts. Second, schemas may model similar but not identical content. Also, schemas may use similar words to have different meanings. Another problem is that there are many past solutions but they have some shortcomings. Therefore, this paper presents a new algorithm Cupid for generic schema matching.
The major contribution of the paper is that it summarizes a taxonomy of matching techniques and presents a new algorithm Cupid. By doing performance evaluation on a real world example and comparing it with past algorithms, it shows that Cupid gives better performance. Here we will summarize key elements of Cupid:
1. automated linguistic-based matching
2. both element-based and structure-based
3. biased toward similarity of atomic elements
4. exploits internal structure
5. exploits keys, referential constraints and views
6. context-dependent matches of a shared type definition
7. generates 1:1 or 1:n mappings
One interesting observation: this paper is very innovative, and the idea of building a new algorithm on top of past solutions is appealing. It is a good idea to review and summarize past solutions, since better ideas often come from refinements. This paper can serve as a good example, as well as a good reference, for generic schema matching.
In this paper, the authors propose a new algorithm, Cupid, for the problem of schema matching. They begin by describing the problem and presenting a taxonomy of past solutions, briefly describing previously implemented tools and the problems with each. Then they propose their algorithm, which uses many different techniques together to find the best match. Finally, they compare Cupid with two other schema matching tools.
I think an important advantage of Cupid compared to previous tools is that it is a general-purpose algorithm, which means it can be used in any context, data model, or application. Cupid uses several techniques, including linguistic matching, structure matching, and matching of schema trees. This helps Cupid find really complicated mappings that the other algorithms were not able to detect.
Although Cupid works well and the results exceeded my expectations, I think it can still be improved or extended with other techniques, such as machine learning applied to data instances and pattern matching.
This paper addresses the issue that the authors of the last paper highlighted: that it is very difficult to automate the matching of schemas.
The authors anticipated this being more than just a system for mapping XML or JSON schemas - they intend it to be used for any 'model' - be it a database schema, XML schema, UML model, workflow, or website map.
The authors discuss the many techniques used in matching. These include schema- vs. instance-based matching: does the system look just at the schema, or does it need access to the actual data as well? There are also linguistic-based matches that use the names of elements within the schemas. These can be dramatically improved by using stemming and tokenization, synonyms and hypernyms, matching substrings, expanding abbreviations, etc.; essentially, almost all of the techniques from Information Retrieval can be applied here. There are also individual vs. combinatorial matchers: an individual matcher uses one algorithm to perform the matching, while a combinatorial matcher runs multiple independent match algorithms on the two schemas and combines the results.
The Cupid approach is linguistic-based matching, and it is both element- and structure-based. It exploits keys, referential constraints, and views, so we can say that it is a combinatorial matcher. The linguistic matching is a large portion of the Cupid approach, and it does quite a bit of manipulation to make matches happen. For example, it could take something like POLines, break it into PO and Lines, and then expand it further into Purchase, Orders, and Lines. It then categorizes these words and applies a similarity function to get a numeric score. The system also uses structure-based matching. It uses a post-order traversal to get a uniquely defined result for each tree. The authors considered using a top-down traversal, but this was too optimistic and performed poorly when the top levels of the schemas differed.
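The name manipulation described above can be sketched as a two-stage tokenize-then-expand pass. The abbreviation table here is a tiny hypothetical stand-in for Cupid's thesaurus:

```python
import re

# Hypothetical abbreviation table standing in for Cupid's thesaurus.
ABBREVIATIONS = {"po": ["purchase", "order"], "qty": ["quantity"]}

def tokenize(name):
    """Split a schema element name on case boundaries, digits, and punctuation."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+|\d+", name)
    return [p.lower() for p in parts]

def expand(tokens):
    """Replace known abbreviations with their full-word expansions."""
    out = []
    for t in tokens:
        out.extend(ABBREVIATIONS.get(t, [t]))
    return out

# "POLines" -> ["po", "lines"] -> ["purchase", "order", "lines"]
```

The expanded token lists, rather than the raw names, are what the similarity function then compares.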
This paper does a good job of showing how combining and expanding on existing techniques can create a dramatically better solution. The authors didn't really have to invent anything new; instead, they combined the algorithms that many had used before them and applied techniques from Information Retrieval to schema matching to create a viable product.
I had two problems with this paper:
- One, I feel like this is a problem that we shouldn't have. Enterprises should manage their data better, enforce data and schema rules, and avoid replicating data. I understand that this will probably never happen.
- My second gripe is that, as they mention multiple times in the paper, there is no way to empirically compare results from schema matching. While they can show us their results, it really comes down to whether or not the system makes the same matchings that I would make if I were to map the schemas manually.
Problem and Solution:|
Many fields, including XML message mapping, data warehouse loading, data translation, and schema integration, require schema matching. However, it is challenging because even for the same concepts, different schemas have different structures and names. It is also possible that similar names refer to different concepts. Currently, schema matching is done manually with limited help from tools. The problem posed in the paper is to compute the mapping of elements between two schemas programmatically. There are many different matching techniques and published implemented systems, but each system uses only a few matching techniques, and none of them provides a complete, general schema matching component. The proposed solution is the Cupid algorithm. The algorithm is schema-based and shares some general approaches with other algorithms. Cupid can match both XML and relational schemas. Compared to two other schema matching systems, Cupid is more general and performs better.
The main contribution of the paper is the proposed Cupid approach. Cupid is schema-based with linguistic matching, and it exploits internal structure, keys, and so on. Linguistic matching is the first phase, based on the names of elements. Names are normalized, categorized, and compared in this stage. Afterward we have a table of linguistic similarity coefficients between elements of the two schemas. Structure matching is the second phase, used to find nodes with similar structures. Finally, the mapping comes from the similarities computed in the first two phases. By combining the similarities of two main factors, the mapping can be more accurate. And since the two factors exist in all schemas, the approach provides a general-purpose solution.
The biggest weakness of the paper is that the algorithm is implemented without integration with other techniques in the field. The comparison and experiments in the paper show that the Cupid algorithm performs better than other stand-alone algorithms that solve the same problem, but it is not clear whether this algorithm is applicable in a real system.
The schema matching problem is: given two input schemas in any data model, plus auxiliary information and an input mapping, compute a mapping between schema elements of the two input schemas that passes user validation. Schema matching is critical in many applications, such as XML message mapping, data warehouse loading, and schema integration.
The paper introduces a generic schema matching algorithm, Cupid, which improves on other schema matching algorithms and can be applied to many different application domains. It solves the problem by computing similarity coefficients between elements of the two schemas and then deducing a mapping from those coefficients, which lie in the [0, 1] range.
Linguistic matching matches individual schema elements based on their names, data types, domains, etc. There are three steps in linguistic matching: normalization, categorization, and comparison. The result of this phase is a table of linguistic similarity coefficients between elements in the two schemas. The structure matching phase then takes the result of the linguistic matching (lsim) to calculate ssim, and from ssim and lsim the weighted similarity wsim is computed.
The paper introduces a schema matching algorithm that is generic and integrates many other algorithms. It exploits linguistic, data-type, structural, and referential integrity information, and it can provide robust functionality.
Cupid is a robust algorithm for schema matching. The paper also says that Cupid will take advantage of auxiliary information, which means it will use past match information to tune the system. But when the matching rules change (the input changes), the matching system will still be using past matching information, which is not robust to input changes. Also, feedback based on past matching information can lead to a "system configuration saturation" problem, a typical feedback-tuning issue: as time passes, new feedback makes less and less change to the current system.
It is extremely common for databases to have differing schemas for their data, even if the data in both databases belongs to the same domain. Typically, someone who wants to integrate this data will need to find a semantic mapping: a way to translate one data schema into the other. When there are a large number of data sources, as in a website that tries to integrate information about different products, these mappings may be too time consuming to produce by hand. The authors of this paper propose CUPID, a system that tries to create semantic mappings automatically. There are many different techniques that can be used for schema matching: looking for linguistic similarities, data type similarities, comparing data ranges and uniqueness, etc. Many prototype matchers focus on implementing one technique to solve a subset of matching problems, but CUPID leverages many techniques to act as a general-purpose schema matcher.
CUPID first normalizes schema attribute names, and tries to categorize the attributes into multiple groups, based on semantic categories, data types etc. An attribute can be in multiple categories, and these categories can be used to compare the similarity of two attributes. CUPID doesn’t just compare attribute-by-attribute; instead, it tries to view the schema as a tree. This allows CUPID to judge two elements’ similarity not just by the similarities of the actual elements, but also the similarity of child elements. This allows CUPID to try to match whole schema trees to each other.
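The tree-based comparison can be sketched as a leaf-biased similarity: two interior elements are judged by how many of the leaves under them find a strong counterpart under the other. The threshold and scoring below are illustrative assumptions, not the paper's exact TreeMatch formula:

```python
def leaves(node):
    """Collect the leaf names under a schema-tree node (dict with 'name'/'children')."""
    if not node.get("children"):
        return [node["name"]]
    out = []
    for child in node["children"]:
        out.extend(leaves(child))
    return out

def structural_similarity(a, b, leaf_sim, strong=0.6):
    """Fraction of leaves in either subtree with a strong match (leaf_sim >=
    strong) in the other subtree. Illustrative, not the exact TreeMatch."""
    la, lb = leaves(a), leaves(b)
    hits_a = sum(1 for x in la if any(leaf_sim(x, y) >= strong for y in lb))
    hits_b = sum(1 for y in lb if any(leaf_sim(x, y) >= strong for x in la))
    return (hits_a + hits_b) / (len(la) + len(lb))
```

Because only leaves contribute, two schemas that nest the same atomic fields differently can still score as highly similar, which is the point of the leaf bias.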
CUPID is a first of its kind in trying to combine different techniques to solve the schema matching problem. It is more flexible and versatile than previous prototypes of schema matchers.
In the last paper we read about how many automated techniques cannot eliminate the need for human interaction for schemas with hundreds of rows, because it is more likely that a larger more complex schema will be harder to match automatically. Does CUPID do anything to alleviate this problem?
The paper proposes a new algorithm called Cupid for the generic schema matching problem. The algorithm basically combines different techniques from the past. The paper claims that applying the previous approaches individually is ineffective for the schema matching problem; Cupid aims to provide a more robust solution by combining these techniques into a single framework.
Cupid solves the schema matching problem by calculating similarities between schema elements so that it can suggest mappings between elements that are similar to each other. Cupid views a schema as a graph, so the problem of schema matching becomes measuring the similarity between two graphs by comparing their individual nodes and also their structure as a whole. I think their interpretation is reasonable and intuitive, but it seems that Cupid only works with trees and cannot handle graphs with cycles.
The similarity between elements is calculated with two matching metrics: linguistic matching and structure matching. Linguistic matching compares two elements by a thesaurus lookup. Structure matching consults the ancestors and children of the subtrees, using their similarities to calculate the similarity of nodes in the tree. The use of a thesaurus for calculating the similarity between two elements is understandable, but in turn the similarity measure is limited by the capability of the thesaurus being used. Technical terminology and jargon may not be captured and compared correctly with a thesaurus.
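A thesaurus-backed linguistic comparison can be sketched as a synonym-set lookup with a substring fallback. The synonym table and the fallback scoring are illustrative assumptions, not Cupid's actual thesaurus or formula:

```python
# Hypothetical synonym groups standing in for the thesaurus.
SYNONYMS = [{"bill", "invoice"}, {"quantity", "qty", "amount"}]

def linguistic_similarity(a, b):
    """Score two element names: 1.0 for equal names or thesaurus synonyms,
    otherwise the longest-common-substring fraction (illustrative)."""
    a, b = a.lower(), b.lower()
    if a == b or any(a in g and b in g for g in SYNONYMS):
        return 1.0
    # Fallback: longest common substring relative to the longer name.
    best = 0
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            best = max(best, k)
    return best / max(len(a), len(b))
```

The review's criticism shows up directly here: a domain term missing from `SYNONYMS` falls through to the substring fallback, which may score true synonyms near zero.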
In their experiments, Cupid has limitations: it does not match some schema elements correctly. I do not understand why the authors designed their algorithm to be solely schema-based. The authors promote their algorithm as a combination of past approaches for a robust solution, yet it is still limited by being schema-based. For a more sophisticated and accurate matching algorithm, I think a matcher should look at the actual data of the two schemas as well, i.e., be instance-based. The paper does not seem to justify this design choice, and I wonder why.
In conclusion, the paper proposes a new schema matching algorithm called Cupid. Through the description of the algorithm, readers can see how the matching problem is solved in general. It would have been better if Cupid were also instance-based, but there are other instance-based algorithms for schema matching, and readers who are interested should look at them instead.
This paper introduces a new schema matching algorithm called Cupid. Cupid is designed with the intent of being a generic algorithm that works across many different schemas. The Cupid algorithm is composed of two passes: a linguistic matching pass and a structural matching pass. Afterwards, Cupid outputs possible mappings between schemas. One of the key components of the linguistic matching pass is the use of a thesaurus. The thesaurus contains abbreviations and alternate forms of words, allowing Cupid to identify that two fields with different names may actually refer to the same object. The second phase of Cupid determines the similarity of schema elements based on a variety of factors, such as subtree or data type similarity.
I thought Cupid was weak in several respects. The first is the use of the thesaurus. It seemed to me that the thesaurus needs to be generated by a human before Cupid runs and is not produced automatically. For small schemas this would work well, but for larger schemas the thesaurus may become unreasonably large to initialize manually. The second weakness is that I do not think the experimental results accurately reflect real-world scenarios. They tested a few cases, but I would have liked to see them test against large and small schemas, as well as schemas from two very different organizations with a similar need for a certain schema. An example might be the Amazon customer database and a database of credit card owners. The last weakness, which the paper itself mentions, is the lack of machine learning. If a generic algorithm is to be made, I think it should definitely use machine learning. Doing so would also help the algorithm match schemas similar to ones it has seen before even faster. Furthermore, the thesaurus could be created automatically from training runs with some user feedback.
This paper focuses on schema matching, which can be a solution for the semantic heterogeneity problem we saw before. It first presents some previous techniques and shows their drawbacks; it then proposes "Cupid", a new algorithm that does schema matching by considering the names, data types, constraints, and schema structure of the schema elements.
The Cupid algorithm utilizes techniques from multiple domains to solve the schema matching problem. It applies natural language processing such as tokenization, abbreviation and acronym expansion, elimination, concept tagging, schema element clustering, linguistic similarity, and so on. Also, it considers the structure of hierarchical schemas and computes the structural similarities of schema elements; the TreeMatch algorithm is performed to achieve this.
The contributions of this paper include that:
1. It discusses the schema matching problem and studies some previous approaches of solving this problem.
2. It proposes a new algorithm, Cupid, that mainly leverages linguistic and structural information to improve on previous methods.
3. It gives some examples to illustrate the Cupid algorithm and show its effectiveness.
Some future improvement of this paper:
According to the "Why Your Data Won’t Mix" paper, a better solution to the schema matching problem should leverage past experience, that is, to consider past experience and apply some matching learning techniques to it. The Cupid algorithm might be improved by integrating such information.
This paper also discusses the case where the schemas are not in tree structures, pushing the algorithms to more general forms. For example, a more general form of a schema is a graph, where a node can have more than one parent. The natural language processing part is not affected in this case, while the structure discovery part is. The paper describes the modifications to the Cupid algorithm for this setting in detail.
Purpose of Generic Schema Matching with Cupid:
Schema matching is important for applications such as XML message mapping, data warehouse loading, and schema integration. This paper examines algorithms for generic schema matching, and proposes Cupid, which finds mappings between schema elements based on names, data types, constraints, and schema structure. Cupid is innovative because of its integrated use of linguistic and structural matching, context-dependent matching of shared types, and bias towards leaf structures where a lot of the schema content resides.
Details on Cupid:
Cupid starts by computing similarity coefficients between elements of the two schemas and then finding a mapping from those coefficients. Coefficients are calculated in two phases. The first phase is linguistic matching, which matches individual schema elements based on their names, data types, domains, etc., resulting in a linguistic similarity coefficient (lsim) between every two elements. The second phase is structural matching, based on the similarity of contexts or vicinities, resulting in a structural similarity coefficient (ssim). An example of this is that Line is mapped to ItemNumber because they have a matching parent, Item, and the other two children of Item match. The weighted similarity (wsim), a weighted mean of lsim and ssim, is used to choose pairs of schema elements in the mapping generation phase.
From the experiments, linguistic matching of schema element names leads to useful mappings, and the thesaurus plays a crucial role in it. With linguistic similarity alone and no structural similarity, Cupid cannot differentiate between instances of a single XML attribute in multiple contexts. Single classes might be nested or normalized differently in different schemas; using the leaves of the schema tree for structural similarity computation allows Cupid to match similar schemas with different nesting. Because mappings are reported in terms of leaves, a query discovery module can generate correct queries for data transformations. Better matching is achieved by using structure information beyond the immediate vicinity of a schema element, and the construction of schema trees generates context-dependent mappings, which are useful when inferring different mappings in different contexts for the same element. Auto-tuning is an ongoing problem that requires a robust solution, because tuning performance requires expert knowledge. User interaction is critical in schema matching, and one of Cupid's drawbacks is that it currently restricts user interaction to the initial mappings supplied at the beginning of the matching procedure.
Strengths of the paper:
I enjoyed reading the paper because it provides a concise overview of Cupid. I found the integration of linguistics with this problem an interesting cross-field connection. It was also interesting to see how they look across layers, at parent and child nodes, to decide how semantically similar fields are.
Limitations of the paper:
I would've liked to see the paper provide quantitative experimental results for how Cupid compares with other algorithms on real-world data. I would've also liked to see the paper describe where Cupid is being used in the world today.
The paper "Generic Schema Matching with Cupid" tackles the general problem of automatically matching schemas. This is essentially the problem of taking two different sets of data, labeled or structured in different ways, and discovering which values and relations in the different representations correspond to the same information. This is clearly an important problem, and the authors list XML message mapping, schema integration, and data warehouse loading as potential applications. Cupid attempts to take the strengths of other techniques and combine them into a more complete method for schema matching.
This paper's advantages become evident in its ninth section but are described earlier. The survey-like nature of the paper is useful for introducing the previous approaches in this area of research. We can then see Cupid's main contributions, which include the use of a more advanced linguistic matching algorithm and structural matching that is biased toward the leaves of a schema. Finally, the comparison to the MOMIS and DIKE systems provides clear insight into how well Cupid works and what is left to be done.
The drawbacks of this paper are seen in how the algorithm chooses to match schemas. Writing an algorithm by hand to decide how to match is imperfect and requires lots of tweaking of hyperparameters. The authors did suggest a machine learning approach in their future work. This would presumably require many schema pairs with the matching elements labeled. Data like this might not be that difficult to collect, and it would be interesting to see how such an approach compares to the methods presented in this paper. The linguistic matching algorithm could also be improved. WordNet does not contain domain-specific words; depending on the domain, it may help somewhat or not at all. The idea of matching schema elements to some base concept is referred to as entity linking in the semantic web, language understanding, and named entity recognition communities. There are machine learning methods for this as well that may be applicable to schema matching.
Part 1: Overview
This paper analyzes schema matching, which is crucial in XML message mapping, data warehouse loading, and schema integration. In particular, the authors analyze generic schema matching using Cupid, a novel algorithm that discovers mappings between schema elements based on names, data types, constraints, and schema structure. Match is treated as a key operation of a general-purpose system for managing models, including XML, UML, and many other structures. The goal of schema matching is, given two input schemas in any data model and, optionally, auxiliary information and an input mapping, to compute a mapping between schema elements of the two input schemas that passes user validation.
The Cupid approach, besides being generic, includes automated linguistic-based matching. It is both element- and structure-based. It is biased toward the similarity of atomic elements, where much of the schema semantics can be captured. Internal structure can also be utilized, as well as foreign keys and other referential constraints. Contexts are taken into account when matching. Cupid can generate one-to-one or one-to-many mappings. Linguistic matching includes normalization, categorization, and comparison. Tree matching captures the similarities between structures. Referential constraints are supported via RefInt: Cupid augments the schema tree and discovers the relationships implied by foreign keys.
Part 2: Contributions
They integrated linguistic and structural matching, which was a novel idea at the time. Normalization is done by tokenization, abbreviation expansion, and tagging. Categorization is used to cluster the elements to reduce the number of comparisons. Schema elements are categorized by their concept tags, data types, and containers. The similarity between names is calculated by matching substrings.
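The categorization step described above can be sketched as bucketing elements by a category key so that only same-bucket pairs are ever compared. Using the data type as the key is an illustrative assumption (Cupid also uses concept tags and containers):

```python
from collections import defaultdict
from itertools import product

def bucket_by_category(elements):
    """Group schema elements (dicts with a 'dtype' key) into category buckets."""
    buckets = defaultdict(list)
    for e in elements:
        buckets[e["dtype"]].append(e)
    return buckets

def candidate_pairs(schema_a, schema_b):
    """Yield only pairs whose elements share a category, pruning the rest."""
    ba, bb = bucket_by_category(schema_a), bucket_by_category(schema_b)
    for key in ba.keys() & bb.keys():
        yield from product(ba[key], bb[key])
```

With k roughly equal buckets, this cuts the number of pairwise comparisons by about a factor of k compared with the full cross product.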
Real-world examples are tested in the evaluation experiments, and Cupid outperforms DIKE and MOMIS. It turns out that mining element names with linguistic matching is effective, whereas DIKE and MOMIS only allow identical names.
Part 3: Drawbacks
Linguistic matching here is still based on predefined heuristics and can be inaccurate. Similarity between tokens is determined by the length of their common substrings, which misses words that have totally different spellings but the same meaning.
The paper discusses Cupid, an algorithm that discovers mappings between schema elements based on their names, data types, constraints, and schema structure, using a broader set of techniques. The problem is that schema matching is challenging for several reasons. For example, even schemas for identical concepts may have structural and naming differences; schemas may model similar but non-identical content; and schemas may use similar words to mean different things. To address these challenges, Cupid offers a new match algorithm that uses more powerful techniques and is generic across data models and application areas.
The algorithm computes similarity coefficients between elements of the two schemas and then deduces a mapping from those coefficients. The coefficients are calculated in two phases. The first phase, called linguistic matching, matches individual schema elements based on their names, data types, domains, etc. The authors use a thesaurus to help match names by identifying short forms (Qty for Quantity), acronyms (UoM for UnitOfMeasure), and synonyms (Bill and Invoice). The result is a linguistic similarity coefficient between each pair of elements.
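The thesaurus lookup can be sketched as below. The table entries mirror the paper's examples (Qty/Quantity, UoM/UnitOfMeasure, Bill/Invoice), but the data structures and the particular scores are our own assumptions.

```python
# Hypothetical sketch of thesaurus-assisted name matching. The abbreviation
# table and synonym scores are illustrative assumptions.

ABBREVIATIONS = {"qty": "quantity", "uom": "unitofmeasure"}
SYNONYMS = {frozenset({"bill", "invoice"}), frozenset({"ship", "deliver"})}

def normalize(name: str) -> str:
    token = name.lower()
    return ABBREVIATIONS.get(token, token)  # expand known short forms

def lsim(a: str, b: str) -> float:
    a, b = normalize(a), normalize(b)
    if a == b:
        return 1.0                      # identical after expansion
    if frozenset({a, b}) in SYNONYMS:
        return 0.8                      # thesaurus synonym, slightly penalized
    return 0.0                          # no linguistic evidence

print(lsim("Qty", "Quantity"))   # 1.0
print(lsim("Bill", "Invoice"))   # 0.8
print(lsim("Line", "Quantity"))  # 0.0
```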
The second phase is the structural matching of schema elements based on the similarity of their contexts or vicinities. For example, Line is mapped to ItemNumber because their parents, Item, match and the other two children of Item already match. The structural match depends in part on linguistic matches calculated in phase one.
The main strength is that the work was done at a time when there were few effective algorithms for schema matching. I like the fact that it laid the groundwork for future work.
The main limitation is that the paper was published when unstructured data wasn't that popular and manual matching was prevalent. As a result, it would be great to evaluate the algorithm on current workloads to see its effectiveness.
This paper presents an overview of existing schema matching techniques and introduces the authors' new algorithm, Cupid. First, the paper makes clear, as the last paper did, that the schema matching problem is important and challenging. It is challenging because schemas from different sources contain all kinds of confusions, mismatches, and traps for matching. Existing techniques for matching schemas can be categorized: the paper provides a taxonomy and looks at the performance of some implementations, including SEMINT, DELTA, LSD, SKAT, TranScm, DIKE, and ARTEMIS. They all use different combinations of the existing techniques.
Cupid is the algorithm proposed by the authors of this paper. They argue that the former solutions are all incomplete: each exploits at most a few of the techniques in the taxonomy. Since the problem is so complex and nondeterministic, we need to apply as many techniques as necessary to make the solution robust. The proposed solution therefore has the following properties: 1. it includes automated linguistic-based matching; 2. it is both element-based and structure-based; and so on. In summary, it goes through several phases: 1. linguistic matching; 2. structural matching; 3. mapping generation. The algorithm looks at clues such as language similarity, structural similarity, and item matches.
It is vital that the paper sets a proper goal for automatic schema matching: given two input schemas in any data model and, optionally, auxiliary information and an input mapping, compute a mapping between schema elements of the two input schemas that passes user validation. Because schema matching is inevitably subjective, it is questionable whether one can ever obtain the single best matching, and that framing really makes a point.
The strength of this method over the machine learning algorithms mentioned as future work in the last paper is clear: we don't need a large amount of training data.
It is a rule-based matching algorithm. As I said, it has a clear advantage over machine learning methods. However, the fixed weights over the applied sub-algorithms do not leverage past experience, so machine learning still has a role to play when the matching algorithm runs over time and over a large number of cases.
In this paper, the authors discuss what kinds of techniques can be applied to the schema matching problem and present their matching tool, Cupid.
The paper first looks at orthogonal criteria that categorize matchers: how a matcher behaves, whether it looks at instance data, whether it uses auxiliary information, and so on.
Then, after going through the shortcomings of earlier matchers, they present a simplified version of Cupid. This version goes through three main stages to generate a schema map. The first stage is linguistic matching: words with similar meanings are found and similarities are computed based on linguistic analysis. The second stage takes the structure of the schema tree into account, computing how similar two trees are and mapping them based on this similarity. Finally, a mapping is generated based on the previous stages.
Next, the paper extends these results so that schemas with more general structures, such as graphs, can use the algorithm.
I think this paper has several contributions. The first is that it established orthogonal criteria for matchers: using these criteria, someone who wants to build a new matcher can decide what he wants to achieve before development actually begins. The second is that they showed some important algorithms one might need to build a matcher.
In the latter half of the paper, a few enhancements are introduced. I wonder how those enhancements work together. What happens if they do not agree with each other? I didn't find information addressing this problem.
This paper introduces Cupid, an algorithm that can match schemas with each other across data models. Schema matching is commonly agreed to be subjective, which makes it a difficult problem to solve.
Elements belonging to similar categories first go through a linguistic matching attempt, i.e., the name similarity of the elements is computed using the thesaurus with the help of synonyms and hypernyms, after which a linguistic similarity coefficient is calculated. In the next phase, two elements are matched in terms of structure. A weighted average of these scores is then used to come up with the final mappings. One of the unique things Cupid does is a bottom-up search rather than a top-down search, which, even though it is more expensive, matches moderately varying schemas better than the top-down approach.
One of the positive points about Cupid is that it does not get thrown off by different nestings of similar elements, since it searches up from the leaves. Another positive point is the idea of dividing elements into categories based on keywords in their names, thereby leaving slightly fewer elements to match than comparing all schema elements from one schema to another.
However, I am not sure Cupid does anything beyond aggregating the separate mechanisms of its predecessors, except that it can map context-dependent elements. As the authors themselves mention, including a machine learning component that used mappings from many previous samples could have greatly improved the results.
That said, considering the intractability of schema matching, this definitely looks like a solution that can be built upon, in case semi-structured data becomes relevant anytime in the future.
Generic Schema Matching with Cupid paper review
In this paper, the authors introduce the schema matching problem: the process of finding correspondences between pairs (or larger sets) of elements of two schemas that refer to the same concepts or objects in the real world. The problem is hard because, in most cases, the same domain is represented differently in two systems. There were many unsuccessful solutions in the past, and an orthogonal taxonomy for them is: schema- vs. instance-based, element vs. structure granularity, linguistic-based, constraint-based, matching cardinality, auxiliary information, and individual vs. combinational.
The main contribution of this paper is a new schema matching algorithm, called Cupid, which combined a number of techniques: linguistic matching, structure-based matching, constraint-based matching, and context-based matching. Most of the later approaches to schema matching have used this hybrid matcher approach, which leverages different criteria to arrive at suggested correspondences.
However, the major drawback is that the Cupid algorithm itself is not a complete solution to the generic schema matching problem; I would say it is rather an improvement on previously existing approaches. As the paper's discussion of semantic heterogeneity suggests, I strongly believe that if a system added a machine learning component to group all possible schemas into different clusters, it could better understand the elements in each schema and make more accurate matchings for variants of the same domain.
Matching is a schema manipulation operation that takes two schemas as input and returns a mapping that identifies corresponding elements in the two schemas. Schema matching is a critical step in many applications such as XML message mapping and data warehouse loading. In this paper, the authors introduce a new algorithm named Cupid. It finds mappings between schema elements based on their names, data types, constraints, and schema structure.
Cupid proceeds in three steps:
(1) Linguistic matching matches individual schema elements based on their names, data types, domains, etc., and computes lsim, the linguistic similarity coefficient: a value in the range 0 to 1 between pairs of schema element names, one from each schema.
(2) Structural matching matches schema elements based on the similarity of their contexts or vicinities, and computes a similarity coefficient between element pairs that accounts for the similarity of related elements.
(3) Mapping generation takes wsim, the weighted similarity computed from structural matching, to derive the individual mapping elements.
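The weighted similarity combination in step (3) can be sketched as follows. The linear combination follows the wsim described in the paper (a weighted mix of lsim and the structural score), while the particular weight and acceptance threshold below are illustrative choices, not the paper's values.

```python
# Sketch of the weighted similarity used in mapping generation. The weight
# and acceptance threshold are illustrative, not the paper's tuned values.

def wsim(lsim: float, ssim: float, w_struct: float = 0.5) -> float:
    """wsim = w_struct * ssim + (1 - w_struct) * lsim, all in [0, 1]."""
    return w_struct * ssim + (1 - w_struct) * lsim

def is_match(lsim: float, ssim: float, threshold: float = 0.6) -> bool:
    """A pair is proposed as a mapping element if wsim clears the threshold."""
    return wsim(lsim, ssim) >= threshold

# Names barely match, but the surrounding context is strong:
print(wsim(lsim=0.2, ssim=0.9))      # ≈ 0.55
print(is_match(lsim=0.5, ssim=0.9))  # True
```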
The paper presents a generic schema matching algorithm that can apply to many different data models and application domains. It is schema-based and uses linguistic and structural matching, context-dependent matching of shared types, and a bias toward leaf structure. The output mapping is fed back into the structure matcher as an input hint.
Because Cupid is a schema-based algorithm, not instance-based, it is not well suited to schema-later data. Also, the experiments only compare Cupid with other schema-based strategies; it would be better to also compare Cupid with some instance-based schema matching algorithms and analyze the advantages and disadvantages.
The authors present their solution to the problem of creating schema mappings. They first discuss the problem and existing approaches before presenting their solution. The first thing to note about the authors' solution is that it appears to combine all of the discussed existing approaches.
Besides using a combination of existing techniques, the authors normalize and categorize elements to reduce the number of comparisons they need to make (while still allowing for flexibility in the element name).
While reading the sections on tree and graph matching, I couldn't help but remember our earlier discussion about the similarity of XML databases to CODASYL.
This paper went into depth about its solution to the schema-matching problem, but I felt there was little insight in the solution proposed by the authors. Additionally, I am skeptical about the evaluation: such comparisons are likely biased, since the authors would naturally ensure that their solution performs well on the selected schemas, and there isn't conclusive evidence that it also performs well in general.
This paper introduces the schema matching problem and the proposed algorithm, Cupid, for solving it. In the schema matching problem, we are given two schemas in separate databases and we want to create a mapping between them where certain elements in schema one are related to certain elements of schema two. For example, if we have two tables called Items, and in one we have the columns Line, Qty, and Uom while in the other we have ItemNumber, Quantity, and UnitofMeasure, the matching should return that Line and ItemNumber are related, Qty and Quantity are related, and Uom and UnitofMeasure are related.
The authors then propose the Cupid algorithm, which they believe is a general purpose schema matching algorithm. Cupid’s mappings have the following properties:
1. It uses automated linguistics based matching.
2. It takes into account both element and structure-based mappings.
3. It is biased towards the similarity of the leaf elements.
4. It uses the internal structure of the database to figure out potential mappings.
5. It exploits foreign keys in tables to figure out relationships between tables.
6. It can either generate 1 to 1 mappings or 1 to N mappings.
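Point 5 above (exploiting foreign keys) can be illustrated with a toy sketch; the table names, column names, and data structure below are hypothetical examples, not anything from the paper.

```python
# Toy sketch of using declared foreign keys to propose inter-table
# relationships. All identifiers here are hypothetical examples.

FOREIGN_KEYS = {
    ("OrderLine", "order_id"): ("Order", "id"),
    ("Order", "customer_id"): ("Customer", "id"),
}

def related_tables(table: str):
    """Tables directly reachable from `table` via declared foreign keys."""
    return sorted({ref_table for (src, _), (ref_table, _) in FOREIGN_KEYS.items()
                   if src == table})

print(related_tables("Order"))      # ['Customer']
print(related_tables("OrderLine"))  # ['Order']
```

A matcher can treat such referential links as extra structural edges, so two schemas whose tables are connected in the same way gain similarity even when their table names differ.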
The following are the positives of this paper:
1. It provides an algorithm for an open problem in computer science that can be used in the real world. Many of the considerations in the algorithm's design will actually occur in practical applications.
2. After explaining the algorithm, they provide a real world example, which shows how well the algorithm runs compared to other matching algorithms.
Overall, the paper does a solid job of describing the Cupid algorithm and why it is better than the existing matching algorithms, but I still have some concerns about the paper:
1. In the real-world example section, they presented only one example. We do not know whether this example is biased toward Cupid, making the algorithm look better than the others.
2. The authors did not compare the running times of these algorithms. Is the running time of Cupid comparable to the other algorithms so that the advantages are worth it?
This paper discusses Cupid, a tool for generic schema matching. Matching schemas is an important operation in many applications, but there are several key obstacles that make automating this process very difficult. Schema matching programs are faced with the challenge of finding a mapping between two schemas that were most likely designed by different people with different intentions. Elements that represent the same information may have slightly different structures in different schemas and may also have different or abbreviated titles.
Cupid employs several strategies to solve the schema matching problem. After converting the schema to a schema tree, the first step Cupid employs is linguistic matching. In this phase, Cupid breaks up element names into a set of tokens, expands abbreviations and acronyms, then assigns concept tags to elements based on their token set. Cupid then compares the concept tags, data types, and position within the schema structure to assign a linguistic similarity score to each pair of elements. In the second phase, Cupid performs structure matching. While many other schema matching programs compare the structures of schemas using a top-down approach, Cupid uses a bottom-up structure matching algorithm. Leaf elements are assigned initial similarity scores based on linguistic and data type similarity. Non-leaf elements are compared on the same traits as well as on the similarity of the subtrees rooted at the elements being compared. Similarity scores of lower elements can be recalculated based on how similar their ancestors are in order to better capture the similarity of the intermediate structure.
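A much-simplified sketch of this bottom-up idea follows; leaves are scored by a leaf-level similarity, and inner nodes by how many of their subtree leaves have a strong counterpart. The feedback step Cupid performs (recalculating leaf scores from ancestor similarity) is omitted, and the toy leaf similarity is our own assumption.

```python
# Simplified bottom-up structure matching: compare two subtrees by the
# fraction of their leaves that have a strong counterpart. Cupid's mutual
# readjustment of leaf scores from ancestor similarity is omitted here.

from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)

def leaves(node):
    if not node.children:
        return [node]
    return [leaf for c in node.children for leaf in leaves(c)]

def structural_sim(a, b, leaf_sim, threshold=0.5):
    """Fraction of leaves (across both subtrees) with a strong counterpart."""
    la, lb = leaves(a), leaves(b)
    linked_a = sum(1 for x in la if any(leaf_sim(x.name, y.name) > threshold for y in lb))
    linked_b = sum(1 for y in lb if any(leaf_sim(x.name, y.name) > threshold for x in la))
    return (linked_a + linked_b) / (len(la) + len(lb))

# Toy leaf similarity: first-letter match, a stand-in for linguistic matching.
first_letter = lambda x, y: 1.0 if x[0].lower() == y[0].lower() else 0.0

# Differently nested schemas still compare well, since only leaf sets matter:
s1 = Node("PO", [Node("Item", [Node("Qty"), Node("UoM")])])
s2 = Node("Order", [Node("Quantity"), Node("UnitOfMeasure")])
print(structural_sim(s1, s2, first_letter))  # 1.0
```

Because only the leaf sets are compared, the extra `Item` nesting level in `s1` does not hurt the score, which is the behavior the comparative study below highlights.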
In a comparative study with DIKE and MOMIS-ARTEMIS, it was found that Cupid performed better in several key categories. Because of Cupid’s recursive calculation of leaf similarity scores based on ancestor similarity, it performed much better when matching schemas where data elements were nested in different ways. The top down approaches of the other systems failed here because they recognized elements as dissimilar as soon as the structures diverged, causing them to prune some viable options from their list of match candidates. Cupid also performed better when matching schemas that had high numbers of referential constraints between tables.
While I feel that the authors adequately displayed that Cupid is superior to several comparable systems, I would have liked to see the quality of the final output mapping. While it may perform better than other systems, it is by no means perfect. I would have liked to see how close Cupid’s matching was to a manually-created matching and how much manual correction had to be performed in the final generation of the mapping.
This paper proposes Cupid, a method to discover mappings between schema elements based on their names, data types, constraints, and schema structure, using a broad set of techniques. Schema matching is a critical step in many semi-structured applications, such as XML message mapping, data warehouse loading, and schema integration. Therefore, this paper proposes a new method called Cupid to deal with this problem.
First, the paper introduces the schema matching problem. Matching is a schema manipulation operation that takes two schemas as input and returns a mapping that identifies corresponding elements in the two schemas. Schema matching is challenging for many reasons: for example, schemas may have structural and naming differences, and they may use similar words to mean different things. These issues make schema matching a difficult problem.
Second, the paper proposes its algorithm for schema matching, called Cupid. Like previous approaches, Cupid attacks the problem by computing similarity coefficients between elements of the two schemas and then deducing a mapping from those coefficients. In addition, Cupid has some unique properties that make it better than previous approaches: it includes automated linguistic-based matching, and it is both element-based and structure-based. These properties make Cupid more general and yield better results for schema matching.
The strength of this paper is that it proposes Cupid and provides a detailed description of it, including the algorithm and comparisons with previous methods. The detailed discussion makes the paper stronger at conveying its ideas to readers.
The weakness of the paper is that it provides few examples when describing the Cupid algorithm. Since Cupid is a new approach, I think it would be better to provide more examples illustrating its ideas; this would make it easier for readers to grasp the key points of the method.
To sum up, this paper proposes Cupid, a method to discover mappings between schema elements based on their names, data types, constraints, and schema structure.
This paper is an introduction to Cupid, an algorithm for schema matching. Schema mapping is important and very common between different data sets within an organization, or between similar data sets that might communicate with each other. The paper claims that, at the time, schema matching was done manually by domain experts using tools that at best detected exact matches.
The breakthrough with Cupid is that it uses natural language processing techniques to match schemas. It looks at synonyms and other relations between words and gives each pair of items a similarity score; they call this linguistic matching. The second phase of the algorithm, called structural matching, finds similarities between two elements based on the similarities of their vicinities. The example in the paper is that Deliver and Invoice both have children City and Street in one schema, and Ship and Bill have children City and Street in the other. Using structural matching, the City and Street children of Invoice match up with the City and Street children of Bill (rather than Ship) in the other schema because of the similarity of their parent nodes (Invoice to Bill).
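That disambiguation step can be sketched roughly like this; the synonym scores and the selection rule are toy assumptions for illustration, not the paper's actual computation.

```python
# Toy illustration of context-dependent disambiguation: when a leaf name
# ("City") appears under several parents, prefer the candidate whose parent
# is the better linguistic match. Scores here are illustrative assumptions.

SYNONYM_SCORE = {frozenset({"invoice", "bill"}): 0.9,
                 frozenset({"deliver", "ship"}): 0.9}

def parent_sim(p1: str, p2: str) -> float:
    a, b = p1.lower(), p2.lower()
    if a == b:
        return 1.0
    return SYNONYM_SCORE.get(frozenset({a, b}), 0.0)

def disambiguate(parent: str, candidates):
    """Pick the (parent, leaf) candidate whose parent best matches `parent`."""
    return max(candidates, key=lambda c: parent_sim(parent, c[0]))

# "City" under Invoice maps to the City under Bill, not the one under Ship:
print(disambiguate("Invoice", [("Bill", "City"), ("Ship", "City")]))
# → ('Bill', 'City')
```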
The paper goes more in depth into the math behind the algorithm, but I think those two phases are the main concepts you need to understand Cupid.
My main concern with this paper was that it left a decent amount on the table, such as the data types of each element and any patterns that occur within an element. It touched on data types briefly, but I think that is crucial information for schema matching (not just element titles). Basically, my main gripe is that all this algorithm seemed to look at was linguistic elements of the schema, which I think can be insufficient and sometimes even misleading. I am also not fully convinced that this is a major problem that needed to be addressed. Lastly, I might be wrong about this, but it doesn't seem to use any breakthrough technology; it seems they took relatively simple techniques from the NLP community and applied them to this problem as a quick fix.
Overall, the paper was easy to read and a good paper, but I’m not sure it was solving a major problem or that the solution was very technically impressive. The paper was worth a read and I’m excited for the in class discussion to see if maybe this problem was much larger than the paper convinced me.
Schema mapping was one of the methods for resolving semantic heterogeneity in the previous paper; the problem involves having two or more schema systems that may represent logically similar information. The solution is two-fold: we first want to *match* the two systems in order to find out if there is any relation between the two. Second, we want to discover a *mapping* such that certain parts of one schema can be logically transformed in some way to parts of the other schema. This is useful when we have multiple datasets (semi-structured, or simply a collection of structures that we don't want to sift through to find relations) and would like to perform joint operations or avoid redundancy in data manipulation. However, this problem is difficult, as there may be countless representations, logical and linguistic, of the same data model. The paper gives an overview of other matching techniques and introduces a new approach for a generalized matching algorithm, Cupid.
Other matching techniques consist of a hodgepodge of styles: considering schema information versus instance information, considering elements and attributes rather than holistic structure, or even the semantic structure of elements. SEMINT, for example, defines constraints to map instance structure into a high-dimensional Euclidean space and clusters by distance. DIKE is another hybrid system that leverages the assumption that schema elements close to each other are more likely to be related than elements far away, and uses semantic analysis to determine which elements are more likely to be related. Each prototype described (many are not mentioned here) is said to be incomplete, as it makes use of only a small subset of the features that could power a schema representation. Cupid, on the other hand, attempts to combine many features to tackle the schema matching problem, explicitly: linguistic text matching, element and structure matching, inspection of atomic comparisons, and accounting for keys, references, and views.
Cupid first decomposes a schema into a hierarchical tree and uses a dictionary to match common abbreviations between schemas; in a second phase, it performs structural matching to match subtrees that appear very similar in form. The first phase, linguistic matching, tokenizes element names and performs dictionary/thesaurus-based tagging to group conceptually similar elements. Unigram matching then computes semantic similarity between elements, and the end result is a table of matching coefficients between schema elements. Traversal of the schema trees outputs a list of mapped elements between leaf and non-leaf elements. However, since schema models are not necessarily trees, they extend Cupid to graph structures (through the addition of containment/aggregation/propagation semantics). By matching referential constraints and traversing the DAG of the aforementioned schema features, they build a schema matching between nodes in each graph. They then present a thorough explanation of their findings, as well as conclusions based on comparing their matching to the outputs of other algorithms.
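The tokenization step of the first phase can be sketched as below. The regex and the abbreviation table are illustrative assumptions, not the paper's actual dictionary.

```python
# Rough sketch of name normalization: split element names on CamelCase
# boundaries and expand known short forms. The regex and abbreviation table
# are illustrative assumptions.

import re

ABBREV = {"po": "purchase order", "qty": "quantity", "uom": "unit of measure"}

def tokenize(name: str):
    low = name.lower()
    if low in ABBREV:                     # whole name is a known short form
        return ABBREV[low].split()
    # Split on CamelCase boundaries, acronym runs, and digits.
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    tokens = []
    for p in parts:
        tokens.extend(ABBREV.get(p.lower(), p.lower()).split())
    return tokens

print(tokenize("POShipTo"))   # ['purchase', 'order', 'ship', 'to']
print(tokenize("UoM"))        # ['unit', 'of', 'measure']
```

The resulting token lists are what the thesaurus-based tagging and unigram matching described above would operate on.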
Overall, I thought this paper was comprehensive in its study and description of Cupid. They included an analysis of what worked and what didn't (e.g., using elements not necessarily "close" in order to boost matching), a "real world" example of schema matching in the BizTalk system, and an evaluation across several systems. I think this was a very well-rounded and thorough paper, and for its purposes it is quite self-contained. Perhaps one area for improvement would be a more thorough explanation of the "optional features" section, especially since initial mappings seem to play a critical role in the schema matching system.
The paper talks about Cupid, an algorithm to discover mappings between schema elements based on their names, data types, constraints, and schema structures, using a broader set of techniques than past approaches. Usually, schema matching is studied as part of a bigger application; this paper wants to show that the Match operation is a pervasive, important, and difficult problem that should be studied independently.
The paper focuses on schema matching problems related to mapping discovery, the activity that returns mapping elements identifying related elements of two schemas. Since schema matching is inherently subjective, another goal is to get validation from the user. Various schema matchers employ various techniques: schema- vs. instance-based, element vs. structure granularity, linguistic-based, constraint-based, matching cardinality, auxiliary information, or individual vs. combinational. The paper mentions several schema matchers that use combinations of these techniques (of which DIKE and MOMIS-ARTEMIS are later compared with Cupid). Cupid is built to provide a general-purpose schema matching component by combining those techniques.
Matching is done in two phases. The first is linguistic matching, which matches individual schema elements based on their names, resulting in a linguistic similarity coefficient; this phase consists of normalization, categorization, and comparison operations. The second phase is structural matching of schema elements based on the similarity of their contexts or vicinities. Cupid then performs mapping generation. From this base, Cupid extends to more general schemas by matching shared types and referential constraints; other features include optionality, views, initial mappings, lazy expansion, and pruning of leaves. Lastly, the paper conducts a comparative study between Cupid, DIKE, and ARTEMIS-MOMIS, using a canonical example and a real-world example. The canonical example shows that Cupid is able to overcome some differences in schema element names thanks to the normalization performed as part of matching, that Cupid is robust to different nestings of schema elements due to its reliance on leaves, and that Cupid is the only tool that can disambiguate context-dependent mappings. In the real-world example, however, none of the systems tested could match the CustomerName column in one schema to either ContactFirstname or ContactLastName in the other.
The experiments conclude: (1) linguistic matching of schema element names results in useful matches; (2) the thesaurus plays a crucial role in linguistic matching; (3) Cupid cannot distinguish between the instances of a single XML attribute in multiple contexts using linguistic similarity without structural similarity; (4) the granularity of similarity computation matters; (5) using the leaves of the schema tree for structural similarity allows Cupid to match similar schemas that have different nestings; (6) it is better to incorporate structural information beyond the immediate vicinity; (7) context-dependent mappings generated by constructing schema trees are useful; (8) the performance parameters need analysis (auto-tuning); and (9) user interaction is important (and not yet captured in this version of Cupid).
One of the major contributions of the paper is showing that there are ways to do generic schema matching across domains (without being too dependent on domain experts), although it is not fully independent. Cupid integrates linguistic and structural matching, context-dependent matching of shared types, and a bias toward the leaf structure where much of the schema content resides. Cupid shows how to mix linguistic and structural matching to get the advantages of both, even when compared to two other hybrid schema matching applications (DIKE and MOMIS). Even then, the authors admit that many open problems are yet to be solved in Cupid.
Cupid still relies heavily on a thesaurus supplied manually by users; the match operation is only as good as the completeness of that thesaurus. I think the thesaurus could be developed into something more complex that suggests words that may be synonyms/hypernyms, derived not only from user input but from analyzing common usage (i.e., on the internet). Also, I wonder if there is a way to develop Cupid so that it learns from previous schema matching operations, like a hybrid between schema-based and instance-based.
The purpose of this paper is to present a novel schema-matching algorithm named Cupid that uses two phases -- a linguistic matching phase followed by a structure mapping phase -- to more accurately perform schema matching when compared to existing methods.
The technical contributions of this paper are numerous. In the linguistic matching phase, they perform better tokenization and thesaurus matching of terms and abbreviations than existing methods. They then use a structure-based scoring phase to evaluate how similar nodes in different schemas are to each other based on their relationships with other nodes. I think this latter contribution is one of the more major ones. It seems that existing methods do perform a large amount of linguistic matching, but their structure matching algorithm is novel in its approach. They are also presenting an algorithm which incorporates many of the different kinds of heuristics used by schema-matching algorithms which they claim is crucial for achieving an accurate mapping.
I think this paper has many strengths. I was grateful for their presentation of the structure matching phase first in the case where the schema is a hierarchical tree. Explaining this reduced case first allows the reader to more easily follow the extension to general directed graphs later in the paper. Additionally, I think they do a good job of justifying the choices they make in the development of their algorithm that differ from what is canonically used in existing algorithms or from the intuitive choice. One example is their discussion of why they process their structure trees bottom-up rather than top-down, even though this comes at a computational expense. This adds to the strength of their development story and thus their results.
I think one of the only weaknesses in this paper comes in the experimental evaluation section. I wish there were more quantitative comparisons between Cupid and the other matching algorithms. It is informative to report the data that they did, but more quantitative metrics would let me see how much better Cupid is as an algorithm. With a larger set of comparison experiments, the authors could consider a quantitative measure of Cupid's performance; on this small set, the qualitative results alone are not very informative. I was also surprised by the section that calls for more user interaction in future work. Though user input can certainly improve results, part of the benefit of Cupid is that it requires less user interaction, and the more reasonable direction, in my opinion, would be toward an automated schema matching system.
Paper Review: Generic Schema Matching with Cupid|
Paper Summary: The content of this paper is twofold. It first presents a literature review of the field of schema matching. Then it proposes a novel technique to add to this pool.
In the first part, the paper surveys a number of existing methods aimed at this problem. The description of those methods is objective and, in some cases, detailed.
The second part is the primary content, in which the paper proposes a novel method to be added to the existing ones, referred to as Cupid. As pointed out by the paper, most of the existing work is not a complete solution, since each method tends to exploit at most a few of the techniques in the taxonomy. In contrast, the model proposed in this paper aims to be a generic method for solving the schema-matching problem. The model is in essence a combination of many approaches, attempting to keep each of their advantages.
The primary strength of this paper comes from the novelty and advantages of the proposed model. As a combination of many approaches, the proposed model inherits a number of favorable properties, and the paper does a good job of conveying them. One question remains about the general approach: as a method that integrates a number of existing approaches, what intuition guides how those approaches are balanced, i.e., how the feedback from the different modules is aggregated? Intuitively, the aggregation process can make a large difference even when all the submodules perform the same.
Some defects of the paper are also noticeable. While the paper claims the proposed model has multiple favorable properties, there are not many experiments to support those claims. It would have been far more convincing if experiments had been conducted showing that, for some of the listed properties, the model can leverage them to outperform existing models on some tasks. There are, indeed, some comparisons in the paper; however, the results lack numerical comparisons and thus are not especially intuitive or convincing.
This paper proposes a generic schema matching algorithm, Cupid, which can discover mappings between schema elements based on their names, data types, constraints, and schema structures. Moreover, the paper gives a thorough introduction to all existing schema matching algorithms, along with a complete taxonomy classifying them. At the end of the paper, it conducts several detailed comparisons between Cupid and two existing schema matching algorithms, DIKE and MOMIS, and shows that Cupid has better precision and fewer false positives than the other two algorithms.
The Cupid algorithm consists of three modules: the Linguistic Matching module, the Structure Matching module, and the Mapping Generation module.
1. The Linguistic Module: the linguistic module consists of three steps: 1. the normalization step, where schema element names containing abbreviations, acronyms, and punctuation are normalized into a set of name tokens; 2. the categorization step, in which the elements of each schema are grouped into a set of categories to reduce the number of element-to-element comparisons; 3. the comparison step, where a name similarity is calculated to score the similarity between two elements.
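The normalization and comparison steps above can be sketched in Python. The abbreviation table and the token-overlap (Jaccard) score below are illustrative stand-ins for Cupid's thesaurus-driven name similarity, not the paper's actual implementation:

```python
# Sketch of name normalization and token-based comparison.
import re

# Hypothetical abbreviation table; in Cupid this comes from a user-supplied thesaurus.
ABBREV = {"po": "purchase order", "qty": "quantity", "uom": "unit of measure"}

def normalize(name):
    # Split CamelCase, then split on punctuation/whitespace, lowercase,
    # and expand known abbreviations into full words.
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", " ", name)
    tokens = []
    for part in re.split(r"[_\-\s]+", spaced):
        part = part.lower()
        if part:
            tokens.extend(ABBREV.get(part, part).split())
    return set(tokens)

def name_similarity(a, b):
    # Jaccard overlap of the two token sets (0.0 when both are empty).
    ta, tb = normalize(a), normalize(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

print(name_similarity("POShipTo", "PurchaseOrderShipTo"))  # 1.0
```

With abbreviation expansion, "POShipTo" and "PurchaseOrderShipTo" normalize to the same token set, which is exactly the kind of match pure string comparison would miss.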
2. Structure matching module: in the structure matching module, the two schemas are evaluated at the structure level, utilizing the linguistic similarity computed by the linguistic module for the comparison of each leaf and non-leaf node. A post-order (reverse topological order) traversal is applied for the comparison between the two schemas. Several optimizations are also applied to improve the performance and the precision of the algorithm; for example, special-case handling of IsDerivedFrom relationships and a referential-constraint mapping are added for the generic mapping model.
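A toy version of this leaf-oriented, bottom-up structural comparison might look as follows; the tuple-based tree encoding and the first-letter leaf similarity are assumptions for illustration, not Cupid's actual data structures or formulas:

```python
# Toy bottom-up structural similarity over schema trees.

def leaves(node):
    # node = (name, [children]); a node with no children is a leaf.
    name, children = node
    if not children:
        return [name]
    result = []
    for child in children:
        result.extend(leaves(child))
    return result

def structural_similarity(a, b, leaf_sim, threshold=0.5):
    # Fraction of leaves, on either side, that have a strong match
    # (leaf_sim above threshold) somewhere in the other subtree.
    la, lb = leaves(a), leaves(b)
    strong_a = sum(1 for x in la if any(leaf_sim(x, y) > threshold for y in lb))
    strong_b = sum(1 for y in lb if any(leaf_sim(x, y) > threshold for x in la))
    return (strong_a + strong_b) / (len(la) + len(lb))

# Toy leaf similarity (shared first letter); a real matcher would plug in
# the linguistic similarity computed in the previous phase.
leaf_sim = lambda x, y: 1.0 if x.lower()[0] == y.lower()[0] else 0.0

po = ("PO", [("POLines", [("Item", [("Qty", []), ("UoM", [])])])])
order = ("Order", [("Items", [("Item", [("Quantity", []), ("UnitOfMeasure", [])])])])
print(structural_similarity(po, order, leaf_sim))  # 1.0
```

The leaf emphasis makes the two purchase-order trees score as similar even though their intermediate node names ("POLines" vs. "Items") differ, which matches the intuition given in the paper.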
3. Mapping generation module: in this module, leaf mappings are straightforward, but non-leaf mappings involve a second post-order traversal of the two schemas to recompute the similarities of non-leaf elements.
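A minimal sketch of mapping generation, assuming weighted similarity scores have already been computed; the 0.6 acceptance threshold and the greedy 1:1 selection policy are illustrative choices, not the paper's:

```python
# Greedy mapping generation from precomputed weighted similarity scores.

def generate_mapping(elements_a, elements_b, wsim, accept=0.6):
    # wsim: dict mapping (a, b) -> weighted similarity score.
    candidates = sorted(
        ((wsim[(a, b)], a, b) for a in elements_a for b in elements_b),
        reverse=True,
    )
    mapping, used_a, used_b = {}, set(), set()
    for score, a, b in candidates:
        # Accept the highest-scoring pairs first, one match per element.
        if score >= accept and a not in used_a and b not in used_b:
            mapping[a] = b
            used_a.add(a)
            used_b.add(b)
    return mapping

wsim = {("Qty", "Quantity"): 0.9, ("Qty", "UnitPrice"): 0.2,
        ("UoM", "Quantity"): 0.3, ("UoM", "UnitPrice"): 0.1}
print(generate_mapping(["Qty", "UoM"], ["Quantity", "UnitPrice"], wsim))
# {'Qty': 'Quantity'}  ("UoM" has no candidate above the threshold)
```

Elements whose best candidate falls below the acceptance threshold are simply left unmatched, for a human expert to resolve.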
1. This paper introduces a generic schema matching algorithm, tackling a narrow but important problem in database systems. Schema matching is very important for future ETL workloads, and the paper has inspired several follow-up works in this area.
2. This paper gives a complete taxonomy of existing schema matching algorithms and also precisely defines the problem of schema matching.
3. This paper conducts several experiments comparing Cupid and its peer algorithms, both with canonical examples and with real-world examples, which makes its results very convincing. Moreover, this style of comparison is a very good example for follow-up works and for scientific experiments in general.
1. Although this paper gives a complete experimental comparison of Cupid and two other schema matching algorithms, there is no comparison between Cupid, as a schema-based algorithm, and instance-based algorithms. It would be more interesting if more comparisons between schema-based and instance-based algorithms were shown in this paper.
2. A small typo in this paper is very confusing while reading: in section 5.2, when it introduces the container category, it says containment is described in more detail in section 7.1, when what it actually means is section 8.1.
This paper introduces a generic schema matching algorithm called Cupid. Schema matching is an important step in data integration. The paper gives a taxonomy of past approaches to schema matching and proposes a new algorithm, Cupid, which leverages various information including attribute names, data types, constraints, and schema structure, applying a broader set of techniques than past approaches. The schema manipulation operation that identifies corresponding elements in a pair of schemas is called Match. Schema matching is a hard problem, as people can express an identical concept with schemas of different naming and structure. To solve this problem, a variety of methods have been proposed in the past. The authors classify them along several orthogonal criteria: schema-based vs. instance-based, element vs. structure granularity, linguistic-based, constraint-based, matching cardinality, use of auxiliary information, and individual vs. combinational matchers. Inspired by those approaches, this paper proposes a combinational algorithm that leverages many different schema matching techniques, combining automated linguistic-based matching with element-based and structure-based matching. Cupid can generate 1:1 or 1:n mappings. |
Cupid has two main steps:
It first computes similarity coefficients between elements of the two schemas and then deduces a mapping from those coefficients. The coefficients are calculated by linguistic matching and structure matching: linguistic matching matches individual schema elements based on their names, data types, and domains, while structural matching groups schema elements based on the similarity of their contexts and vicinities.
In the second step, it generates a schema mapping by choosing pairs of schema elements with maximal weighted similarity.
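The weighted similarity in this second step can be sketched as a convex combination of the structural and linguistic scores; the 0.5 default weight below is an assumed value, not a constant taken from the paper:

```python
# Weighted similarity as a convex combination of structural similarity
# (ssim) and linguistic similarity (lsim); w_struct is an assumed weight.
def weighted_similarity(ssim, lsim, w_struct=0.5):
    if not 0.0 <= w_struct <= 1.0:
        raise ValueError("w_struct must lie in [0, 1]")
    return w_struct * ssim + (1.0 - w_struct) * lsim

# Leaning toward structure: here ssim counts four times as much as lsim.
score = weighted_similarity(0.8, 0.6, w_struct=0.8)
```

Tuning the weight is one concrete way the aggregation of the two phases can be balanced for a given domain.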
The main advantage of Cupid is its integration of several previously proposed techniques, especially the combination of linguistic-based and structure-based matching. It is widely recognized that schema matching is so hard a problem that no single technique can tackle it, so integrating complementary techniques is the right direction. Another contribution of this paper is the evaluation model for comparing schema matching algorithms.
A weakness of this paper is its lack of use of the information provided by data instances. It would be feasible for Cupid to apply machine learning techniques to calculate the similarity of data instances between two schemas; such information would be very helpful when deciding whether to match similar attributes.