HoloClean is a framework for holistic data repairing driven by probabilistic inference. The paper introduces the motivation for this framework and the optimizations applied to scale inference for data repairing, and studies the tradeoffs between the quality of repairs and the runtime of HoloClean under those optimizations. HoloClean's approach is to build a system that unifies integrity constraints, external data, and quantitative statistics to repair errors in structured datasets. The main challenge is how to use probability to reason about inconsistencies across the different signals used to repair data. The goal of HoloClean is to identify and repair erroneous records in a structured dataset, and the solution proceeds in three phases: error detection, compilation into a probabilistic model, and data repairing via statistical learning and inference over the data distribution. The main contributions of the HoloClean system are: (1) a compiler that generates a probabilistic model for repairing a dataset; (2) an algorithm that uses Bayesian analysis to prune the domain of the random variables corresponding to noisy cells in the input dataset; (3) an approximation scheme that relaxes hard integrity constraints. The experimental results provide the following takeaways: (1) HoloClean's runtime is higher than that of competing methods, but it produces more accurate results; (2) HoloClean's domain-pruning strategy is important for accurate repairs and for the scalability of the system; (3) relaxing denial constraints leads to more scalable models with higher-quality repairs. I like this paper for two reasons. First, it gives a good explanation of the problem being addressed, with examples that are easy to understand. Second, it provides comprehensive experiments covering all concerns about the system: its runtime, the accuracy of its results, and the behavior of the different optimizations described in the paper. The experimental results and analysis are very reader-friendly and give readers insight into HoloClean's system behavior. After reading this paper, I am very interested in the future extension mentioned in the paper: how data programming and data cleaning can be further unified. |
Modern data analysis processes data with high volume and variability, so it is important to identify and repair inconsistencies caused by incorrect, missing, and duplicate data. Data cleaning addresses these problems by ensuring that the data adheres to desirable quality and integrity constraints. The paper introduces HoloClean, a framework for holistic data repairing driven by probabilistic inference. Given an inconsistent dataset as input, HoloClean automatically generates a probabilistic program that performs data repairing. The paper also introduces optimizations to scale inference for data repairing and studies the tradeoffs between repair quality and runtime for those optimizations. Some of the contributions and strengths of this paper are: 1. The paper provides many charts, algorithms, and experiments to help the reader understand the concepts and the implementation. 2. HoloClean unifies different signals for data repairing (integrity constraints, external data, and quantitative statistics) instead of focusing on one of them like previous systems. 3. The paper uses Bayesian analysis to prune the domain of the random variables, which allows the tradeoff between scalability and quality of repairs to be made systematically. Some of the drawbacks of this paper are: 1. Data cleaning in the paper is cast in a different probabilistic framework from data programming. 2. The font size and style are not consistent. |
Data cleaning consists of error detection and data repairing, and many approaches have been explored for these two parts. However, these methods still perform poorly on automatic data repairing. The reason is that they limit themselves to only a subset of the available signals, including integrity constraints, external information, and statistical profiling of the input dataset, and ignore additional information that is useful for identifying the correct value of erroneous records. Therefore, this paper proposes HoloClean, a framework for holistic data repairing driven by probabilistic inference. The motivation is to perform data repairing by combining all the aforementioned signals in a unified framework. HoloClean is the first data cleaning system that unifies integrity constraints, external data, and quantitative statistics to repair errors in structured datasets. Instead of relying on each signal alone to perform data repairing, HoloClean uses all available signals to suggest data repairs. The main contributions of this paper are as follows: 1. The paper designs a compiler that automatically generates a probabilistic model for repairing a dataset, which supports different error detection methods, such as constraint violation detection and outlier detection. 2. The paper designs an algorithm that uses Bayesian analysis to prune the domain of the random variables corresponding to noisy cells in the input dataset. This algorithm helps systematically trade off the scalability of HoloClean against the quality of the repairs it obtains (a small sketch of this pruning idea follows this review). The paper also introduces a scheme that partitions the input dataset into non-overlapping groups of tuples and enumerates the correlations introduced by integrity constraints only for tuples within the same group. 3. The paper introduces an approximation scheme that relaxes hard integrity constraints to priors over the random variables in HoloClean's probabilistic model. This relaxation results in a probabilistic model with independent random variables for which Gibbs sampling only requires a polynomial number of samples to mix. The main advantages of this paper are as follows: 1. HoloClean is a statistical inference engine to impute, clean, and enrich data. As a weakly supervised machine learning system, HoloClean leverages available quality rules, value correlations, reference data, and multiple other signals to build a probabilistic model that accurately captures the data generation process, and uses the model in a variety of data curation tasks. 2. HoloClean allows data practitioners and scientists to save the enormous time they spend building piecemeal cleaning solutions, and instead effectively communicate their domain knowledge in a declarative way to enable accurate analytics, predictions, and insights from noisy, incomplete, and erroneous data. The main disadvantage of this paper is that the data repairing computation is both time-consuming and space-consuming. When I tried to replicate the experimental results, running HoloClean on the Food dataset gave me a memory error, which means the data repairing must be executed on a machine with a large amount of memory. |
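The Bayesian domain-pruning idea mentioned in contribution 2 above can be illustrated with a small, self-contained sketch. This is not HoloClean's implementation; it is a minimal Python illustration, with made-up data and a made-up threshold tau, of pruning a noisy cell's candidate domain to values that co-occur frequently with the tuple's other attribute values.

```python
from collections import Counter

# Toy relation: each tuple is a dict of attribute -> value.
tuples = [
    {"city": "Chicago", "state": "IL", "zip": "60608"},
    {"city": "Chicago", "state": "IL", "zip": "60609"},
    {"city": "Chicago", "state": "XX", "zip": "60609"},  # noisy state cell
    {"city": "Madison", "state": "WI", "zip": "53703"},
]

def candidate_domain(tuples, target_attr, tuple_idx, tau=0.3):
    """Prune the domain of the noisy cell (tuple_idx, target_attr) to values
    whose conditional co-occurrence frequency with some other attribute value
    of the same tuple is at least tau."""
    t = tuples[tuple_idx]
    candidates = set()
    for other_attr, other_val in t.items():
        if other_attr == target_attr:
            continue
        # Count target-attribute values among tuples sharing other_attr = other_val.
        joint = Counter(u[target_attr] for u in tuples if u[other_attr] == other_val)
        total = sum(joint.values())
        for val, cnt in joint.items():
            if cnt / total >= tau:
                candidates.add(val)
    return candidates

# Candidate repairs for the noisy 'state' cell of the third tuple.
print(candidate_domain(tuples, "state", 2))  # e.g. {'IL', 'XX'}
```

Raising tau shrinks the candidate domains (faster inference, possibly missing the true value), while lowering it does the opposite, which is the scalability/quality trade-off the review describes.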
Problem & Motivation The process of ensuring that data adheres to desirable quality and integrity constraints (ICs), referred to as data cleaning, is a major challenge in most data-driven applications. Though many error detection methods achieve a high precision and recall, the start-of-the-art repairing methods behaves poor. And the methods depending on different approaches have their own cons. For methods depending on the minimality, the issue is that the minimal repairs is not equivalent to correct repairing and can be illustrated in Figure 1(E) given the correct zip code is also minority. The methods depending on the external data relies on the quality of the external data. Under the situations that there is not external data or poor-quality external data, the data repairing method achieves a poor quality. For techniques that leveraging quantitative statistics, they overlook the integrity constraints. Contributions: The authors propose the HoloClean – the algorithms to identify and repair all cells whose initial observed value is different from their true value on structured datasets. HoloClean solves the problem in a quantitively statistic way. HoloClean first detect the error and uses the variables with values that are not reported error as training data to learn a probabilistic model for repairing erroneous cells. To overcome the cons the traditional wisdom has, it also allows the user to specify the denial constraints and use external datasets or dictionaries. The logic is simple: if user specifies those items, then the system will try to repair data with denial constraints or external datasets; if it cannot repair, then it builds the quantitative statistic model and try to solve it. Another big part of the HoloClean is to make it more scalable. For naïve globally consistent model, it enumerates all the possible correlations and then infer the relationships. This is NP-complete problem. However, what the authors propose that if given the user wisdom, we can reduce the cost and make the system more scalable by introducing features over independent random variables where the features are specified by user wisdom. For example, if the user introduces the integrity as “the same university must be in the same state”, then the system will create the feature f(university, state) and vote for the result. Drawbacks This algorithm heavily depends on the user wisdom. Without the user wisdom, it makes not difference than other systems. In other words, it makes little progress on auto-repairing part – the statistic model part. |
Data cleaning is a laborious and time-intensive process, due to malformed data and other realities of data collection, like semantic heterogeneity. Since researchers and others are more interested in working with the data to generate insights rather than manually fixing data to make it well formed, the case for an automatic method to perform these sorts of tasks is very strong. This paper introduces a framework called HoloClean that performs holistic data repairing driven by probabilistic inference. At the time, many of the methods in use relied on integrity constraints or external data sources, which suffer from limitations when external sources do not have good coverage, or when there is insufficient knowledge for a given domain. According to the authors, HoloClean's use of quantitative statistics addresses many of these weaknesses in that its probabilistic model can reason about how to resolve inconsistencies. HoloClean is designed with the goal of identifying and repairing erroneous records in a structured dataset D, which can be characterized by attributes A1, A2, ..., AN. Given this dirty dataset D along with a set of available repairing constraints, HoloClean works using the following three steps: 1. Error detection – HoloClean separates D into noisy and clean cells 2. Compilation – Using the observed cell values and repairing constraints, HoloClean uses probabilistic semantics to express the uncertainty over the value of noisy cells. They are represented as factor graphs. 3. Data repairing – HoloClean uses statistical learning and inference (empirical risk minimization) over the joint distribution to compute marginal probabilities and assign each noisy cell the value that maximizes its marginal (i.e., make the repair). HoloClean uses a variety of signals (denial constraints, external data, matching dependencies, the minimality principle, etc.) and compiles them to a DDlog program that defines the factor graph used to repair the input dataset. This DDlog program contains rules that capture the quantitative statistics of the dataset, rules that encode matching dependencies over external data, rules that represent dependencies due to integrity constraints, and rules that encode the principle of minimality. The compilation starts by generating relations used to form the body of DDlog rules, followed by using those relations to generate inference DDlog rules that define its probabilistic model. Once this is done, HoloClean grounds the DDlog program and runs Gibbs sampling for inference. Some optimizations are necessary to allow the grounding to scale to a large number of relations, as well as to make the Gibbs sampling mix faster. The primary strength of this paper is that it introduces a method that improves on the previous state of the art in an increasingly important area, data cleaning. According to the experimental results presented by the authors, HoloClean outperforms all of the other methods it is compared to (Holistic, KATARA, and SCARE) in terms of F1-score, as well as speed in some cases. Additionally, since the datasets being tested came from real-world situations and are arguably quite semantically complex in some cases, it adds support to the possibility of being able to meaningfully use HoloClean in real-world settings. One weakness of HoloClean, addressed by the authors in many regards, is the question of how well it is able to scale to large datasets.
While they did some optimizations to help on that front, it remains to be seen just how well it can handle web-scale datasets. It would also be interesting to see cases where it fails, since erroneously correcting data and then working with it could potentially be more dangerous than just excluding malformed data in the first place. |
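The three-phase workflow the review above summarizes (error detection, compilation into a probabilistic model, repair via inference) can be sketched as a thin pipeline. The function names and the per-cell majority "model" below are invented stand-ins for illustration, not HoloClean's or DeepDive's API.

```python
# A minimal, illustrative pipeline: the structure mirrors the three phases,
# but the "model" here is just a per-cell vote over candidate values rather
# than a real factor graph.

def detect_errors(rows, detector):
    """Phase 1: split cell coordinates (row_index, attribute) into noisy and clean."""
    noisy = set(detector(rows))
    clean = {(i, a) for i, r in enumerate(rows) for a in r} - noisy
    return noisy, clean

def compile_model(rows, noisy):
    """Phase 2: attach candidate values to each noisy cell (a stand-in for
    generating random variables and factors)."""
    return {cell: sorted({r[cell[1]] for r in rows}) for cell in noisy}

def repair(rows, model):
    """Phase 3: pick the most frequent candidate (a stand-in for inference)."""
    repaired = [dict(r) for r in rows]
    for (i, attr), candidates in model.items():
        counts = {v: sum(r[attr] == v for r in rows) for v in candidates}
        repaired[i][attr] = max(counts, key=counts.get)
    return repaired

rows = [{"state": "IL"}, {"state": "IL"}, {"state": "XX"}]
noisy, clean = detect_errors(rows, lambda rs: {(2, "state")})  # detector is a black box
model = compile_model(rows, noisy)
print(repair(rows, model))  # the 'XX' cell is repaired to 'IL'
```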
The purpose of this paper is to introduce the HoloClean tool for cleaning data by using probabilistic methods based on the properties of the input data. They define data cleaning as ensuring that your data lives up to a standard of quality and integrity constraints (ICs). Current research and industrial efforts go toward error detection, which is detecting incorrect, missing, or duplicate data, and data repairing, which is removing these errors. State-of-the-art automatic error detection is mainly based on violations of integrity constraints or on duplicate and outlier detection. Data repairing techniques are based on several signals: statistical profiling of the input data, knowledge bases and human assistance, and integrity constraints. For data repairing the author introduces the idea of minimality, where given two candidate sets of repairs, we prefer the one with the fewest changes from the original data that still preserves the integrity constraints. The author then introduces HoloClean, which combines integrity constraints, external data, and quantitative statistics to repair data. Each of these signals is considered evidence of the correctness of the data records in the input dataset. HoloClean generates a statistical model for the data where random variables encode the uncertainty in the input dataset and the signals are represented as features of the graphical model. Combining all three signals showed over a 2x improvement over methods using only one of the three. The author gives background on probabilistic inference problems, which consist of grounding, which is generating a factor graph for the joint distribution of all the variables, and inference, which is computing the probability over each random variable. Challenges addressed by this paper's contribution were the intractability of integrity constraints that span too many attributes, avoiding evaluating integrity constraints for tuple pairs that cannot result in a violation (because the factor graph can quickly become quadratic in size), and the fact that probabilistic inference is a #P-complete problem, so scalable probabilistic models must be found by using approximation methods that trade off computational efficiency for accuracy. The author had three main contributions. The first contribution was developing a compiler that supports multiple error detection methods and automatically generates a probabilistic model for data repair. The second was using Bayesian analysis to prune the domain of the random variables representing noisy cells in the input dataset, which allows the user to trade off scalability for the quality of the returned repairs. The third and final main contribution was their approximation method that relaxes hard integrity constraints in the probabilistic model, resulting in independent random variables in the model, which lets Gibbs sampling mix with only a polynomial number of samples. The HoloClean solution takes in a dirty database and a set of constraints, and runs three steps to clean that data. The first step is error detection, which HoloClean treats as a black box where the user determines which data points are erroneous. The next step is compilation, where it determines the uncertainty over the erroneous cells identified in the first step. The third step is the data repairing, where an objective function is maximized that maximizes the probability of all of the random variables. 
I liked the level of detail that this paper went into; it gave a sufficient amount of statistical background and made sure the reader thoroughly understood the statistical and machine learning concepts that the paper is built on. However, I was somewhat lost on the example in Figure 1 and in the accompanying text near the front of the paper, which shows how to clean data based on the different signals; this confused and slowed me down early on. |
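The "grounding" step the review above mentions (turning high-level rules into a concrete factor graph) can be pictured with a toy enumeration. The factor encoding below is invented for illustration and is far simpler than DeepDive's actual grounding; note how the pairwise constraint is only grounded over tuple pairs that can actually violate it, which is one of the pruning ideas the review describes.

```python
# Toy grounding: turn two "rules" into concrete factors over cell variables.
# A variable is a cell (row_index, attribute); a factor is (name, variables).

rows = [
    {"city": "Chicago", "state": "IL"},
    {"city": "Chicago", "state": "XX"},
    {"city": "Madison", "state": "WI"},
]

factors = []

# Rule 1 (unary): a co-occurrence feature for every cell.
for i, row in enumerate(rows):
    for attr in row:
        factors.append(("cooccur_feature", [(i, attr)]))

# Rule 2 (pairwise): ground a denial constraint "same city implies same state"
# only over tuple pairs that share a city, avoiding the full quadratic blowup.
for i in range(len(rows)):
    for j in range(i + 1, len(rows)):
        if rows[i]["city"] == rows[j]["city"]:
            factors.append(("dc_same_city_same_state", [(i, "state"), (j, "state")]))

for f in factors:
    print(f)
```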
HoloClean: Holistic Data Repairs with Probabilistic Inference. This paper proposes HoloClean, a system for holistic data repairs with probabilistic inference. Holistic data repair consists of error detection and data repairing. For data repairing, state-of-the-art methods adopt a variety of signals such as integrity constraints, external information (for instance, dictionaries, knowledge bases, and annotations by human experts), and statistical profiling of the input dataset. However, these methods perform poorly on real-world datasets because they limit themselves to only one of the aforementioned signals and ignore additional information that is useful for identifying the correct value of erroneous records. In other words, there is a lot of information that could be used but has not been used for data repairing, and HoloClean uses it. HoloClean is a framework for holistic data repairing driven by probabilistic inference that unifies existing qualitative data repairing approaches, which rely on integrity constraints or external data sources, with quantitative data repairing methods, which leverage statistical properties of the input data. The contribution of this paper is data repairing that is roughly twice as accurate as previous methods. However, considering that the paper essentially implements a combination of existing methods, I think its real contribution is limited. The advantage of this paper: HoloClean allows data practitioners and scientists to save the enormous time they spend building piecemeal cleaning solutions and instead effectively communicate their domain knowledge in a declarative way to enable accurate analytics, predictions, and insights from noisy, incomplete, and erroneous data. The disadvantages of this paper are that it does not propose anything fundamentally new, just a combination of existing methods; the basic holistic approach to data repairing is time-consuming and space-consuming; and the paper selects the datasets that best suit its model, which I think is not a good thing. |
In recent years, the widespread existence of massive data has again highlighted the importance of data quality. A large number of data cleaning methods have been proposed, including detecting and repairing errors based on defined integrity constraints, introducing external knowledge (such as knowledge bases, dictionaries, or consulting experts) to match and repair erroneous information, and using quantitative statistics to perform tasks such as outlier detection and repair. However, because these methods only repair data based on a single source of information, they cannot guarantee the comprehensiveness and reliability of the fix. Therefore, to make repair results more accurate and reliable, this paper proposes HoloClean, a framework for holistic data repair that integrates multiple data cleaning methods and uses the information provided by all of them to produce the best fix for each error. The HoloClean framework compiles the problem into a probabilistic graphical model and integrates multiple data cleaning methods into that model as repair signals. These repair signals include quantitative statistics, external data, dependencies from integrity constraints, and the principle of minimal repair. HoloClean learns the weights of the different repair signals from a small amount of training data, thus producing the most suitable repair for each error (a small sketch of this weighting idea follows this review). Since HoloClean's repairs combine various data cleaning methods, the repair quality is better than that of any single data cleaning method. Experiments show that the method reaches an average precision of about 90% and an average recall of about 76%, and its average F1 is twice that of other methods. HoloClean can also customize the weight of each repair signal, making it adaptable to different types of data repair tasks. The main contributions of HoloClean are that the paper designs a compiler that automatically generates a probabilistic model for repairing a dataset, designs an algorithm that uses Bayesian analysis to prune the domain of the random variables corresponding to noisy cells in the input dataset, and introduces an approximation scheme that relaxes hard integrity constraints to priors over the random variables in HoloClean's probabilistic model. |
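The signal-weighting idea mentioned above (learning how much to trust each repair signal from a small amount of labeled data) can be sketched with a tiny log-linear model trained by gradient ascent. The features, training pairs, and learning rate are all made up for illustration; this is not HoloClean's learning code, which performs empirical risk minimization at much larger scale.

```python
import math

# Each candidate repair is described by a feature vector from the different
# signals (co-occurrence statistic, external-data match, constraint violation).
# A log-linear model scores candidates; weights are learned from cells whose
# correct value is known (the "clean" cells act as weak supervision).

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def score(weights, features):
    return sum(w * f for w, f in zip(weights, features))

# Two training cells; for each, the feature vectors of its candidate values and
# the index of the known-correct candidate (made-up numbers).
training = [
    ([[0.9, 1.0, 0.0], [0.1, 0.0, 1.0]], 0),
    ([[0.2, 0.0, 1.0], [0.8, 1.0, 0.0]], 1),
]

weights = [0.0, 0.0, 0.0]
lr = 0.5
for _ in range(200):  # simple gradient ascent on the log-likelihood
    for candidates, correct in training:
        probs = softmax([score(weights, f) for f in candidates])
        for k, feats in enumerate(candidates):
            indicator = 1.0 if k == correct else 0.0
            for d in range(len(weights)):
                weights[d] += lr * (indicator - probs[k]) * feats[d]

print([round(w, 2) for w in weights])  # informative signals end up with higher weights
```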
HoloClean describes a more holistic data cleaning system than ones that have come before. Data cleaning is the process of taking a dirty dataset with some "errors" and turning it into a clean version. There are two steps, detecting and correcting errors, and HoloClean primarily focuses on the latter. Previous techniques had used integrity constraints, external data (knowledge bases), or statistical models. Alone, these achieve okay results, but HoloClean puts them together with the goal of achieving higher F1 scores than current methods, whose F1 is on average 0.35. HoloClean first uses existing error detection methods to find noisy and clean cells - in the results section, the method introduced in the Holistic paper (by some of the same authors) is used. Users specify denial constraints (a type of integrity constraint) and can provide external datasets to aid the data cleaning process. The three techniques previously mentioned are combined to become signals in a probabilistic model that computes fixes for the data. The principle of minimality is used - the idea is to make the fewest changes possible to fit certain constraints (a toy illustration of this prior follows this review). The signals are fed into the DeepDive inference engine to determine fixes for the data. Finally, the paper discusses some optimizations, including pruning at various stages, as the inference is costly on large datasets. The idea of combining multiple systems together is not too exciting on its own, but the authors manage to do it very well. The need for this is very clear - the examples on page 2 clearly indicate why the previous single methods did not work very well. Furthermore, the way that the authors introduce a system, explain why it is computationally expensive, and then explain how they are able to make it feasible on large datasets is a good model for describing this sort of problem. The results clearly show that HoloClean outperforms previous methods, although I do wonder if there are datasets that it would not perform well on, and would like to see that discussed. One downside to this paper is that there seem to be some holes and gaps in the results section. The paper only very briefly mentions fusion (using knowledge about sources to determine which rows are correct), but it seems to be integral to the performance on the flights dataset, which is noisy due to the presence of many sources. Furthermore, they also briefly mention that they rarely use external knowledge-base data, which is a surprise given that it is focused on quite a bit in the paper. Finally, I found the statement that "KATARA performs no repairs due to format mismatch for zip code" to be a bit of a cop-out. Was it really impossible to fix that format mismatch to get a baseline? |
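The minimality principle described above can be shown in a few lines: when candidate repairs are otherwise tied, a small prior keeps the originally observed value. The scores and the prior strength are made-up numbers purely for illustration.

```python
# A toy "minimality prior": when two candidate repairs are otherwise equally
# supported, prefer keeping the originally observed value.

def pick_repair(observed, candidate_scores, minimality_prior=-0.5):
    adjusted = {}
    for value, score in candidate_scores.items():
        penalty = 0.0 if value == observed else minimality_prior
        adjusted[value] = score + penalty
    return max(adjusted, key=adjusted.get)

# Both values are equally supported by the other signals; minimality keeps '60609'.
print(pick_repair("60609", {"60609": 1.0, "60608": 1.0}))
# With strong evidence against the observed value, the repair still changes it.
print(pick_repair("60609", {"60609": 0.2, "60608": 1.5}))
```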
In the paper "HoloClean: Holistic Data Repairs with Probabilistic Inference", Theodoros Rekatsinas and Co. discuss the implementation of HoloClean, a framework for holistic data repairing that is driven by probabilistic inference. In the real world, data collection and data entry are not perfect processes. Data itself has a measure of quality and this degrades if information is missing, duplicated, or incorrect. As a result, there is great research and effort poured into the process of data cleaning and preserving its integrity. Data cleaning is separated into two tasks: detecting errors and repairing data. It is greatly desired for both to be automatic and have a high precision. Thus, we have HoloClean. A probabilistic program is automatically generated by HoloClean. Using already existing approaches to repair data, HoloClean unifies all these methods while also leveraging the statistical properties of data. Furthermore, HoloClean is able to scale well to magnitudes of up to millions tuples with an average precision of up to 90%. Thus, one could imagine the great benefits that HoloClean can deliver when cleaning data - especially unstructured data. Thus, it is clear that this is an interesting and beneficial framework to explore. This paper is divided into multiple sections: 1) Previous works (and their faults): Many previous works use and rely on integrity constraints. These methods use the principle of minimality and make big assumptions such as assuming that the majority of the input data is clean. This can potentially lead to incorrect repairs on data and violate the initial rule of minimal repair. Some other previous works use a different method: they rely on external data and make matches within dictionaries or knowledge bases in order to detect and repair errors. This method generally works well, but fails to repair errors that don't exist within the knowledge base. The last method that is used involves heavy statistical inference and uses the co-occurrences of attributes and their distributions to detect and repair errors. This method, however, overlooks integrity constraints. Furthermore, much like the knowledge base, if it doesn't gather enough information, it cannot properly repair data. HoloClean takes all these previous approaches and combines them into one in order to achieve high precision. 2) HoloClean Framework: There still exist some challenges in implementing HoloClean. Creating a probabilistic model that provides a means for unifying all signals is not trivial - it may not scale well for larger input. To further exasperate this, integrity constraints can span multiple attributes, introduce correlations between pairs of random variables associated with tuples in the input dataset, and in the presence of complex correlation, probabilistic inference in #p-complete. Thus, HoloClean takes a careful approach to data cleaning using three simple steps: error detection, compilation, and data repairs. In error detection, cells are separated into noisy and clean cells and are treated as black boxes. Users are allowed any method they want to detect errors. In compilation, HoloClean derives a probability distribution over these noisy cells and marks each of them with the likelihood of incorrectness. Lastly, in data repairing, a learning agent decides which cells to fix based on the probability distribution. After the repairs are done, it asks the user for feedback on example cells so that it can improve itself. 
3) How well does HoloClean scale: In order to deal with the combinatorial explosion of complex probabilistic models, results that are not beneficial to the search need to be pruned. Thus, HoloClean prunes the domain of its random variables, which leads to a systematic trade-off between the scalability of HoloClean and the quality of the repairs it obtains. HoloClean also introduces a trade-off between the precision and recall of the repairs it outputs; this ends up being a necessity for HoloClean on large datasets. Furthermore, tuple partitioning before grounding speeds the system up by up to 2x while sacrificing only about 6% accuracy (a toy illustration of partitioning follows this review). Much like other papers, this paper also had some drawbacks. The first drawback that I noticed was how big a deal the authors made about runtime. In terms of use cases of data cleaning tools, a user would simply run the data cleaning application on their database and let it run for however long it takes; time does not seem to be as big a deal as they made it out to be. Another complaint I had was that the paper felt a bit too technical. I know that to some extent technical details need to be present, but it seemed like a major over-complication of previous and current work. The final drawback that I noticed is the choice of datasets in their experiments. I felt that typical datasets in those fields do not have as many errors as the authors claimed; records in the medical field in particular would have error rates of less than 1%. Unless they manually went through millions of tuples in order to verify their claims, I am not convinced. One final question that I have: if you take the same dataset and feed it into HoloClean iteratively multiple times, will it eventually be fixed of all errors? |
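The tuple-partitioning optimization mentioned above can be illustrated by comparing how many tuple pairs a denial constraint must be evaluated over with and without grouping. The blocking key ("city") and the data are made up; HoloClean's actual scheme partitions tuples so that constraint-induced correlations stay within groups, so this is only a sketch of the idea.

```python
from collections import defaultdict
from itertools import combinations

# Instead of evaluating a denial constraint over all O(n^2) tuple pairs,
# group tuples by a key and only enumerate pairs within each group.
rows = [
    {"city": "Chicago", "state": "IL"},
    {"city": "Chicago", "state": "XX"},
    {"city": "Madison", "state": "WI"},
    {"city": "Madison", "state": "WI"},
]

all_pairs = list(combinations(range(len(rows)), 2))

groups = defaultdict(list)
for i, row in enumerate(rows):
    groups[row["city"]].append(i)
partitioned_pairs = [
    pair for members in groups.values() for pair in combinations(members, 2)
]

print(len(all_pairs), "pairs without partitioning")      # 6
print(len(partitioned_pairs), "pairs within partitions")  # 2
```

On real datasets the gap between the two counts is far larger, which is where the reported speedup comes from; pairs dropped by the grouping are simply never checked, which is the source of the small accuracy loss.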
This paper describes HoloClean, a program for probabilistic data cleaning. A data cleaning algorithm is one that performs two functions: error detection, which is determining inconsistent data in a dataset, and data repair, which is changing the inconsistent data to correct the inconsistencies. HoloClean analyzes the data with respect to three elements: any integrity constraints on the data, any external data that it should match, and statistics on the data. By combining all of these, HoloClean is able to fix large amounts of dirty data. HoloClean needs to prune the combinations of constraints that it considers; otherwise, the combinations will grow too quickly for the system to handle, as there are too many correlations between tuples in a large dataset. HoloClean is split up into error detection, compilation, and data repair. It also issues a confidence level for each repair, so that low-confidence repairs can be manually checked by the user (a toy illustration of this thresholding follows this review). It represents the data as a factor graph, and aims to determine the optimal weights on the edges to maximize the probability of the current dataset occurring given the graph. The user specifies the constraints, and then HoloClean translates them into a set of DDlog rules. Each part of the data fills a DDlog relation. There are DDlog rules for the integrity constraints, rules for the external data matches, and rules for the statistics derived from the data. The positive aspects of this paper are its focus on the effectiveness of its data cleaning, as well as its justification. Combining several methods of cleaning to generate a single, more effective method makes HoloClean a useful tool. In addition, HoloClean is directly compared to three other cleaning algorithms to properly demonstrate its capabilities. On the negative side, several of the experiments don't produce any useful results, as they end up failing for various reasons. This does help demonstrate the benefits of HoloClean, which doesn't fail, but it makes it harder to make good comparisons. |
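The confidence-level idea mentioned above can be sketched as a simple post-processing rule over the inferred marginals: commit a repair only when the top candidate's marginal clears a threshold, and flag the cell for manual review otherwise. The marginal values and the threshold below are made up.

```python
# Illustrative post-processing: commit a repair only when the inferred marginal
# probability of the best candidate clears a confidence threshold; otherwise
# flag the cell for manual review.

def decide(marginals, threshold=0.8):
    best_value = max(marginals, key=marginals.get)
    if marginals[best_value] >= threshold:
        return ("repair", best_value)
    return ("review", best_value)

print(decide({"IL": 0.93, "XX": 0.07}))  # ('repair', 'IL')
print(decide({"IL": 0.55, "WI": 0.45}))  # ('review', 'IL')
```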
This paper introduces HoloClean, which is a framework for holistic data repairing driven by probabilistic inference. HoloClean unifies existing qualitative repairing approaches, which rely on integrity constraints or external data, with quantitative methods that leverage statistical properties of the input data. The paper also introduces a series of optimizations which ensure that inference over HoloClean's probabilistic model scales to instances with millions of tuples. Data cleaning is a major challenge in data-driven applications, because data analytics needs to identify and repair inconsistencies caused by incorrect, missing, and duplicate data. Data cleaning is mainly separated into two tasks: error detection and data repairing. The goal of HoloClean is to identify and repair erroneous records in a structured dataset, assuming that errors in the dataset occur due to inaccurate cell assignments. The solution of HoloClean involves detecting cells in the dataset with potentially inaccurate values, which separates the data into noisy and clean cells, then using probabilistic semantics to express the uncertainty over the value of noisy cells, and finally running statistical learning and inference over the joint distribution to compute marginal probabilities and assign the values that maximize the probability of each variable. HoloClean uses empirical risk minimization over the likelihood to compute the parameters of its probabilistic model. HoloClean compiles all available signals, including denial constraints, external data, matching dependencies, and the minimality principle, into a DDlog program. The DDlog program contains rules that capture quantitative statistics of the dataset, rules that encode matching dependencies over external data, rules that represent dependencies due to integrity constraints, and rules that encode the principle of minimality. These rules define a random variable relation, which assigns a categorical random variable to each cell. The paper gives examples of how HoloClean generates inference rules to encode the effect of features on the assignment of random variables. The grounded rules cover quantitative statistics, external data, dependencies from denial constraints, and minimality priors. Two aspects affect HoloClean's scalability: random variables with large domains, and factors that express correlations across all pairs of tuples in the dataset. There are two corresponding optimizations: pruning the domain of random variables in HoloClean's model by leveraging co-occurrence statistics over the cell values in D, and pruning the pairs of tuples over which denial constraints are evaluated. There is also an optimization that relaxes the DDlog rules to obtain a model with independent random variables, instead of enforcing denial constraints over the joint assignment of the random variables corresponding to noisy cells in the dataset (a toy illustration of this relaxation follows this review). The paper also explains the experimental setup and evaluates the results. The advantage of this paper is that it gives a lot of examples to illustrate how data cleaning is done in HoloClean, which is very clear. It also explains terminologies before diving into technical details. However, the background is long compared with the content about HoloClean itself, which made me lose focus while reading the paper. |
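The relaxation described above (replacing hard pairwise constraint factors with per-cell features) can be pictured with a tiny sketch: instead of coupling two noisy cells, the observed values of the other tuples are treated as fixed evidence and folded into a unary feature for each noisy cell. The weight and the data are made up; this only illustrates why the relaxed model factorizes over independent variables.

```python
import math

# Relaxed view: each noisy cell gets its own independent distribution whose
# "constraint feature" counts how often a candidate agrees with the observed
# values of the other tuples sharing the same city (treated as fixed evidence).

def softmax(scores):
    m = max(scores.values())
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    z = sum(exps.values())
    return {k: round(v / z, 3) for k, v in exps.items()}

candidates = ["IL", "XX"]
observed_other_states = ["IL", "IL", "IL"]  # evidence from tuples sharing the city
w_constraint = 1.2  # learned weight of the relaxed denial-constraint feature

scores = {
    c: w_constraint * sum(c == o for o in observed_other_states)
    for c in candidates
}
print(softmax(scores))  # the noisy cell independently favors 'IL'
```

Because each noisy cell's distribution depends only on fixed evidence, Gibbs sampling over the relaxed model has no coupling between samples, which is the intuition behind the polynomial mixing guarantee the review cites.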
“HoloClean: Holistic Data Repairs with Probabilistic Inference” by Rekatsinas et al. presents a new multi-pronged approach for repairing erroneous data; they combine integrity constraints, external data, and quantitative statistics. Prior work focused on only one individual technique. |
This paper proposes HoloClean, a semi-automated data repairing framework that relies on statistical learning and inference to repair errors in structured data. HoloClean is built upon a weak supervision paradigm and leverages diverse signals to repair erroneous data, including user-defined heuristic rules (such as generalized data integrity constraints) and external dictionaries. The key contribution of HoloClean is building a holistic data cleaning framework that combines a variety of heterogeneous signals, such as integrity constraints, external knowledge, and quantitative statistics, in a unified framework. It is the first data cleaning framework driven by probabilistic inference. Users only need to provide a dataset to be cleaned and describe high-level domain-specific signals. Besides, it can scale to large real-world dirty datasets and perform automatic repairs that are two times more accurate than state-of-the-art methods. I really like this paper. I think the idea of treating dirty data as observed variables and the corresponding true values as latent variables is really novel. Also, building a probabilistic model over the dataset seems interesting and promising, and the empirical results are satisfactory. Technically, the paper introduces several ways to scale the inference process, including pruning the domain of random variables, tuple partitioning before grounding, and a relaxation under which Gibbs sampling needs only O(n log n) samples to mix (a toy Gibbs sampler follows this review). These are practical ways to lower the computational complexity. The shortcoming of the paper is that I don't see an analysis of how the size of the dataset or the distribution of cell values influences the performance of HoloClean. Intuitively, HoloClean will not perform that well on a small dataset, where the true values may not even appear anywhere in the dataset. |
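To make the Gibbs sampling step less abstract, here is a minimal sampler over two categorical cell variables coupled by a soft agreement factor. All potentials, weights, and iteration counts are made up; this only illustrates the mechanic of sampling each variable from its conditional and estimating marginals, not HoloClean's inference engine or its O(n log n) mixing guarantee.

```python
import math
import random

random.seed(0)

# Two categorical cell variables with per-variable evidence (unary log-potentials)
# and one soft "agreement" factor between them (the relaxed constraint).
domain = ["IL", "XX"]
unary = {
    "x1": {"IL": 1.0, "XX": 0.0},
    "x2": {"IL": 0.2, "XX": 0.4},
}
w_agree = 1.5  # log-potential rewarding agreement between the two cells

def sample_conditional(var, other_value):
    """Sample var from its conditional given the other variable's current value."""
    scores = {v: unary[var][v] + (w_agree if v == other_value else 0.0) for v in domain}
    m = max(scores.values())
    weights = [math.exp(scores[v] - m) for v in domain]
    return random.choices(domain, weights=weights)[0]

state = {"x1": "XX", "x2": "XX"}
counts = {var: {v: 0 for v in domain} for var in state}
for it in range(5000):
    state["x1"] = sample_conditional("x1", state["x2"])
    state["x2"] = sample_conditional("x2", state["x1"])
    if it >= 500:  # discard burn-in samples
        for var, val in state.items():
            counts[var][val] += 1

marginals = {
    var: {v: round(c / sum(cs.values()), 2) for v, c in cs.items()}
    for var, cs in counts.items()
}
print(marginals)  # both variables lean toward 'IL' because of the agreement factor
```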
In this paper, the authors propose a novel data repairing approach called HoloClean. HoloClean is a framework for holistic data repairing driven by probabilistic inference. It unifies existing qualitative data repairing approaches, which rely on integrity constraints or external data sources, with quantitative data repairing methods, which leverage the statistical properties of the input data. Designing new data cleaning and repairing techniques is definitely an important issue in the data integration field. Nowadays, people have more data to process and store in DBMSs, and mistakes inevitably happen when humans are involved in operating them. It is important to find this inconsistent data and fix it automatically, so it is necessary to develop new algorithms for this job. This is why HoloClean is introduced. Given an inconsistent dataset as input, HoloClean automatically generates a probabilistic program that performs data repairing. Next, I will summarize the crux of HoloClean's design as I understand it. HoloClean is inspired by recent theoretical advances in probabilistic inference, and the authors introduce a series of optimizations which ensure that inference over HoloClean's probabilistic model scales to instances with millions of tuples. Generally, data cleaning can be divided into two parts: error detection and data repairing. The authors first discuss some previous work on data cleaning. Compared to previous work, HoloClean is the first data cleaning system that unifies integrity constraints, external data, and quantitative statistics to repair errors in structured datasets. Instead of relying on each signal alone to perform data repairing, they utilize all of the signals. HoloClean automatically generates a probabilistic model whose random variables capture the uncertainty over records in the input dataset. Signals are converted to features of the graphical model and used to describe the distribution characterizing the input dataset. To repair errors, HoloClean uses statistical learning and probabilistic inference over the generated model. As a result, their experiments show that HoloClean scales to instances with millions of tuples and finds data repairs with an average precision of ∼90% and an average recall above ∼76% across a diverse array of datasets exhibiting different types of errors. This yields an average F1 improvement of more than 2x against state-of-the-art methods, which is very impressive! Generally speaking, this is a nice paper with great insight; the main technical contribution is the introduction of HoloClean, which performs data repairing in a more comprehensive and reliable way. The other contributions include a compiler that automatically generates a probabilistic model for repairing a dataset, an algorithm that uses Bayesian analysis to prune the domain of the random variables corresponding to noisy cells in the input dataset, and an approximation scheme that relaxes hard integrity constraints to priors over the random variables in HoloClean's probabilistic model. I think there are several advantages to this paper. First of all, it is the first work that uses all available signals to suggest data repairs, including integrity constraints, external data, and statistical profiles. Also, another big advantage of this method is that it greatly reduces the requirement for a domain expert; no manual repair work is required when using HoloClean. 
Besides, I think this paper is well organized; they give a clear description of the system design of HoloClean and rich examples, which make it easy to follow and understand. From my point of view, the downsides of this paper are minor; they do a good job on HoloClean and I do not find any major disadvantages. One problem I find with HoloClean is efficiency: as their experiments show, the runtime of HoloClean is much larger than that of the other baselines. This is not ideal for online error detection and correction; HoloClean is only suitable for offline batch cleaning. Another problem is that HoloClean may not easily extend to handle different kinds of workloads; for example, it is hard to extend HoloClean to error detection over streaming data due to the constraints of the model itself. I think this area can be optimized in future work. |
This paper introduces HoloClean, which is a framework for data repair that leverages an array of existing data repair approaches along with statistical strategies to achieve much better results than the then-current state of the art. The paper starts by explaining how current strategies such as using just integrity constraints, qualitative/human-supplied knowledge, or even statistical analysis are not sufficient for an automatic data repair service; HoloClean instead unifies all of these approaches in one framework. It refers to each of these approaches as "signals" and uses probability theory, particularly pruning via Bayesian analysis, to combine the "evidence" from these signals and deal with the inconsistencies each approach might have with the others. An extensive experimental section is given using datasets covering hospital, flight, food, and physician information, and HoloClean is shown to provide much higher precision than the cleaners it was tested against. The paper's contribution is obviously HoloClean, which achieves great experimental results. In terms of the paper's strengths, I think its biggest strength is the way that it introduced its problem and concepts through a very concrete example; this made reading the paper much easier and the flow was really good as a result. One possible weakness is that some of the "technical challenges" were stated very matter-of-factly without a lot of explanation; I ended up just accepting that these were true but was not entirely confident about the background for some of them, especially the advanced statistics-related material. |