Modern data warehouses contain enough data that computing exact results to simple join queries can take several minutes or more. For example, a user may want to find out what the total sales were in a given period of time for a particular region, which might require a series of foreign key joins to determine, followed by an aggregation; if the data is large enough, this operation could be time consuming to run to completion. Sampling allows a database to quickly return approximate results for a query using only a subset of all relevant data, but sampling is difficult for join queries. This is because each sequential join reduces the expected size of the result exponentially, and joining on pre-sampled data can bias the results.
The authors present a scheme to produce unbiased join synopses via sampling, leading to efficient approximate query answering for queries on the result of a series of foreign key equijoins. The authors’ system pre-computes the result of each such join, then stores a uniform random subset of the result, taking advantage of a tree representation of sequential foreign key joins to reduce the amount of pre-computation needed. Each pre-computed join sample can be updated efficiently as new tuples are added to its input tables, by checking if the tuple would be included in the full join result, and if so, adding the new result tuple to the summary with some probability.
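A minimal sketch of this style of incremental maintenance, here using classic reservoir sampling as a stand-in for the paper's exact scheme (the function and variable names are hypothetical):

```python
import random

def maintain_synopsis(synopsis, max_size, seen_count, new_join_tuple):
    """Reservoir-style update: keep a uniform random sample of at most
    max_size tuples from the stream of join-result tuples.
    Returns the updated (synopsis, seen_count)."""
    seen_count += 1
    if len(synopsis) < max_size:
        synopsis.append(new_join_tuple)      # still room: always keep
    else:
        j = random.randrange(seen_count)     # keep with prob max_size/seen_count
        if j < max_size:
            synopsis[j] = new_join_tuple     # evict a random existing tuple
    return synopsis, seen_count
```

Each new join-result tuple is inserted with a probability that shrinks as more tuples are seen, so the synopsis stays a uniform sample without ever rescanning the full join result.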
The main contributions of the work are a method for efficiently pre-computing unbiased sample results from a series of foreign key joins, along with a proof that this method is unbiased, and an implementation in the AQUA database system. AQUA (approximate query answering) sits on top of Oracle and allows users to request estimated answers to queries from pre-computed samples, via a graphical user interface.
One limitation of this work is that the authors only support sequential foreign key equijoins. Moreover, the work assumes that there will be enough space to store join synopses for every possible foreign key equijoin, which may not be reasonable in an environment with many tables. In such a case, it would be necessary to have some intelligent method for selecting which join synopses to produce and store.
This paper focuses on join synopses for approximate query answering in large data warehousing environments. The authors discuss the difficulties with the traditional approach and propose new techniques for computing and maintaining join synopses. Traditional query processing gives an exact result, but often a close answer does the job, and schemes that provide approximate answers from base relations alone suffer from serious disadvantages. Pre-computing samples of a small set of distinguished joins makes query answering much faster, which motivated the authors to develop a system that mitigates the slowness of the existing approach.
The paper describes the Aqua system, whose goal is to improve response times by avoiding accesses to the original data. Aqua maintains small synopses of the data, such as samples and histograms; its components include statistics collection, query rewriting, and maintenance. Joins are problematic because joining uniform base samples yields a non-uniform result sample, and joins over samples produce small result sizes. Join synopses provide a practical and effective solution for producing good-quality approximate join aggregates: samples of join results are pre-computed, making quality answers possible even for complex joins. The authors show that by computing samples of the results of a small set of distinguished joins, one can obtain random samples of all possible joins in the schema; samples of these distinguished joins are called join synopses. They also present optimal strategies for allocating the available space among the various join synopses when certain properties of the query workload are known, and identify heuristics for the common case when such properties are not known.
The paper is successful in providing a comprehensive study of join synopses. The experimental results show that the technique performs almost two orders of magnitude faster than traditional methods. In today's world of data warehousing, such a technique is important for providing high performance in distributed systems. It is also observed that join synopses can be maintained efficiently during updates to the underlying data.
However, the paper doesn't address accurately approximating answers to operations such as group-by, rank, and set-valued queries. It covers only a subset of aggregate queries, without explaining why or how the problems with other query types could be solved. The mathematical treatment of space allocation also seemed incomplete.
In large data warehousing environments, it is often advantageous to provide fast, approximate answers to complex aggregate queries based on statistical summaries of the full data.
However, providing good approximate answers for join queries using only statistics (in particular, samples) from the base relations is difficult. In this paper, the authors propose join synopses as an effective partial solution to this problem.
This paper makes two contributions. The first, which I think is the most critical, is identifying the problem of approximate join queries. The second is connecting each foreign key join to its source table: with this observation, a uniform sample of the join can be obtained by sampling the source table and joining the sample with the remaining tables. Moreover, the notion of a maximum join enables join synopses to answer other join queries by applying a projection on the maximum join. The maintenance method is quite straightforward; the paper presents efficient algorithms for incrementally maintaining these samples.
The evaluation feels pointless given the theoretical result. The authors spend a large portion of the paper discussing different error bounds and subsampling methods to make the error even smaller. This is quite subtle, since those methods differ only by a constant factor as the sample size increases.
This paper presents a new method for approximating query answers called join synopses, an effective solution for schemas that involve only foreign-key joins. In short, join synopses are samples of the results of a small set of distinguished joins, from which we can obtain random samples of all possible joins in the schema. Join synopses provide better performance than schemes based on base samples for computing approximate join aggregates, can be maintained efficiently during updates, and can provide empirical confidence bounds. The paper first discusses the problem with joins and gives an overview of the Aqua system, then provides details of join synopses, how the heuristic allocation strategies work, and a novel empirical technique for computing confidence bounds. Finally, it presents a performance evaluation, related work, and future work.
The problem here is that we need fast, approximate answers to complex aggregate queries in large data warehousing environments, unlike traditional query processing, which provides exact answers. There are many situations where fast, approximate answers suffice, such as when the base data is unavailable, when a query requests numerical answers, or when the query optimizer needs a fast estimate of plan cost. Furthermore, there are several problems with joins. First, the join of two uniform random base samples is not a uniform random sample of the output of the join. Second, small join result sizes can lead to inaccurate answers with poor confidence bounds.
The major contribution of the paper is that it provides a good solution for approximate answers to join queries, with better performance than schemes based on base samples. The authors also provide optimal strategies for allocating the available space among the various join synopses for both known and unknown workloads, and present efficient algorithms to maintain join synopses during updates to the base relations.
One interesting observation: this paper is very innovative, and the idea of join synopses is very interesting. It also provides a detailed performance evaluation on the well-known TPC-D benchmark, showing that join synopses perform well. One possible weakness is that the paper only discusses how join synopses work for foreign-key joins; it might be better to also discuss why they would not work for other joins.
This paper covers a problem in DBMS warehouse environments. Most of the time, users are not looking for exact answers. Rather, they are looking for patterns or trends, or just poking around to see what they should investigate further. For these cases, it is not necessary to provide answers to a high number of decimal places, or provide an exact COUNT() or SUM() - instead, it is better to provide a "close enough" answer, more quickly. Another interesting area is actually query optimization - you don't need to know how expensive a plan is exactly, just how expensive it is compared to other plans.
This paper covers a specific problem in approximate query answering: using join synopses (join statistics; in this paper, samples). The authors argue that a bit of pre-computation, just one join synopsis per relation, can dramatically improve the quality of approximate answers for arbitrary queries with foreign key joins. Incoming queries are then rewritten to use the join synopses. The join synopses can also be maintained fairly easily in the presence of updates: a new tuple is added to the join synopsis with some probability p, and then inserted into any other join synopsis it needs to appear in for multi-way joins. Because we only update with some probability p, the cost of maintaining the join synopses is dramatically reduced. Of course, updates could always be applied in batch (offline), removing this problem entirely.
One good thing about this paper is that it is very focused. It is concerned only with joins for approximate queries, unlike other papers that attempt to cover far too much information in far too few pages. This means the authors actually have enough space to fully cover their ideas.
However, this paper is very dense and tough to read. The complex mathematical notation may be a bit beyond the average reader and doesn't really add much to the paper as a whole; some things expressed in notation could have been better explained in English or pseudocode.
Problem and solution:
The problem posed in the paper is that, unlike traditional query processing, which provides exact answers, large data warehouses often need only approximate but fast answers to complex queries, because computing the exact answer takes too long. Accuracy can be sacrificed for faster response and higher throughput by approximating answers from statistics. However, it is difficult to provide good approximations for join queries using samples of the base data alone; the quality of the approximation is poor once more than two relations are involved. The proposed solution is join synopses, which target approximate queries in data warehousing environments, where tables are connected via foreign-key relationships. Join synopses are precomputed samples of a small set of distinguished joins and can improve answer accuracy.
The main contribution of join synopses is showing how to obtain high-accuracy approximations for foreign-key join queries using only a few join synopses. The system performs an optimal allocation of space according to base statistics to minimize the approximation error. It also provides confidence bounds on the approximate answers as feedback, so the user can judge their reliability; the confidence bounds are computed by extracting sub-samples from the samples. Together, these procedures provide high-quality estimates of foreign-key join aggregates.
The paper is quite good, since it solves a specific, difficult problem, but I think there are still some weaknesses. One is that the confidence is calculated from sub-samples extracted from the samples; I wonder whether the result is accurate enough, or whether it gives users misleading feedback, since this way of computing confidence ignores a lot of data.
In approximate query answering, joins computed from random base samples alone can produce inaccurate answers with degraded confidence bounds. This paper introduces join synopses to solve this problem for foreign key joins. A join synopsis is the output of the foreign key joins between a random sample of a source relation and its descendant relations. By the paper's theorem, a join synopsis is a uniform random sample of the output of the k-way foreign key join, so to approximately answer a foreign key join query we simply access a join synopsis. For real workloads, several heuristic space-allocation strategies are introduced, such as EqJoin, CubeJoin, and PropJoin; these strategies try to minimize the error bound given limited total disk space for storing samples. In addition, join synopses can be maintained incrementally, which reduces maintenance overhead under workloads with updates.
Join synopses support arbitrary foreign key joins: to obtain a foreign key join result, one simply queries the samples and projects out the unnecessary attributes. This is very fast, since the large original data need not be touched.
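As a toy illustration of this construction (the relations and names here are invented for the example): sample the source relation, then join each sampled tuple with its descendant relation. Because each foreign key matches exactly one descendant tuple, the result is a uniform sample of the full join.

```python
import random

# Toy schema: each order references exactly one customer via a foreign key.
customers = {cid: {"cid": cid, "region": "R%d" % (cid % 3)} for cid in range(5)}
orders = [{"oid": i, "cid": i % 5, "amount": i} for i in range(100)]

def make_join_synopsis(k):
    """Uniformly sample the source relation (orders), then join each
    sampled tuple with its unique matching customer tuple."""
    sample = random.sample(orders, k)
    return [{**o, **customers[o["cid"]]} for o in sample]

random.seed(7)
synopsis = make_join_synopsis(20)
```

Any query over the orders-customers join can now be answered approximately from `synopsis` alone, projecting out attributes it does not need.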
1. Join synopses provide no benefit for queries with non-foreign-key joins, yet still incur storage overhead. For arbitrary joins, foreign key constraints are usually irrelevant, and techniques using random samples of each relation are more applicable.
2. Though the paper proposes three simple space allocation techniques, EqJoin, CubeJoin, and PropJoin, the discussion of them is relatively brief, and I believe a more effective and sophisticated allocation approach is possible.
Approximate query processing has traditionally involved sampling base relations and running an approximate query on the samples. However, this technique produces sub-par results for queries with joins. Joining sampled relations does not yield a uniform sample of the joined table, because not every tuple in the joined table has the same probability of making it into the sampled result. Also, joins on samples often produce very small results, because both a tuple A and the tuple referenced by A's foreign key must be chosen in the sample, which is unlikely.
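A quick back-of-the-envelope sketch of the small-result problem (toy data, hypothetical names): if each relation is sampled independently at 10%, a matching pair survives only when both sides happen to be sampled, i.e. about 1% of the time.

```python
import random

def join_of_samples(n, rate, seed):
    """Join two independent Bernoulli samples of relations
    A = B = {0..n-1}, joined on equality (a 1:1 foreign key).
    A matching pair survives only if *both* sides sampled it."""
    rng = random.Random(seed)
    a = {i for i in range(n) if rng.random() < rate}
    b = {i for i in range(n) if rng.random() < rate}
    return a & b

result = join_of_samples(10_000, 0.10, seed=42)
# Expected size is n * rate**2 = 100, far below the 1,000-tuple
# sample one might naively hope to get from a 10% sampling rate.
```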
To fix this problem, this paper introduces the technique of join synopses. The intuition is that, for relations A, B, C, where A has a foreign key referring to B and B has a foreign key referring to C, every tuple in A maps onto exactly one tuple in the result of A join B join C. Thus, a uniform sample of A joined with B and C results in a uniform sample of A join B join C. This uniform sample can be used to accurately answer approximate queries.
The authors propose to keep a join synopsis for the maximum foreign key join of each relation. The maximum foreign key join of a relation A is the uniform sample of R1 join R2…join Rn where R1 = A, Ri has a foreign key referring to Ri + 1, and n is maximized. To decide how many tuples to sample for each relation, storage space can be allocated to each synopsis equally, in proportion to the tuple size, or in proportion to the cube root of the tuple size.
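The three allocation strategies can be sketched roughly as follows (a loose interpretation; the paper's exact formulas may differ, and the function name is mine):

```python
def allocate_space(total_bytes, tuple_sizes, strategy):
    """Split total_bytes among synopses whose per-tuple sizes are given.
    EqJoin: equal shares; PropJoin: proportional to tuple size;
    CubeJoin: proportional to the cube root of tuple size."""
    if strategy == "EqJoin":
        weights = [1.0 for _ in tuple_sizes]
    elif strategy == "PropJoin":
        weights = [float(s) for s in tuple_sizes]
    elif strategy == "CubeJoin":
        weights = [s ** (1.0 / 3.0) for s in tuple_sizes]
    else:
        raise ValueError("unknown strategy: %s" % strategy)
    total = sum(weights)
    return [total_bytes * w / total for w in weights]
```

CubeJoin sits between the two extremes: synopses with wide tuples get more space than under EqJoin, but far less than under PropJoin, damping the influence of a few very wide joins.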
The paper presents an intuitive and easy-to-understand solution to the problem of joins in sampling methods in approximate query processing. The method is not space intensive.
Is it possible to use a stratified sampling approach to join synopses?
What happens when a table has two different max foreign key joins?
This paper tackles the problem of approximate databases in the presence of multi-way joins. More specifically, the paper focuses on multi-way joins that only use foreign keys, which is the case in data warehousing environments. It is useful to be able to sample the relations in a database and provide approximate answers to aggregation queries. However, when joins are involved, the samples taken cannot simply be joined together and be expected to represent a sample of the join. This paper discusses a new idea called join synopses and describes a database system called Aqua that leverages this idea to improve performance and query accuracy. In a foreign key join setup, the join synopsis is created by taking a sample of the relation and joining it with its descendants, as defined by the paper.
This join synopsis is very useful in that it is a much smaller set of data that can be used to obtain random samples of any join that involves the relation. One question is how to maintain a join synopsis in the presence of inserts and deletes. The paper states that for deletes, the synopsis is only affected if the deleted row is part of the synopsis; in that case, the synopsis is only recomputed when it becomes too small. For inserts, a tuple is added to the sample of the relation with a certain probability. If it is added, then the join of the tuple with the descendant relations is computed and added to the join synopsis.
This paper takes on an important issue in approximate computing: how to provide representative samples in the face of joins. However, although the paper does present a solution, it seems very specific to data warehousing environments. It makes many assumptions, such as that the joins are only foreign key joins and that there are no cyclic foreign key dependencies. The acyclicity requirement is especially unfortunate since the solution has already scoped the problem down to foreign key joins, so I think it should at least work for cyclic dependencies. Furthermore, it seems that Aqua only supports basic, flat queries. The paper does mention that Aqua will be extended to support more queries, but I think work should also be done on supporting different join types.
In a data warehouse environment, approximate but fast query answering is often more desirable than exact but slow query answering. This paper focuses on approximate join aggregates, especially with foreign-key joins. The proposed approach utilizes a very small number of join synopses (precomputed samples of a small set of distinguished joins). The objective is to minimize the overall error in the approximate answers computed using the join synopses while improving response time. This work is part of the development of the Aqua system.
Traditional approaches provide poor approximate query answering results because they usually take uniform random samples of each base relation in the database. A naive fix would be to execute all possible join queries of interest and collect samples of their results; however, this is obviously infeasible. The approach proposed in this paper obtains random samples of all possible joins in the schema from samples of the results of a small set of distinguished joins, improving the quality of approximate join aggregates.
The contributions of this paper include:
1. It proposes "join synopses" to provide better sampling of possible joins (better uniformity) and thus improves the quality of approximate query answering.
2. Experiment results are provided to show the accuracy of their approximate answers and address the update problem on the join synopses.
Motivation for Join Synopses for Approximate Query Answering:
Exact answers are often not required, because users may seek quick results that are a close approximation. Approximate answers also give tentative answers to a query when the base data is unavailable. When a query requests numerical answers, the full precision of the exact answer is often not needed, and fast approximate answers can also be used within the query optimizer to estimate plan costs. Schemes that provide approximate join aggregates relying only on random samples of base relations are problematic, because the join of two uniform random base samples is in general a non-uniform random sample of the output of the join, which degrades the accuracy of the answer and the confidence bounds. Also, the join of two random samples usually has a small result, and queries with small result sizes hurt the accuracy of the answers and the confidence bounds. This paper suggests using join synopses, which precompute samples of a small set of distinguished joins, to compute approximate join aggregates and greatly improve the accuracy of approximate answers for arbitrary queries with foreign key joins. The paper also discusses strategies for allocating space for the join synopses when the query workload is known, and heuristics for when it is not. Join synopses can also be maintained while updates occur on the base relations.
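As a concrete (hypothetical) illustration of how an aggregate is answered from a uniform sample: a SUM estimate is simply the sample sum scaled up by the ratio of full size to sample size.

```python
def approx_sum(sample_values, full_size):
    """Scale the sample SUM by full_size / len(sample) to estimate the
    SUM over the full join result (unbiased for a uniform sample)."""
    return sum(sample_values) * full_size / len(sample_values)

# A 3-tuple sample drawn from a 30-tuple join result:
estimate = approx_sum([10, 20, 30], full_size=30)  # -> 600.0
```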
Details on Join Synopses:
For queries with foreign-key joins, the join synopses solution can provide good-quality approximate join aggregates using a small number of join synopses. The optimal strategy for allocating space among the join synopses, when properties of the query workload are known, is to derive error bounds using Hoeffding and Chebyshev bounds and then design an allocation that minimizes the error. When properties of the workload are not known, as is commonly the case, the allocation can be determined using heuristic strategies: EqJoin divides the available space equally among the relations, CubeJoin divides the space among the relations in proportion to the cube root of their join synopsis tuple sizes, and PropJoin divides the space in proportion to their join synopsis tuple sizes. In the experimental results, the join synopses perform better than the base synopses, and the join synopses can very quickly compute approximate join aggregates. The chunking estimation approach, which subdivides the sampled tuples in a join synopsis into k chunks and reports an estimate and a bound for each chunk, is used to improve the confidence bounds of the results. Join synopses can be maintained even in the presence of updates on the underlying base relations.
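A minimal sketch of the chunking idea described above (my own function name; the paper's exact estimator may differ): split the sampled tuples into k chunks, estimate the aggregate from each chunk, then use the spread of the per-chunk estimates as an empirical error indicator.

```python
import statistics

def chunked_sum_estimate(sample_values, full_size, k):
    """Split the sample into k equal chunks, scale each chunk's sum up to
    the full data size, then report the mean per-chunk estimate and the
    standard deviation of the estimates as an empirical spread."""
    per_chunk = len(sample_values) // k
    estimates = []
    for i in range(k):
        chunk = sample_values[i * per_chunk:(i + 1) * per_chunk]
        estimates.append(sum(chunk) * full_size / len(chunk))
    return statistics.mean(estimates), statistics.stdev(estimates)
```

The mean of the per-chunk estimates matches the single-sample estimate, while the observed variability across chunks gives a data-driven bound rather than a purely analytical one.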
The research in the paper contributed to the development of Aqua, an efficient decision support system based on approximate query answering. Aqua provides probabilistic error/confidence bounds on the answer, based on Hoeffding and Chebyshev formulas. Aqua maintains synopses, which are smaller statistical summaries of the warehouse, and uses them to answer queries instead of accessing the original data, resulting in improved response times. One key component of Aqua is statistics collection, which is responsible for gathering all the synopses that Aqua uses to answer queries posed by the user. Rewriting the queries posed by the user to run against the synopses is what allows Aqua to achieve response-time speed-ups: the rewriting module parses the input SQL query and generates an appropriately translated query. The maintenance component keeps the synopses up to date as updates occur on the underlying data.
Strengths of the paper:
I felt the paper was a very technically thorough discussion of how join synopses are implemented, providing the algorithms and theorems contributing to their development. Precomputing samples of a small set of distinguished joins to compute approximate join aggregates, and thereby greatly improving the accuracy of approximate answers for arbitrary queries with foreign key joins, is a clever approach to the problem, and it surprised me how much there was to consider when breaking a join into small join synopses.
Weaknesses of the paper:
When describing chunking, I would have liked to see the paper go into more detail about how chunking minimizes error. The argument for improved accuracy would have been strengthened if the paper had discussed alternatives to chunking and presented experimental results on the quantitative effect of these methods on result accuracy. I also felt the algorithm for maintaining the join synopses was hard to follow; a figure depicting the algorithm would have made it easier to understand.
This paper, titled "Join Synopses for Approximate Query Answering," solves a problem with join queries in large-scale data warehousing environments where approximate computing is appropriate. The research presented in this paper was developed as part of the Aqua system, which attempts to avoid ever touching the original data by maintaining a collection of statistics about the data contained in the database. The problem they solve is that table joins cannot accurately use the uniform sample information typically stored for individual tables: the size of the result set and the non-uniformity of the joined set's representation in the individual samples result in poor answers and confidence bounds. The authors discuss their join synopses solution, their method for choosing which joins to sample, the sizes of these samples, their maintenance, and their performance on the TPC-D benchmark data set.
This paper has a few strengths. It motivates a well-defined problem with join queries in approximate computing. The idea that you can compute join synopses that provide statistical summary information for any join query by sampling a smaller set of joins is interesting and useful. I also appreciated that they have a separate paper containing the more theoretical work relevant to their findings. They evaluate their findings on the TPC-D benchmark, which is highly appropriate because it contains large queries that join many tables; these are decision-support queries, exactly the kind of questions approximate computing in data warehousing is meant to answer. They show that approximate join aggregates can be computed very quickly on this data set.
There is a key drawback of the paper, and it stems from the sampling method: uniform sampling on distinguished joins will not work well for group-by operations. Such queries do exist in the TPC-D benchmark, and grouping changes the resulting distribution by combining output rows with similar values in a particular column. The paper does not address these types of queries, and even if they are a minority of TPC-D queries, there is no reason to think they would be rare in another decision support application. An analysis of query types for decision support workloads, or more importantly, an analysis of group-by accuracies, would have strengthened the paper.
Part 1: Overview
This paper presents a way to tackle the hard problem of providing good approximate answers to join queries with only statistics from the base relations. The authors propose join synopses, which require precomputing only one join synopsis for each relation, and show that this significantly improves the quality of approximate answers for arbitrary queries with foreign key joins. They explore heuristics for situations where there is no workload information, and provide an algorithm for incrementally maintaining join synopses under updates to the base relations. They show in the paper, both theoretically and empirically, that schemes for providing approximate join aggregates that depend only on base samples face serious problems, and propose instead using precomputed samples of a small set of distinguished joins.
Confidence bounds are crucial for approximate query answering, and the authors provide empirical ways of finding them. The Aqua system is designed to improve response time by avoiding accesses to the original data; it does three major things: statistics collection, query rewriting, and maintenance of the synopses. For joins, because the result sample is non-uniform and the result of a join over samples is typically small, it is in general impossible to produce good-quality approximate answers using base samples alone. Adding pre-computed join synopses makes the problem much less severe. EqJoin, CubeJoin, and PropJoin are proposed as heuristics for adaptively allocating space among the various join synopses.
Part 2: Contributions
As big data and data analysis thrive in industry, answering queries quickly is becoming a crucial part of online database engines. Sometimes the user only cares about approximate answers, as there may be too many exact answers to be useful; in such cases, quickly providing good approximate answers is becoming more and more important. This paper shows the complexity of this problem and also demonstrates that even a small amount of pre-computation helps a great deal in finding good approximate answers, which can serve as a guideline for other work on this topic.
Part 3: Drawbacks
They used an empirical method to find the confidence bound for the approximation, which is not as strong as their proof of the hardness of finding approximate answers to queries.
They only ran simulations on the TPC-D workload; no real-world data was used. Combined with the empirical methods, this makes it hard to convince readers of the effectiveness of the pre-computed join synopses.
The paper discusses a technique called join synopses as an effective solution for providing fast, approximate answers to complex aggregate queries. The problem is that traditional query processing focuses on providing exact answers to queries; however, in an environment involving large amounts of data, such as warehousing, providing an exact answer to a complex query takes a long time. In addition, a number of scenarios do not require an exact answer. For example, a query with numerical answers might not need a full-precision result. This paper provides estimated responses in much less time than it takes to determine the exact answer.
The authors employ a technique that uses precomputed samples of a small set of distinguished joins (join synopses) to compute approximate join aggregates. Only a very small number of join synopses is needed to provide good-quality approximate join aggregates. The authors design heuristic allocation strategies that optimally allocate a limited amount of space among the join synopses. In addition, the paper provides techniques for computing bounds on the error of the query answers.
The main strength of the paper is that the authors address an important kind of problem faced by traditional query processing. As the amount of data on the internet and in companies increases exponentially, it is becoming very difficult to keep up with the fast response times users require. As a result, approximate answers are important for some kinds of applications.
Although I like the fact that approximate querying is necessary for dealing with today's large data sets, approximate results might not be acceptable for some applications. Furthermore, different techniques, including in-memory databases, are poised to provide faster exact query processing. With these platforms, exact query processing is becoming acceptable even for interactive queries. Consequently, it is important to evaluate different platforms for exact query processing and reevaluate whether approximate query processing is still relevant.
In this paper, join synopses are presented as an effective solution for providing good approximations to join queries. Precomputing one join synopsis for each relation suffices to significantly improve the quality of approximate answers for arbitrary queries with foreign key joins. The TPC-D benchmark database is used to evaluate the effectiveness of join synopses and the other techniques proposed in this paper.
Providing fast, approximate answers to complex aggregate queries based on statistical summaries of the full data is desirable, but providing good approximations for join queries is difficult. There are a number of scenarios where exact answers over large amounts of data are not required. It is shown both theoretically and empirically that schemes for providing approximate join aggregates that rely on random samples of the base relations alone suffer from serious disadvantages, such as non-uniform result samples and small join result sizes. The paper argues that such schemes cannot provide approximations of good quality.
The design uses precomputed samples of a small set of distinguished joins to compute approximate join aggregates. Unlike other related work on this topic, this paper provides a strong theorem to prove its usability, which is better than previous approaches. With this solid argument, the authors prove that it is possible to create compact join synopses of a schema with foreign key joins such that a random sample of any join in the schema can be obtained. Furthermore, based on the theorem, they provide an optimal strategy for allocating the available space among the various join synopses when certain properties of the query workload are known, and they identify heuristics for the common case.
Finally, the experimental evaluation shows the validity of the techniques in this paper: the method can be used to compute approximate join aggregates quickly and can be maintained inexpensively during updates.
This paper only addresses approximation for join queries. However, as noted at the end of the paper, many other kinds of queries remain undiscussed, such as group-by, rank, and set-valued queries, which should be addressed in future work.
This paper describes a method for computing joins approximately. The method is implemented as an augmentation of the original Approximate Query Answering (AQUA) system. AQUA has three main components: statistics collection, query rewriting, and synopsis maintenance. |
Uniform sampling has two main problems when used for computing joins. The first is the non-uniform result sample: correlations between rows of different tables mean that the join of base samples is not a uniform sample of the join result. The second problem is the small join result size. This paper solves these two problems using join synopses. A join synopsis, in my understanding, is a set of rows resulting from joining a uniform sample of a source table with the full contents of the tables it references. The authors prove that it is possible to provide good results using these join synopses.
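The construction described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the table and column names (`lineitem`, `orders`, `o_id`, `price`) are hypothetical, and the key point is that sampling the source relation first, then joining each sampled tuple against the full referenced table, yields a uniform sample of the join.

```python
import random

# Toy relations (hypothetical names): each lineitem row references one
# orders row via o_id, as in a foreign key join.
orders   = [{"o_id": i, "region": "EU" if i % 2 else "US"} for i in range(100)]
lineitem = [{"l_id": j, "o_id": j % 100, "price": float(j)} for j in range(1000)]

def join_synopsis(source, referenced, fk, pk, sample_size, seed=0):
    """Uniformly sample the source relation, then join each sampled tuple
    with the FULL referenced relation. Because a foreign key gives each
    source tuple exactly one join partner, the result is a uniform random
    sample of the full join."""
    random.seed(seed)
    index = {row[pk]: row for row in referenced}   # lookup on the full table
    sample = random.sample(source, sample_size)
    return [{**s, **index[s[fk]]} for s in sample]

syn = join_synopsis(lineitem, orders, fk="o_id", pk="o_id", sample_size=50)
print(len(syn))  # 50 joined tuples, uniformly drawn from the 1000-tuple join
```

Contrast this with sampling both tables independently and joining the samples, which shrinks and biases the result.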
Another contribution of this paper is an algorithm for updating join synopses as the database is updated. The authors also prove this algorithm correct.
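The flavor of that update algorithm can be sketched with reservoir-style sampling: on each insert into the source relation, the new tuple's (unique) join result is admitted to the synopsis with a probability that keeps the synopsis uniform. This is a sketch of the general idea under my own simplifications; the paper's exact bookkeeping differs.

```python
import random

def maintain_on_insert(synopsis, join_tuple, total_join_size, cap, rng=random):
    """On an insert into the source relation, form the new join tuple and
    admit it to the synopsis with probability cap/total_join_size, evicting
    a random resident tuple. This reservoir-style sketch keeps the synopsis
    a uniform sample of the growing join result."""
    if len(synopsis) < cap:
        synopsis.append(join_tuple)
    elif rng.random() < cap / total_join_size:
        synopsis[rng.randrange(cap)] = join_tuple

random.seed(0)
reservoir = []
for n in range(1, 1001):                      # 1000 inserts into the join
    maintain_on_insert(reservoir, {"id": n}, total_join_size=n, cap=50)
print(len(reservoir))  # stays at the 50-tuple cap
```

The cheap part is that each insert requires at most one probabilistic decision and one eviction, never a recomputation of the full join.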
Written in 1999, this paper is among the first few papers on approximate join queries. The key contribution the authors make is showing that a small number of join synopses can provide good approximations for join aggregates.
This paper provides results from a benchmark, but I wonder what the performance of this system looks like on a real workload. I think the resource usage and response time under a real workload are important for evaluating such a system.
What I dislike about this paper is that it uses too many proofs instead of giving intuitive explanations, which makes it much harder to read.
This paper proposes join synopses as a solution for improving the quality of approximate answers for multi-way joins. A join synopsis is defined as a precomputed sample of a small set of distinguished joins. The paper proposes the Aqua system, which provides probabilistic error and confidence bounds based on the Hoeffding and Chebyshev formulas. These statistics are stored in the form of samples as well as histograms.|
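For intuition on the Hoeffding-style bounds mentioned above, the textbook Hoeffding inequality gives a confidence half-width for a sample mean of bounded values; Aqua's exact formulas may differ in detail, so this is only a sketch of the idea.

```python
import math

def hoeffding_half_width(n, value_range, delta=0.05):
    """Half-width of a (1 - delta) confidence interval for the mean of n
    sampled values spanning value_range = b - a (standard Hoeffding bound:
    (b - a) * sqrt(ln(2/delta) / (2n)))."""
    return value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))

# A mean estimated from 400 sampled tuples with values in [0, 100] is
# within about +/-6.79 of the true mean with 95% confidence.
print(round(hoeffding_half_width(400, 100.0), 2))
```

Note how the bound shrinks only as the square root of the sample size, which is why the small join result sizes discussed in the paper are so damaging to confidence bounds.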
Their system Aqua has three components: a statistics collector that gathers all the synopses used for queries, a query rewriter that rewrites the original queries to use the stored synopses, and a maintenance component that keeps the synopses up to date. The paper proposes different space allocation strategies (EqJoin, CubeJoin, and PropJoin) to reduce the maximum error bound in the absence of a query workload, resulting in higher-quality approximate answers.
One of the key advantages of this algorithm is that it performs look-ups on the base table with a very small probability. Another advantage is that the authors specify that these join synopses can be maintained with very little overhead since they do not update the samples for drastic changes too often.
One of the major disadvantages that I see is that storing these join synopses can strain the available space, or that all the required join synopses cannot be stored efficiently. Also, for queries that do not involve multi-way joins, join synopses might be too complicated and cumbersome to use.
However, considering this system would be specifically used in a data warehousing system that involves a lot of fact-dimension joins, this method provides a great improvement and accuracy over existing methods.
Paper review: Join Synopses for Approximate Query Answering|
In this paper, a method for approximate join query answering on large data sets is proposed. Historically, DBMSs have focused on exact answers for aggregations; as the number of records grows large, it takes a long time to get the result. In real-world use, for complex aggregate queries based on statistical summaries of the full data, it is often advantageous to provide fast, approximate answers. But using the trivial method of sampling each table, two problems arise:
1. Non-uniform result sample: In general, the join of two uniform random base samples is not a uniform random sample of the output of the join, since joined tuples do not all appear in the sampled join with equal probability.
2. Small join output size: The join of two random samples typically has very few tuples, even when the actual join selectivity is fairly high. This can lead to both inaccurate answers and very poor confidence bounds, since these critically depend on the query result size.
The novel idea of join synopses is put forward as an effective solution for producing approximate join aggregates of good quality. The guiding idea is that by computing samples of the results of a small set of distinguished joins, one can obtain random samples of all possible joins in the schema; these distinguished joins serve as the join synopses. The paper proves that the subgraph of G on the k nodes in any k-way foreign key join must be a connected subgraph with a single root node, and that there is a 1-1 correspondence between tuples in a relation r1 and tuples in any k-way foreign key join with source relation r1. Joining sampled tuples from any relation other than the source relation will not in general yield a uniform random sample of the join. Hence we need a distinct join synopsis for each node/relation.
To sum up, the Aqua system has the following advantages. First, based on sampling from materialized foreign key joins, less than 10% more space is needed to achieve uniform sampling for any foreign key join. Second, in terms of accuracy, there is a significant improvement over using just base table samples: on the TPC-D queries, the Aqua system achieved 1%-6% relative error, while other methods gave 25%-75% relative error. Lastly, the maintenance cost is low. However, there are still weaknesses in this paper. For instance, the types of aggregation supported by join synopses are limited; they apparently cannot accurately approximate answers to group-by, rank, and set-valued queries. Another shortcoming is that the paper does not provide enough proof or analysis for the formulas used to develop the space allocation. Last but not least, there is not enough performance evidence presented to inform the reader of how much the Aqua system outperforms older sampling methods.
This paper introduces an approach that uses join synopses to approximately answer join queries. It stores join synopses, which are samples of the k-way joins between each source relation and the relations it references. A theorem shows that a uniform random sample of the output of any multi-way foreign key join can be extracted from the join synopses. |
Based on the synopses, this paper develops both optimal and heuristic strategies for space allocation to achieve the least average error bound. An improved accuracy measure for join synopses is also introduced: it uses a sub-sampling (chunking) technique to compute empirical error bounds that improve on the traditional bounds. In addition, join synopses can be incrementally maintained, and the evaluation shows that the maintenance cost is low.
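The chunking idea above can be sketched as follows: split the sample into equal chunks, run the estimator on each chunk, and use the spread of the per-chunk estimates as an empirical error indicator. The estimator and data here are illustrative placeholders, not the paper's exact formulas.

```python
import statistics

def chunked_error_bound(sample, estimator, num_chunks=10):
    """Split the sample into equal chunks, apply the estimator to each,
    and return the mean estimate together with the standard deviation of
    the per-chunk estimates as an empirical spread (a sketch of the
    sub-sampling technique)."""
    k = len(sample) // num_chunks
    chunks = [sample[i * k:(i + 1) * k] for i in range(num_chunks)]
    estimates = [estimator(c) for c in chunks]
    return statistics.mean(estimates), statistics.stdev(estimates)

values = [float(v % 7) for v in range(1000)]
mean_est, spread = chunked_error_bound(values, lambda c: sum(c) / len(c))
print(round(mean_est, 3))  # close to the true mean of ~2.997
```

The attraction is that the spread reflects the data actually seen, so it is often much tighter than a worst-case Hoeffding or Chebyshev bound.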
1. Join synopses are a significant improvement over evaluating joins using only single-relation samples. They provide an efficient way to uniformly sample the output of joins and avoid the problems of non-uniform result samples and inaccurate answers.
2. Join synopses can be incrementally maintained, and the strategy is simple, which makes them applicable even under workloads with frequent updates. The “batch” mode can avoid a possible concurrency bottleneck.
Join synopses can be large. For each relation, a join synopsis stores not only the samples of that relation but also the tuples those samples reference in other relations, so it consumes more space than sampling a single relation. As a result, the sample size may be small if space is limited, and the accuracy of answers might degrade.
This paper introduces join synopses as a solution to providing approximate answers to join queries. The problem with join queries is that the result of joining two uniform random samples is not a uniform random sample of the join of the full tables. This paper addresses this problem with a system composed of three components: a statistics collection unit, a query rewriting unit, and a maintenance unit to update data synopses. The authors evaluate the system on an UltraSPARC-II machine and show that it performs several times better than an exact computation with small error and low overhead.|
The authors discovered that it is possible to compute random samples of all possible joins by maintaining samples of only a small subset of the joins. This dramatically reduces the number of samples that must be maintained, to a level that is manageable without resorting to heuristics that monitor the workload.
This paper introduces a general solution for providing approximate responses to join queries by maintaining a distinguished set of random samples, but it has some drawbacks. It has very few experiments and relies on a single benchmark for performance results. This is problematic because if the authors were using this benchmark while conducting the research, the evaluation may be over-optimistic, as it is what they optimized for. The evaluation section also did not compare against previous work, which may make the perceived improvement seem greater than it is.
This paper proposes join synopses as a better way to generate approximate answers to join queries. It is difficult to provide accurate approximations for joins because errors compound with each table that is joined. For example, if we join two tables after taking a 1/5 sample of each, the resulting join will in expectation contain only 1/25th of the tuples of the actual join, so the error in an approximation based on that joined sample is compounded. |
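The 1/5 × 1/5 = 1/25 shrinkage is easy to check with a quick simulation on hypothetical key sets, where each key of one table joins exactly one key of the other:

```python
import random

random.seed(1)
N = 5000
r1 = list(range(N))          # keys of table 1
r2 = list(range(N))          # keys of table 2; each key joins exactly once

def sample_keys(rel, frac):
    """Take a uniform random sample of the given fraction of a relation."""
    return set(random.sample(rel, int(len(rel) * frac)))

s1, s2 = sample_keys(r1, 0.2), sample_keys(r2, 0.2)
joined = s1 & s2             # joining the two 20% samples on the key
print(len(joined) / N)       # close to 0.2 * 0.2 = 0.04 of the full join
```

A tuple survives into the sampled join only if both of its halves were independently sampled, hence the multiplied fractions.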
The positive contribution that the authors made was an algorithm that provided a better approximation and gave an error bound for that approximation. Aqua, the support system that they built, provides statistic collection, faster responses with query rewriting, and maintenance of join synopses. Since taking a join of random samples will not solve the problem of trying to get an approximation of a join query, the authors propose a concept called join synopses. By computing samples of a small set of distinguished joins, we can get a random sample of all possible joins and answer approximate join queries. The system will then have to maintain these join synopses so that they are always up to date.
The following are what I think are the positives of this paper:
1. The authors created a practical system that implements the algorithm that is proposed in this paper to get results about how accurate the algorithm can be.
2. They conducted multiple experiments on different aspects of the system like the accuracy of the results, the performance of the system and the maintenance of the join synopses.
However, there are some concerns that I have with the paper:
1. The majority of the paper is describing the theorems used to prove that their algorithm works. I would have liked to see examples about how the algorithm is applied with some sample tables and data.
2. This algorithm works with joins, but could the same idea be used to approximate group by queries and holistic queries?
This paper discusses the use of join synopses for approximate query answering. The focus of this paper is creating a structure for maintaining a relatively high degree of accuracy for complex queries written against approximate databases. Several approximate databases, such as BlinkDB, do not support nested queries or some complicated join statements, only accepting queries that can be flattened into single statements. It is difficult to estimate statistical information about join products, as results are often non-uniform after joins and can have fairly small sizes, which creates a large degree of skew. The authors of this paper sought to create a framework that maintains a high degree of accuracy for queries that involve joins.|
In order to do this, the authors implemented join synopses, which sample foreign key joins across the schema and maintain small samples of these join results. The authors state that the statistical methods used in computing these join synopses allow them to create a uniform random sample of the join result for every possible foreign key join in the schema. Additionally, the space required to maintain these random samples was less than 15% of the total database size. This is a significant savings compared to systems like BlinkDB, in which samples can account for 50-100% of the total database size.
In the experimental results section, the authors showed that they were able to answer queries with a fairly high degree of accuracy – typically within 14% – while maintaining very little statistical information. Additionally, the cost over time of maintaining these samples is very low, as only 0.05% to 0.2% of inserts had to be sampled in order to maintain accurate information, even for an insert-heavy workload.
The chief complaint that I have with this paper is that the experimental results were very difficult to understand. In particular, the graphs of space allocated for statistical summaries vs. average extended price were very difficult to interpret. The data is very dense, difficult to distinguish, and seems to jump all over the graph, yet the authors claim to see a clear pattern in the results.
This paper proposes join synopses as an effective solution for approximate queries and shows that they improve the quality of approximate answers. The motivation is that in large data environments, it is often advantageous to provide fast, approximate answers to aggregate queries.|
First, the paper describes the Aqua system. The goal of Aqua is to improve response times for queries by avoiding accesses to the original data altogether. The key components of Aqua are statistics collection, query rewriting, and maintenance. Second, the paper describes join synopses, providing proofs to illustrate its solutions to the problems.
The strength of the paper is that it provides detailed proofs to support its ideas, while the weakness is that it provides few examples when explaining them.
This paper is an overview of why providing good approximate answers to queries is difficult, especially for queries with joins. It introduces “join synopses” as a solution to this problem and shows that using one join synopsis per relation significantly improves the quality of the approximate answers. The goal of this paper is to show a system capable of providing good approximate answers in orders of magnitude less time than it would take to find the exact answer.|
The system introduced in this paper is called Aqua. It is responsible for receiving queries and doing as much computation as it can (potentially providing the query answer) while going to the data warehouse as little as possible. Aqua stores its own metadata, and all new data goes to both the data warehouse and Aqua so that both can make the necessary updates. Aqua creates its join synopses and stores them in a portion of the data warehouse so that it can access a synopsis (which is much smaller than the whole data set) whenever possible, avoiding reads of all the data needed for an exact answer. Joins are difficult, though, because they can produce a very small result or a non-uniform result sample, which decreases the accuracy of the answer and the confidence bounds. The proposed solution is join synopses.
Join synopses are the solution to the join problem mentioned above. It is not feasible to take a sample of every possible join by doing all possible joins and sampling them, so instead they take a sample of a few distinguished joins whose synopsis can be useful for many queries. The paper then goes pretty in depth with mathematical analysis of why these distinguished joins are possible and how they work.
The next part of the paper discusses how to allocate space for each of the join synopses when information about the query workload is either known or not known. The important section of this, in my opinion, is the introduction of three heuristic algorithms: EqJoin, CubeJoin, and PropJoin. As you might imagine, EqJoin divides the space equally among all the synopses, CubeJoin divides it proportionally to the cube root of each join's size, and PropJoin divides it proportionally to each join's size.
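The three heuristics can be sketched as simple weighting schemes. This is my own illustrative rendering with abstract space units and made-up join sizes, not the paper's code:

```python
def allocate(total_space, join_sizes, strategy="EqJoin"):
    """Divide total_space across join synopses using the three heuristics
    the paper names: equally (EqJoin), by the cube root of each join's
    size (CubeJoin), or proportionally to each join's size (PropJoin)."""
    if strategy == "EqJoin":
        weights = [1.0] * len(join_sizes)
    elif strategy == "CubeJoin":
        weights = [s ** (1.0 / 3.0) for s in join_sizes]
    elif strategy == "PropJoin":
        weights = [float(s) for s in join_sizes]
    else:
        raise ValueError(strategy)
    total = sum(weights)
    return [total_space * w / total for w in weights]

sizes = [1000, 8000, 27000]
print(allocate(90, sizes, "EqJoin"))                         # [30.0, 30.0, 30.0]
print([round(x, 6) for x in allocate(90, sizes, "CubeJoin")])  # [15.0, 30.0, 45.0]
```

CubeJoin sits between the other two: it gives bigger joins more space than EqJoin does, but dampens the skew that PropJoin would produce when one join dwarfs the rest.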
One thing I didn’t like about this paper was that I felt it could have incorporated more figures and examples and less math. There is definitely a style of paper that is proof- and math-based, and this was one of those papers, but I prefer example-based ones. Other than that, I thought the paper was good.
Overall, this was a good paper for discussing how to handle joins in approximate query processing, however I am not very interested in the topic. It does a good job of explaining what it does and what is involved in join synopses and shows good results. Overall it is a solid paper worth the read if you are interested in approximate query processing.
Traditional query processing focuses on exact answers; how can we get the full result of a query in the shortest time possible with low overhead? For many applications, many advancements have paved the way for relational systems, but this problem becomes extremely nontrivial once industrial databases have to operate at scale. For example, in distributed systems, complex and clever schemes have to be implemented (usually with some degree of communication overhead) in order to retrieve an accurate result without failure. However, for some applications, if we can obtain a statistical approximation of the data without deviating too far from the exact result, this is good enough, and such slack room gives a lot more to work with in terms of efficiency and recouped overhead. The goal of this paper is to provide efficient algorithms for such data retrieval (particularly in the context of fast joins) in their system called Aqua.
The key component of the Aqua system is keeping track of “synopses,” or small aggregate summaries of chunks of warehoused data. On raw data this is simple enough, but they spend most of the paper considering efficient approximations of high-quality JOIN aggregation. By computing samples of a small set of distinguished joins (synopses), they demonstrate that they can retrieve an approximate random sample of all possible schema joins. They provide, through a relatively confusing probabilistic explanation, bounds on the error of aggregate operations computed over these random join subsets. In addition, they describe *-join strategies for determining how to allocate space to each join. Lastly, if a tuple is deleted, they remove it from the synopses, and if a tuple is added, they probabilistically determine whether or not it should be added to each synopsis.
Their performance benefits are impressive, but I felt their system was lacking in regards to other common aggregate operations, such as GROUP BY or RANK. In the same vein, the solutions only apply to aggregate query operations (what about other operations?), and I felt that the space allocation section of the paper was too important to be briefly summarized the way it was. The progress so far is notable, but it is important to keep in mind they do not exactly provide a fully fledged system like Aqua for a variety of ordinary DB operations. I also felt as though the theoretical explanation on error bounds could have been more descriptive, as their explanation seemed insufficient and relatively hand-wavy. Perhaps this is okay in some database papers, but when the whole idea is to achieve tight bounds on probabilistic error, it is too important to be glossed over.
The paper proposes a method to provide good approximate answers for join queries using only statistics (samples) from base relations, by way of join synopses. One motivation is that approximate query answering has become increasingly essential in data warehousing and other applications. |
In this paper, the join synopses approach is implemented on Aqua, a decision support system based on approximate query answering. Aqua has three key components: statistics collection, query rewriting, and maintenance. The paper then points out the problems with join operations when using base samples: the non-uniform result sample (the join of two uniform random base samples is not a uniform random sample of the join) and the small join result size (the join of two random samples typically has very few tuples). Join synopses compute samples of the results of a small set of distinguished joins so that random samples of all possible joins in the schema can be obtained. In particular, the paper focuses on foreign key joins. The next section explains how to create uniform random samples of any join in a schema. The paper continues with strategies to allocate the available space among the various join synopses. When the query load is known, it selects join synopsis sizes so as to minimize the average relative error over a collection of aggregate queries, based on a characterization of the query set. When the load is unknown, it uses heuristic strategies: EqJoin, CubeJoin, and PropJoin. Next, the paper discusses improving the accuracy measures by using a chunking technique for empirical error bounds. Then it discusses the maintenance of join synopses. Assuming updates are applied in batch mode, join synopses can be kept up to date at all times without a bottleneck. In an online environment, maintenance is done only periodically; approximate queries will not take into account the newly inserted data, weakening the confidence bounds. The experiments evaluating join synopsis accuracy, query execution time, and maintenance show that it is superior to base samples. Lastly, in the related work section, the paper mentions several other approaches to approximate query answering and statistical techniques.
The main contribution of this paper is that it provides a solution for sampling issues in approximate query processing. The join synopses method precomputes just one join synopsis per relation, with the goal of providing an estimated response in orders of magnitude less time than computing an exact answer, by avoiding or minimizing accesses to the base data. The paper shows that using join synopses gives better accuracy than using base samples for computing approximate join aggregates. It also shows that join synopses can be maintained efficiently during updates to the underlying data. It is also good that the paper considers conditions where the workload is known as well as unknown.
However, the paper does not address join operations that are not based on foreign keys, for example, when certain attributes are nullable, when a table is joined on two unique indexes, or when doing a left/right outer join. How would join synopses treat null values? While data warehouse tables are usually simplified, that does not rule out non-foreign-key joins (for analytical purposes).
This paper proposes an approximate join algorithm and its implementation, Aqua. The proposed algorithm, as demonstrated by Aqua, can retrieve join results very quickly, with confidence bounds indicating the quality of the approximation. The paper addresses the problem of approximate results for join queries, which is especially useful in data warehousing, where ad-hoc queries over large-scale data are common. Errors are acceptable since latency is the main concern for ad-hoc queries. Moreover, approximate queries can also be applied to estimate query costs.
The implementation of the approximate join approach, Aqua, maintains small statistical summaries of the data (tables), named “synopses,” to provide approximate results. Synopses are updated with data changes, so query results are never stale. Moreover, Aqua recognizes that joins of samples are not samples of joins, so it maintains a small set of distinguished joins to answer queries related to them. The paper also discusses several criteria for choosing the size of each join synopsis to minimize the error. Aqua additionally provides confidence bounds to indicate result accuracy. The experiments show that Aqua is very fast at querying joined tables and also has very low overhead when updating data.
1. This paper introduces Aqua and its underlying algorithm for giving approximate results for join queries. It has fast response times and low overhead for data changes.
2. This paper identifies several key points for join query approximation, which could be very useful for follow-up work.
3. When discussing the error bound, this paper examines several criteria and the metrics to measure them, which is also insightful and inspiring for follow-up work.
1. Aqua only supports foreign key joins; other operators such as group-by and cluster-by are not supported, or at least not discussed in this paper. It would be more interesting if the paper also discussed implementations for these queries.
2. The experiments only test insertions as data changes, lacking an analysis of deletions. The paper would be more complete and convincing if it also showed the overhead of deletions.