This paper presents Quickr, a system that approximates answers to complex ad-hoc queries in big-data clusters by injecting samplers on the fly, without requiring pre-existing samples. Approximation is hard when the query distribution is heavy-tailed and the queries are complex, but it can improve performance. The goal of the system is to approximate complex queries without assuming an input distribution while still ensuring accurate answers. Quickr has three samplers (uniform, distinct, and universe) and an optimizer, ASALQA, that selects the sampler to use; it estimates costs and pushes samplers across operators when needed. The sampler push rules and the accuracy analysis procedure are described in detail in the paper. The evaluation aims to answer questions about performance (speed-up), accuracy (how often the outputs are correct), and the source of the gains. On the TPC-DS benchmark, Quickr is shown to have good performance and accuracy, outperforming BlinkDB. What I like about this paper is its good evaluation flow, covering both performance and accuracy, with detailed comparisons against other systems. I also like the idea of just-in-time samplers, which is intuitive but obviously requires broad knowledge to implement. What I don't like about this paper is that it is very hard to read. For example, the accuracy analysis provides a lot of mathematical detail that is hard to follow. One thing I didn't expect before reading this paper is the poor performance of BlinkDB on the benchmark presented here. This paper's results show BlinkDB to be barely effective, which is not what the BlinkDB paper reports. This may be due to different assumptions about the input queries and the data tested, since BlinkDB assumes consistent QCSs (query column sets) while Quickr has no such limitation (and I think this is the strongest point of Quickr). |
The growth of big-data ecosystem frameworks like Hive, Pig, and Spark-SQL gives rise to the problem of approximating complex ad-hoc queries. Approximating big-data queries is important because it can significantly reduce the cost of processing queries over large volumes of data on expensive clusters. This paper proposes Quickr, which offers turn-key support for approximation: it decides whether or not a query can be sampled and outputs an appropriate query plan with samplers. Quickr also supports complex queries, covering a large portion of SQL. It doesn’t assume that input samples exist or that future queries are known. Finally, Quickr ensures that answers are accurate and that, with high probability, no groups are missed. The paper first analyzes queries from production clusters. Then it describes the samplers and presents the ASALQA algorithm. Finally, it presents the accuracy analysis and the experimental results. Some of the strengths and contributions of this paper are: 1. Quickr offers a new way to lazily approximate complex ad-hoc queries with zero apriori overhead. 2. It introduces universe, a new sampler operator that effectively samples join inputs. 3. It considers query optimization over samples more extensively, developing the ASALQA algorithm, which automatically outputs sampled query plans only when appropriate. Some of the drawbacks of this paper are: 1. The approximation depends on multiple assumptions, such as output << input, which may not hold in some workloads. 2. Quickr cannot support some online aggregation features, such as continually updating the answer so that the user can let the query run until the answer is satisfactory. 3. Quickr doesn’t use existing indices on tables, such as block-level samples or B-trees, which might make queries more efficient. |
Approximating big-data queries is useful in cases such as: (a) queries that analyze logs to generate aggregated dashboard reports, which, if sped up, would increase the refresh rate of dashboards at no extra cost, and (b) machine learning queries that build models by iterating over datasets (e.g., k-means), which can tolerate approximations in their early iterations. However, the problem is hard because (1) the distribution of queries over inputs is heavy-tailed and (2) queries are complex. Existing state-of-the-art techniques either fail to reason about the accuracy of the resulting answer or do not support joins over more than one large table, queries that touch less frequently used datasets, or query sets that use a diverse set of columns. Therefore, this paper proposes Quickr, which approximates the answer to complex ad-hoc queries in big-data clusters by injecting samplers on the fly, without requiring pre-existing samples. Quickr introduces a new universe sampler that can sample both join inputs, and it injects samplers into the query plan. To do so, it uses statistics such as cardinality and distinct value counts per input dataset, which are computed in a single pass by the first query that touches the dataset. Quickr offers the ASALQA algorithm to decide which sampler to pick and where to place the samplers in the query plan. The key contributions and advantages of the paper are as follows: 1. The proposed framework offers a new way to lazily approximate complex ad-hoc queries with zero apriori overhead. 2. Through careful analysis of queries in a big-data cluster, the paper shows that apriori samples are untenable because query sets make diverse use of columns and queries are spread across many datasets. The large number of passes over data per query makes the case for lazy approximation. 3. The paper introduces a new sampler operator, universe, that effectively samples join inputs. 4. The paper considers query optimization over samples much more extensively. 
The proposed ASALQA algorithm automatically outputs sampled query plans only when appropriate. 5. The system was actually implemented in a production big-data system, which shows that it is practical. The main disadvantage is that the error of the approximate answers, as quantified by the model, can be high, and the authors do not provide a way to reduce it. Another point worth noting is the choice every sampling-based approach faces between collecting samples in an offline stage and drawing samples online after a query is given. The offline approach has very small overhead at query time, as estimation is done only on the sample, but it may suffer from large errors on highly selective queries, as few sampled tuples may satisfy the predicates. The online approach, which Quickr follows by injecting samplers into the query plan, is more effective with selective predicates, as it can focus the sampling on only those tuples that pass the predicates. However, it incurs a higher cost at query time to retrieve tuples from the base tables, which may reside on disk or even on remote machines. So there is a trade-off between these two methods. |
Problems & Motivations: Data scientists routinely run many complex queries over large amounts of data. They can tolerate imprecise answers for exploratory analysis, yet long for shorter response times. However, state-of-the-art techniques do not support approximation for complex queries. Intuitively, they only sample column data, and they cannot sample join queries efficiently. Main Achievement: The authors propose Quickr. The key insight of Quickr is that it looks for a sampled query plan rather than a sample of the data. In other words, when the user submits a complex query, instead of finding the sample of the data that best fits the actual result, Quickr tries to find the sampled query plan whose result best fits the actual result. To achieve this, Quickr applies universe sampling on the join key. Drawbacks: The paper rests on the assumption that the group columns are independent of the join keys, so that different groups have a similar likelihood of being sampled. However, this leads to the problem mentioned in BlinkDB. |
In the age of big data, there has been much interest in developing database systems that are able to handle queries in an approximate fashion, in order to provide meaningful information within specified error bounds without taking an unacceptable amount of time. While many approximate query techniques and systems have been proposed, most of them are unable to approximate complex queries. As the paper mentions, most use the uniform sample operator, which does not allow systems to reason about the accuracy of the answer. On the other hand, other systems that do build samples over the datasets work best on one large dataset, and cannot support join operations over a wide array of use cases like multiple large tables or a diverse set of columns, which dominate the workload in big data clusters. Quickr is a system presented by this paper that offers turn-key support for approximations, in that when it is given a query, it can decide whether it can be sampled, and develop an appropriate query plan. Additionally, it is able to support a wider range of complex queries, and it does not make assumptions about input samples or future queries. Finally, the authors claim that Quickr is able to achieve results within a bounded ratio of their true value. In contrast to other systems like BlinkDB that perform apriori sampling, Quickr reads through all of the data at least once, before using it in subsequent steps, which avoids the initial sampling overhead. Additionally, Quickr starts by searching for the best query plan such that Q’(I) ≈ Q(I), as opposed to apriori methods that look for the best sample of input I’ such that Q(I’) ≈ Q(I). This allows Quickr to avoid the costs of storing additional samples, which can range from 1 to 10x of the original input size (compared to ~20%), which makes it potentially faster. Of course, queries vary in the degree to which they can be approximated. 
Additionally, as mentioned before, Quickr performs just-in-time sampling, generating an execution plan at query optimization time that includes appropriately placed samplers. It uses three types of samplers: uniform, distinct, and universe samplers, with uniform being the most general, and the other being more appropriate for other use cases. Given a query plan, Quickr is able to provide unbiased estimates and confidence intervals of aggregate values, as well as the probability of missing groups, using the Horvitz-Thompson estimator with some optimizations. The main strength of this paper is that it presents a new system that is able to combine many of the advantages of approximate query engines while avoiding some of the overhead associated with the more well-known state of the art systems, such as BlinkDB and its apriori sampling methods, while also supporting more complex queries. As the authors’ results show, Quickr is able to lower resource usage by over 2x for the median query, with only about 20% of the queries having slightly longer runtimes. Their system was compared to a system that is identical in all respects except for the use of samplers. When compared to the state-of-the-art system in BlinkDB, they obtained a median speedup of 24%. This advantage was magnified as the storage budget was reduced. Some of the insights that the authors introduced, like adding samplers during the query optimization stage, had not been done in approximate query engines up to then. Finally, their accuracy analysis technique is in general faster, simpler, and is able to provide reasonable guarantees on the accuracy of the data. This system is impressive in what it introduces, but one area that the authors did not touch on as much was how well the system works with big-data centric workloads, especially in comparison with other systems like BlinkDB. Their results were based on running from the TPC-DS benchmark, which they mention is simpler than queries in their own cluster. 
The reasoning for using TPC-DS is to have something that could be shared publicly, which is understandable, but an analysis of a more complex workload would have been helpful in seeing the performance gains and trends of Quickr. |
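To make the Horvitz-Thompson step mentioned in this review concrete, here is a minimal Python sketch of the textbook estimator for a SUM aggregate. It assumes rows are sampled independently (as with the uniform sampler), so each kept row carries its own inclusion probability; the function names and the normal-approximation confidence interval are my own assumptions for illustration, not Quickr's implementation. Samplers that correlate rows, such as the universe sampler, would need a different variance analysis.

```python
import math

def ht_sum(sample):
    """Horvitz-Thompson estimate of a population SUM.

    `sample` is a list of (value, inclusion_probability) pairs for the rows
    that survived sampling; each row is scaled up by 1 / probability.
    """
    return sum(y / pi for y, pi in sample)

def ht_variance(sample):
    """Estimated variance of ht_sum, valid when rows are sampled independently."""
    return sum((1 - pi) / pi**2 * y**2 for y, pi in sample)

def ht_confidence_interval(sample, z=1.96):
    """Normal-approximation interval around the estimate (an assumption)."""
    estimate = ht_sum(sample)
    half_width = z * math.sqrt(ht_variance(sample))
    return estimate - half_width, estimate + half_width

if __name__ == "__main__":
    # Two rows kept by a sampler that passes each row with probability 0.5.
    kept = [(10.0, 0.5), (20.0, 0.5)]
    print(ht_sum(kept), ht_confidence_interval(kept))
```

The weight column that the samplers append corresponds to 1 / inclusion_probability here, so the estimate is simply the weighted sum of the surviving values.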
Quickr: Lazily Approximating Complex Ad-Hoc Queries in Big-Data Clusters. Quickr is a system that can approximate complex ad-hoc queries in big-data clusters by injecting samplers, without requiring pre-existing samples. The paper starts with why approximating jobs is difficult and why it would be widely useful. In big-data clusters, queries are approximable, but approximation is relatively hard to achieve. First, the distribution of queries over inputs is heavy-tailed, which means the workload is unbalanced: a large share of the queries touch a small portion of the datasets, while the remaining queries are spread over a large portion of the datasets. Second, queries are complex: they contain many joins, which makes the execution graph very deep. However, the output is small compared with the input, so there is potential for speed-up. Approximating big-data queries has many use cases because people do not always expect precise answers, for example queries that analyze logs to generate aggregated dashboard reports and machine learning queries that build models by iterating over datasets. However, existing algorithms do not meet these requirements. Quickr therefore sets out to achieve the following. First, offer turn-key support, which means deciding whether a query can be sampled and outputting a query plan with samplers. Second, handle very complex queries, which may contain joins over multiple tables. Third, make query execution independent of existing intermediate results or past history. Fourth, provide a bound guaranteeing that the result is accurate with a certain probability. The main ideas in Quickr are the samplers and ASALQA. It introduces a universe sampler (one of three samplers altogether) that can sample both join inputs, alongside a uniform sampler that simulates a Bernoulli process. It injects samplers into the query plan using statistics such as cardinality. 
Quickr also offers the ASALQA algorithm for picking the samplers and placing them in the query plan. The algorithm begins by optimistically placing a sampler before every aggregation and then generates plan alternatives that move samplers closer to the input. Finally, the algorithm picks the best plan that meets the accuracy requirements. The main contributions of this paper: It provides a lazy way to approximate complex ad-hoc queries with little overhead. The proposed universe sampler operator effectively samples join inputs. The ASALQA algorithm automatically generates sampled query plans. The experiments on the TPC-DS benchmark show improvements in runtime and reductions in resource usage. On the missing-groups metric, Quickr misses no groups for over 90% of the queries. As for drawbacks, the error of the answers can be high, and Quickr does not handle this case carefully. But as the paper states, this problem is hard and none of the previous methods does better. Overall, this paper achieves very good results on speed and accuracy for queries. |
The paper proposes Quickr, another approximate query engine that works on huge amounts of data. Quickr does not need pre-built samples, which means it does not rely on assumptions drawn from past workloads. There are three things that Quickr tries to achieve: first, not relying on predictions of the future workload; second, supporting complex queries; third, offering a strong guarantee on the accuracy of the approximation. The method Quickr uses to sample online is called inline sampling. Since complex queries are observed to take several passes over the data, the overhead of Quickr is acceptable, and it provides decent accuracy. Quickr also introduces a new universe sampler that can sample both join inputs. The idea is to project the join keys of both inputs into a high-dimensional space and pick the same portion of that space on both sides. This guarantees an unbiased sample of the join. It raises the probability that a join output row is retained from p^2 to p. Quickr has three kinds of samplers: the universe sampler described above, a uniform sampler, and a distinct sampler. The uniform sampler mimics a Bernoulli process: it lets each row pass with a given probability p. The distinct sampler is used to guarantee that all groups are covered. All samplers can run in parallel during the first pass. The paper also introduces a novel notion of dominance: given two query expressions E1 and E2 that produce identical output when run without samplers, E1 is said to dominate E2 if and only if acc(E1) >= acc(E2). The main contribution of this paper is a lazy approximate engine for complex queries. Secondly, the paper finds that the large number of passes per query makes lazy approximation possible by sampling during the first pass. Quickr is implemented in a production big-data system. The limitation of this work is that the whole system is designed for very complex queries, and the sampled data may not be reused. 
What if the workload shifts to one-pass queries? The paper does not give a solution for that situation. |
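The uniform and distinct samplers described in this review can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names, the per-group threshold `delta`, and the rule "pass the first `delta` rows of each group, then sample the rest with probability p" are simplifications I chose, not the paper's operators. Each surviving row is paired with a weight, as the samplers are said to append one.

```python
import random
from collections import defaultdict

def uniform_sampler(rows, p, rng):
    """Bernoulli sampling: pass each row independently with probability p."""
    for row in rows:
        if rng.random() < p:
            yield row, 1.0 / p  # weight used later to scale aggregates

def distinct_sampler(rows, key, p, delta, rng):
    """Keep at least `delta` rows per group so that no group is missed.

    The first `delta` rows of every group pass with weight 1; later rows of
    the same group are sampled with probability p (a simplifying assumption).
    """
    seen = defaultdict(int)
    for row in rows:
        group = key(row)
        if seen[group] < delta:
            seen[group] += 1
            yield row, 1.0
        elif rng.random() < p:
            yield row, 1.0 / p

if __name__ == "__main__":
    rng = random.Random(42)
    # One large group "a" and one tiny group "b" that uniform sampling may miss.
    rows = [("a", i) for i in range(1000)] + [("b", 0)]
    uniform_groups = {r[0] for r, _ in uniform_sampler(rows, 0.01, rng)}
    distinct_groups = {
        r[0] for r, _ in distinct_sampler(rows, lambda r: r[0], 0.01, 1, rng)
    }
    print("uniform sees groups:", uniform_groups)
    print("distinct sees groups:", distinct_groups)
```

With `delta` at least 1, every group appears in the distinct sample regardless of the random draws, which is the coverage property the review attributes to that sampler.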
QUICKR is a new system for approximate query planning. The motivation behind approximate query processing is similar to the motivation described in BlinkDB. However, as QUICKR is a project from Microsoft Research, it specifically focuses on queries from a production cluster at Microsoft. The queries have aggregations, but also have a significant number of joins, and touch many columns. The paper also has a focus on reducing storage costs when compared to other systems, and makes it clear that one goal is to improve resource usage - not just query runtime. The authors describe three types of samplers, which pass on a subset of the input rows: uniform, distinct, and universe. The universe sampler is particularly interesting because it allows for effective sampling for joins. It does so using a hashing-like technique. Another difference between QUICKR and other similar systems is that sampling is linked more closely with the query planning/optimization. Sampling can happen at various points in the query plan. One of the key concepts in this paper is the ability to work with joins. The authors do a very good job describing why approximate query processing has traditionally not worked well with joins. It is hard to sample both inputs to a join and still get a reasonable result, but QUICKR is able to do just that, which is a big contribution to the field. Another small thing that stood out to me in the paper was the mention that the authors “vetted both our methodology and the results with the authors of BlinkDB.” This is a kind of collaboration that I haven’t noticed much in other papers that are comparing their systems to others, and I think that it really shows a desire to ensure that the results achieved are correct. I thought that some of the graphs and tables chosen could have used more consideration. For instance, Figure 8a has an x-axis representing a ratio between values for Baseline and QUICKR. 
This seemed to me like an odd way to represent the improvement from QUICKR, which could have been shown in a more traditional graph. I also found the tables with percentiles (table 3, figure 2a) to not be ideal, as they tried to represent extremely varied data in a single table. Finally, many tables showed percentiles, but they did not all use the same percentiles as their columns. |
This paper introduces Quickr, an approximate database that is able to answer complex ad-hoc queries on large-scale data clusters. The major difference between Quickr and BlinkDB is that BlinkDB creates samples before queries are executed, while Quickr injects samplers on-the-fly without apriori sampling. Quickr uses this method based on the following key observations. First, most big-data queries are very complex and require multiple passes over data. This means if data were sampled in the first pass, all subsequent computation could be sped up. That’s why inline sampling can improve performance. Otherwise, if only one pass is needed, then without apriori sampling, there won’t be any performance gain. The second observation is that big-data queries access many different inputs. This means there are no small sets of QCSs that is beneficial to all queries. Under this situation, apriori sampling requires storage space that is a substantial fraction of the total input size for differently stratified samples but only provides little performance improvement. Other contribution of the paper includes a new universe sampler operator that can sample both join inputs and the ASALQA algorithm which is a cost-based query optimizer capable of output execution plan with appropriate sampler and reason about its performance and accuracy. The universe sampler works by projecting the value of the join key into a high dimensional space (using a hash function) and picks all rows in both relations whose value of the join keys after projection falls in a chosen subspace. It can be shown that using this method ensures a p probability sample of the inputs is statistically equivalent to a p probability sample of the join output. This means there’s no degradation in answer quality. ASALQA algorithm works by first injecting sampler into query execution tree before every aggregation, and then applying a set of transformation rules to push samplers closer to raw input and generate new plans. 
Finally, data statistics are used to calculate plan costs and identify the best plan. One weakness of the paper is that it’s much harder to read than the BlinkDB paper, especially the ASALQA part. I think it will be better if it can omit some (or move to appendix) unnecessary details (for example, the detailed pseudocode for pushing samplers past join). |
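The universe sampler described in this review can be illustrated with a small Python sketch. It rests on my own assumptions: the join key is hashed with SHA-256 into [0, 1), and the "chosen subspace" is the interval [0, p). The helper names are hypothetical and weights are omitted for brevity. The point of the example is that joining the two sampled inputs gives exactly the rows of the full join whose key falls in the chosen subspace.

```python
import hashlib

def in_universe(key, p):
    """Hash the join key to a point in [0, 1) and keep it if it lies in [0, p)."""
    digest = hashlib.sha256(repr(key).encode()).digest()
    point = int.from_bytes(digest[:8], "big") / 2**64
    return point < p

def universe_sampler(rows, key, p):
    """Keep every row whose join key hashes into the chosen subspace."""
    return [row for row in rows if in_universe(key(row), p)]

def hash_join(left, right, lkey, rkey):
    """Plain equi-join, used only to compare the sampled and full joins."""
    index = {}
    for r in right:
        index.setdefault(rkey(r), []).append(r)
    return [(l, r) for l in left for r in index.get(lkey(l), [])]

if __name__ == "__main__":
    orders = [(i % 40, "order%d" % i) for i in range(200)]
    customers = [(c, "cust%d" % c) for c in range(40)]
    key = lambda row: row[0]
    p = 0.25

    # Sample both inputs with the same hash and the same subspace, then join.
    sampled_join = hash_join(
        universe_sampler(orders, key, p), universe_sampler(customers, key, p), key, key
    )
    # Reference: join everything, then keep output rows whose key is in the subspace.
    full_join = hash_join(orders, customers, key, key)
    reference = [(l, r) for l, r in full_join if in_universe(l[0], p)]
    print(sampled_join == reference, len(sampled_join), len(full_join))
```

Because both sides apply the same deterministic predicate to the same key, matching rows are always kept or dropped together, which is why sampling the inputs does not thin out the join output further.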
In the paper "Quickr: Lazily Approximating Complex AdHoc Queries in BigData Clusters", Srikanth Kandula and co-authors (Microsoft) develop Quickr, a system that approximates the answer to complex ad-hoc queries in big-data clusters. It does so by injecting samplers on the fly, without the need for pre-existing samples. Due to the growth of frameworks such as Hive and SparkSQL, jobs that are represented as a mash-up of relational expressions and user-defined code have become increasingly dominant. These queries are candidates for approximation, but they are complex and spread out across many datasets. Previous works, such as BlinkDB, lack support for these complex queries. Specifically, when run on the TPC-DS benchmark, BlinkDB benefited only 11% of the queries. Thus, Quickr has four goals: 1) When given a query, decide if it can be sampled and output a query plan with samplers 2) Support complex queries 3) Don't assume input samples exist or future queries are known 4) Ensure that all answers are accurate with high probability (±10%). This way, Quickr is able to lazily approximate complex queries with zero apriori overhead. Thus, it is clear that this is both a necessary and interesting system to develop. The paper is divided into several subtopics: 1) Approximability of Big Data Queries: One thing the paper analyzes is the concept of "heavy tails". When observing which data the queries touch, a striking pattern appears. Jobs that account for half the cluster-hours touch 20PB of distinct files, while the last 25% of queries (i.e. the tail) access another 60PB of files. Thus, Quickr has to optimize for the queries that overlap with the largest amount of data to save on storage overhead. Quickr also operates under the assumption that queries will have aggregates and joins. These hinder approximability, but approximation is still possible because queries make multiple effective passes over data. 
2) Just-In-Time Sampling (JIT): Quickr uses statistics of the input data sets to generate an execution plan with samplers placed at appropriate locations. The three types of sampler are: uniform sampler, distinct sampler, and the universe sampler. All three of these methods are applicable - they work commutatively with a database. Quickr considers various join orders, choices of stratification, universe columns, and choices of sampler locations to pick an appropriate plan. Consequently, many otherwise unapproximable queries benefit from Quickr. However, the distinct and universe sampler need to be analyzed for accuracy in a different way due to that nature of sampling. 3) Evaluation: There are several questions to answer: What is the query speed up? How often are the plans correct? Where do these gains come from? Well, we notice that there is a speed up of up to 6x, with the median being 1.6x. 80% of these queries lie within +- 10%, while 90% are within +- 20%. Quickr offers substantial performance gains to a large fraction of the complex queries in the TPC-DS benchmark. Since production queries have many more passes over data, there are larger overall gains. Even though Quickr is an efficient system that solves approximation problems with low overhead, it still has many drawbacks. One drawback is the lack of a "future work" section even though they kept mentioning that Quickr had promising results. I felt that Quickr was very similar to Online Aggregation, but differed in terms of three things: being able to automatically decide which queries could be sampled, placing appropriate samplers at their correct positions, and an accuracy guarantee. Thus, one could imagine the speedup they could gain if they made some assumptions about their workload (like BlinkDB or Materialized views). Another drawback (which is more of a complaint) is the density of the material within this paper. 
The authors tried to fit 30 pages of content into 10 pages and that made the paper quite difficult to read and comprehend. Lastly, since the authors were Microsoft employees, they could have done some of their evaluations on Microsoft workloads and assessed Quickr's performance in real life scenarios. |
This paper describes Quickr, a DBMS that uses samples on large data sets to speed up queries. Quickr builds off of a system like BlinkDB, which makes samples of large tables to be used on later queries. However, the paper acknowledges two issues with this approach that it tries to fix. First, uniform and stratified sampling only work properly when a single table is sampled; if multiple samples are joined together, the sampling of the join is inaccurate. Second, building samples ahead of time can make it difficult to match an appropriate sample to incoming queries. For the first issue, Quickr uses three kinds of table samples. The first is a simple uniform sample, which is easy to implement and has low variance, but doesn’t always get samples from groups with few rows. The second kind of sampling is stratified sampling, or distinct sampling as this paper calls it, where each group of columns is guaranteed to have a few rows in the sample. This avoids some of the problems of a uniform sample, but if there are too many groups, then the distinct sample may not be much smaller than the original table. In addition, both of these samples don’t work very well for joins; an even sampling of a joined resultset is statistically different from the join of two samples. As such, the third type of sampling is universe sampling. Each row is projected into a higher dimension space by adding a new column. This column is generated by some hash of an existing column. Then, a uniform sampling is performed on the new column. This allows sampling to persist properly across joins, as long as the joined tables are joined on the same sampled column. As for using samples, especially with many different kinds of query column sets, it becomes infeasible to select samples before queries are known. Instead, Quickr creates samples for each query as they come in. Sampling is part of the normal query plan, and is taken into account by the optimizer. 
The samples themselves are treated as any other operation, and can be pushed past other predicates and joins under the right circumstances. This paper’s big contributions are universe sampling and just-in-time sample creation. Being able to join samples effectively allows for samples to be much more broadly used, and creating samples only when they’re needed makes sampling much more versatile, and reduces the burden on the user to come up with effective samples. On the other hand, requiring just-in-time queries increases the complexity of query planning, which increases optimization time, and making the samples increases query runtime. In addition, the paper might be improved by a more thorough explanation of why universe sampling works. There is an intuitive statistical explanation, but not a specific formulation. |
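One way to make the statistical intuition concrete, as a sketch in my own notation rather than the paper's formulation: assume h is a hash that maps join keys uniformly into [0, 1), p is the sampling probability, and (l, r) is an output row of the join formed from rows l and r that share join key k.

```latex
\begin{align*}
\text{independent uniform samples:}\quad
  & \Pr[(l, r) \text{ survives}]
    = \Pr[l \text{ kept}] \cdot \Pr[r \text{ kept}]
    = p \cdot p = p^2 \\
\text{universe samples on } k:\quad
  & \Pr[(l, r) \text{ survives}]
    = \Pr[h(k) < p]
    = p
\end{align*}
```

In the universe case both rows are kept or dropped together, because the decision depends only on the shared key, so the join of the two samples is exactly the full join restricted to keys with h(k) < p. The rows are not sampled independently, though: all rows with the same key share one fate, which would explain why the accuracy of this sampler has to be analyzed differently from the uniform one.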
This paper presents a way to inject samplers on the fly, without requiring pre-existing samples, to deal with complex ad-hoc queries. It considers the problem of approximating jobs in big-data clusters. The problem deserves separate consideration in big-data clusters because the distribution of queries over inputs is heavy-tailed and queries are usually complex. The paper evaluates the approximability of queries: the goal of approximation is to avoid processing all the data, yet still obtain an unbiased estimate of the answer per group that is within a small ratio of the true answer with high probability, ensure that no groups are missed, and offer an estimate of the expected error per aggregation. The paper presents a few observations about the workload: 1. the distribution of queries over input datasets is heavy-tailed, 2. queries typically have aggregation operators, large support, and output ≪ input, so they are approximable, 3. several factors hinder approximability, for example queries use a diverse set of columns, requiring extensive stratification, 4. queries are deep, involving multiple effective passes over data, including network shuffles. The paper then presents just-in-time (JIT) sampling. The goals of JIT sampling are to minimize overhead to the administrator; to support a large fraction of the queries in SQL and big-data scenarios; to deliver sizable performance gains by reducing the resource needs of a query, its completion time, or both; and to offer accurate answers, which means that, with high probability, no groups are missed. Quickr uses three types of samplers: uniform, distinct, and universe. Each sampler passes a subset of the input rows, chosen according to the sampler's policy. In addition, each sampler appends a metadata column holding the weight associated with the row. The samplers have limitations and interact with one another. 
By integrating with the query optimizer, the three samplers together expand the range of queries to which sampling applies: Quickr considers various join orders, choices of stratification and universe columns, and choices of sampler locations to pick an appropriate plan. The paper then describes the experiments that evaluate the proposed method against others and show the effectiveness of Quickr. The advantage of this paper is that it covers both the advantages of and the potential issues with Quickr. For the potential problems, the paper also explains approaches to solving them in detail. The disadvantage of this paper is that it does not give enough background before diving into technical details. |