Review for Paper: 31-BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data

Review 1

Users of large databases sometimes need a query answered very quickly, such that they are willing to trade off result accuracy for response time. Prior approaches to approximate query answering use various methods to sample the input data, reading only the sampled portion for faster results. Unfortunately, much earlier work assumes complete knowledge of future query types, which is unrealistic in some use cases. In contrast, methods for approximate queries like online aggregation are reasonable if queries are completely unpredictable, but they may require reading the full data to answer queries on uncommon record types within narrow confidence intervals.

The designers of BlinkDB propose a new method based on frequent “query column sets” (QCSs), the groups of columns used in aggregation or filtering clauses. A set of stratified samples is maintained for each frequent query column set, and a sample selection module chooses which samples to use to answer a new query at run time. Stratified samples allow BlinkDB to achieve good performance even for infrequently occurring groups, such as sales in small towns in a sales database. Within each stratified sample, every group is allocated the same number of rows (unless a group has too few rows in the input data), which minimizes the mean of the expected error across groups.
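The per-group cap described above can be sketched in a few lines. This is a toy illustration only; the `stratified_sample` helper and the sales table are invented for this sketch and are not the paper's actual implementation:

```python
import random
from collections import defaultdict

def stratified_sample(rows, qcs, cap):
    """Group rows by their values on the query column set (QCS),
    then keep at most `cap` rows per group, chosen uniformly at
    random within the group. Groups smaller than `cap` are kept whole."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in qcs)
        groups[key].append(row)
    sample = []
    for members in groups.values():
        if len(members) <= cap:
            sample.extend(members)
        else:
            sample.extend(random.sample(members, cap))
    return sample

# Toy sales table: 'city' is the QCS; the small town is the rare group.
rows = [{"city": "NYC", "sale": i} for i in range(1000)] + \
       [{"city": "Smallville", "sale": i} for i in range(5)]
s = stratified_sample(rows, ["city"], cap=50)
# NYC is capped at 50 rows; all 5 Smallville rows survive.
```

The point of the cap is visible in the last two lines: the dominant group is shrunk drastically while the rare group is preserved in full, which is why queries on rare groups stay accurate.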

The paper’s main contribution is the design of BlinkDB, an approximate query answering module that sits on top of Hive and Hadoop. BlinkDB allows users to specify either a maximum run time or a probably approximately correct bound on the result returned. BlinkDB runs the query on small sub-samples of each of its pre-computed samples, to estimate the query’s running time and error characteristics on each sample. Using these estimates, BlinkDB produces a plan for querying the best samples, using the maximum amount of data that fits in the time bounds (or minimum that fits in the correctness bounds).

One limitation of BlinkDB’s approximate query answering scheme is that it may not work well for queries that filter on columns that were not included in the pre-computed stratified samples. In an environment with many ad hoc queries of an unpredictable nature, a system more like online aggregation may be necessary, in spite of its worst case of sampling the full data.



Review 2

The paper focuses on BlinkDB, a parallel approximate query engine for running interactive SQL queries on huge amounts of data. Today's big data analytics workloads involve terabytes of data that users need to access within seconds. This requires sampling the data and returning approximate results, trading accuracy for faster response times. But existing tools cannot keep up with the ever-growing workload: online aggregation (OLA) performs poorly on rare tuples, while sampling and sketches limit the types of queries they can execute. To mitigate these problems, the authors propose BlinkDB.

BlinkDB extends the Apache Hive framework by adding two major components: an offline sampling module that works over time to create samples, and a run-time sample selection module that creates an Error-Latency Profile (ELP) for queries. It supports only a constrained set of SQL-style declarative queries. It creates samples targeting queries with a given Query Column Set (QCS); these samples are optimized both for a single query and, as a set, for all queries sharing a QCS. The optimization problem takes three factors into account: the sparsity of the data, workload characteristics, and the storage cost of samples. The authors then describe BlinkDB's runtime. To pick the best possible plan, BlinkDB's run-time dynamic sample selection strategy executes the query on a small sample of data and gathers statistics about the query's selectivity, complexity, and the underlying distribution of its inputs. The authors implemented BlinkDB on top of the Hive query engine, which supports both Hadoop MapReduce and Spark at the execution layer and uses the Hadoop Distributed File System.

The paper is successful in providing a comprehensive study of BlinkDB. The background section provides relevant taxonomy, and comparable approaches are usually represented pictorially, making the ideas clear. BlinkDB is implemented on the existing Hive system, which is used massively in today's big data workloads. Its performance is impressive, answering a range of queries within 2 seconds on 17 TB of data with 90-98% accuracy.

The relational algebra in the paper is a little confusing, with little explanation of how it works. The queries supported at this time are limited to a handful, and the lack of discussion of future work raises doubts about the extensibility of the system.



Review 3

Modern data analytics applications involve computing aggregates over a large number of records. Traditionally, such queries have been executed using sequential scans over a large fraction of a database. Over the past two decades, a large number of approximation techniques have been proposed that allow fast processing of large amounts of data by trading result accuracy for response time and space. However, this paper argues that none of the previous solutions are a good fit for today's big data analytics workloads.

The paper presents BlinkDB, a massively parallel, approximate query engine for running interactive SQL queries on large volumes of data. BlinkDB allows users to trade off query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars.

The first key idea is an adaptive optimization framework that builds and maintains a set of multi-dimensional stratified samples from the original data over time. There are two parts to this: stratification ensures that rare groups are sampled, while the “multi-dimensional” aspect uses workload information, turning sample creation into an optimization problem.
The other key idea is a dynamic sample selection strategy that selects an appropriately sized sample based on a query's accuracy or response time requirements. BlinkDB accomplishes this by constructing an Error-Latency Profile (ELP) for the query. The ELP characterizes the rate at which the error decreases with increasing sample size, and is built simply by running the query on smaller samples to estimate the selectivity and project latency and error for larger samples.
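A minimal sketch of this extrapolation, assuming the textbook model that sampling error shrinks as 1/sqrt(n) and scan latency grows linearly with sample size. The function names and all numbers below are invented; BlinkDB's real ELP is built per sample from pilot runs, not from this closed-form model:

```python
import math

def extrapolate_elp(n_pilot, err_pilot, time_pilot, candidate_sizes):
    """Build a rough error-latency profile from one pilot run.
    Assumes error ~ 1/sqrt(n) and latency ~ n (back-of-envelope model)."""
    profile = []
    for n in candidate_sizes:
        err = err_pilot * math.sqrt(n_pilot / n)
        latency = time_pilot * (n / n_pilot)
        profile.append((n, err, latency))
    return profile

def pick_sample(profile, max_error=None, max_latency=None):
    """Smallest sample meeting the error bound, or the largest
    sample meeting the latency bound."""
    if max_error is not None:
        feasible = [p for p in profile if p[1] <= max_error]
        return min(feasible, key=lambda p: p[0]) if feasible else None
    feasible = [p for p in profile if p[2] <= max_latency]
    return max(feasible, key=lambda p: p[0]) if feasible else None

profile = extrapolate_elp(10_000, err_pilot=0.08, time_pilot=0.05,
                          candidate_sizes=[10_000, 40_000, 160_000, 640_000])
best = pick_sample(profile, max_error=0.025)
# err(160k) = 0.08 * sqrt(10k/160k) = 0.02 <= 0.025,
# so the 160k-row sample is the smallest feasible choice.
```

The same profile answers both question forms the paper supports: given an error bound, take the smallest feasible sample; given a time bound, take the largest one that still fits.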

I like two main ideas in this paper: the use of stratified samples over multiple dimensions, and the use of error-latency profiles to estimate the error or response time on each available sample.


Review 4

This paper presents a new system called BlinkDB, a distributed sampling-based approximate query processing system. In short, BlinkDB provides a tradeoff between query accuracy and response time, and it can query a very large amount of data (TBs) in seconds with meaningful error bars. There are two key ideas in BlinkDB: one is an adaptive optimization framework and the other is a dynamic sample selection strategy. The paper first introduces the background and gives an overview of the system. Then it moves to the design details of sample creation and optimization, as well as query execution in BlinkDB. Next it covers the implementation and performance evaluation. Finally, it discusses related and future work.

The problem is that modern data analytics applications involve computing over very large amounts of data, which conflicts with the need to provide immediate answers. For example, older systems might use a sequential scan over a large fraction of the data, while new applications such as online transactions need real-time response rates. Therefore, we need to trade query accuracy for response time and space. Several existing methods, such as sampling and OLA, address this, but they all have shortcomings. This paper presents BlinkDB as a better solution.

The major contribution of the paper is that it provides detailed design concerns and ideas for the new system, BlinkDB. Through performance evaluation on the traditional TPC-H benchmark and a real-world analytic workload from Conviva, it shows that BlinkDB can answer queries on up to 17 TB of data in less than 2 seconds (within an error of 2-10%), over 200x faster than Hive. Here we summarize the key elements of BlinkDB:

1. a column-set based optimization framework (stratified samples)
2. the optimization considers the frequency of rare subgroups, the column sets in past queries, and the storage overhead of each sample
3. error-latency profiles (ELP) to estimate query error or response time on each available sample
4. integration into the existing parallel query processing framework Hive


One interesting observation: this paper is very innovative, and the idea of a multi-dimensional sampling strategy is very interesting. It provides better results than previous sampling methods, including better performance on rare subgroups. One possible weakness is that the paper does not adequately explain how the dynamic sample selection strategy was conceived.


Review 5

This paper is another in the recent string of papers attempting to address some issue of data "at scale", meaning larger than a typical one-machine system can handle. We've seen MapReduce, Hadoop, Spark, Bigtable, and SparkSQL so far. This system attempts to go one step further than the previous systems. It is designed to be fast, providing results bounded by some percentage error with some degree of confidence; alternatively, the system can provide the most accurate results within a given time frame. For many business intelligence systems, it is more important to get results quickly than to have 100% accuracy; users are just looking for trends or patterns, and will generally drill down further to gain more insights. The average user does not care about the average to the 14th decimal place, or the exact number of clicks on an ad; the nearest 10, or even 100 or 1000, is probably fine.

This paper is unique in the way it achieves its approximation. It selects several subsets of the data, and then builds sample families on top of those results. This optimization framework allows BlinkDB to return results quickly for cases as different as computing an average where x.city = "New York, NY" vs. x.city = "Westland, MI". In the New York case, it can quickly scan enough tuples where the city is New York to quit the scan early. In the Westland case, it uses stratification to select enough Westland tuples to create an accurate (enough) response.
One feature I do really like in the system is the ability to bound a query in response time - this would allow the database to be used for applications that are user-facing, and the time bound could be adjusted with user studies to find the correct match between accuracy and delay.

One weakness of the paper is that it assumes that the past workload is a good predictor of the future workload. While this may be true in some cases, if this system is truly meant to be interactive, with teams of data scientists querying the database to look for insights, then the query workload could dramatically shift when the teams start looking for new patterns.


Review 6

Problem and solution:
The problem posed in the paper is that modern data analytics applications must handle data of large volume and many dimensions. The traditional, exact approach is to execute queries using sequential scans, which is quite slow, since large tables cannot fit in memory and require many disk accesses; it cannot satisfy real-time response requirements. The solution is to use approximation techniques that sacrifice some result correctness to save response time and space. Two previous approximation approaches are sampling, which creates an in-memory sample and executes the query over the sample instead of the total dataset, and online aggregation. But both methods have weaknesses: sampling places strong limitations on the types of queries it can execute, and online aggregation is not time-efficient for queries on rare tuples. The solution proposed in the paper is BlinkDB, which extends the Hive framework and concentrates on approximation. It is a sampling-based approximate query engine and is shown to have much higher performance than Hive.

Main contribution:
I think the main contribution is that BlinkDB introduces two modules that improve sampling-based approximation, achieving a better balance between time and space efficiency on one hand and accuracy on the other. These modules also let BlinkDB offer customized accuracy and time configurations and support more general queries. One module is sample creation: it runs offline and creates stratified samples according to past query column sets and their historical frequencies, ensuring that the engine does not overlook rare tuples and stays efficient whether tuples are rare or not. The other module is sample selection: it executes at run time and picks the best sample by testing the query on multiple smaller sub-samples to achieve the best error rate and response time.

Weakness:
Though BlinkDB is quite helpful for applications needing real-time responses, I think it has a problem. The sample creation module is executed offline, which means the samples may not always be faithful to the live database. If the data changes very quickly, the samples can easily become out of date, and in that situation the accuracy may not be guaranteed.


Review 7

When the data to be analyzed is very large, queries take a long time to answer because they need a lot of I/O to stream through all the tuples. Approximating the query result is thus a good way to achieve fast response times. The paper presents BlinkDB, a massively parallel, approximate query engine for running interactive SQL queries on large volumes of data.

BlinkDB consists of two main modules: Sample Creation and Sample Selection.
The sample creation module creates stratified samples on the most frequently used QCSs to ensure efficient execution for queries on rare values. These samples are designed to efficiently answer queries with the same QCSs as past queries, and to provide good coverage for future queries over similar QCS.

Based on a query’s error/response time constraints, the sample selection module dynamically picks a sample on which to run the query. It does so by running the query on multiple smaller sub-samples to quickly estimate query selectivity and choosing the best sample to satisfy specified response time and error bounds. It uses an Error-Latency Profile heuristic to efficiently choose the sample that will best satisfy the user-specified error or time bounds.

Strength:
BlinkDB can achieve high-accuracy approximation without detailed knowledge of the workload or queries. Users can trade off query accuracy against response time. It improves on previous approximate databases based on sampling, sketches, and OLA.

Weakness:
BlinkDB is not well suited to databases whose data changes very frequently, since the samples are only refreshed periodically.
In addition, I think it may not be ideal to make the user specify both the response time and the accuracy for a query. Since response time and accuracy are not linearly related, it would be better to let the user specify only the minimum acceptable accuracy and the maximum acceptable response time, and let the database make the trade-off and generate the "overall optimal" query plan.


Review 8

Problem/Summary:

Big data analysis has become an important issue as the size of datasets continues to increase. Processing queries on these datasets can take a lot of time, yet often a precise answer is not needed. Therefore, a query can run on a sample of the database instead of all the data, which vastly increases performance. BlinkDB is one database that uses this technique.

A main idea in BlinkDB is multi-column stratified samples, which perform better than random samples. The problem with random samples is that for a given column C that has only a few rows with value A, the random sample may contain only a few of these rows, or may exclude the group entirely due to sampling variance. This leads to high error when a query targets a minority group. A stratified sample biases its sampling toward these smaller groups by determining a quota for each one.
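The difference can be seen in a small simulation (illustrative only; the group sizes and quota below are made up):

```python
import random

random.seed(0)
# 10,000 majority rows and only 20 minority rows.
data = ["majority"] * 10_000 + ["minority"] * 20

# Uniform random sample of 500 rows: the expected minority count is
# 500 * 20/10_020, i.e. about 1 row, and the group is often missed entirely.
uniform = random.sample(data, 500)
uniform_minority = uniform.count("minority")

# Stratified sample with a quota of 10 rows per group always
# represents the minority group, by construction.
groups = {"majority": [r for r in data if r == "majority"],
          "minority": [r for r in data if r == "minority"]}
stratified = []
for members in groups.values():
    stratified.extend(random.sample(members, min(10, len(members))))
strat_minority = stratified.count("minority")  # always 10
```

The quota turns the minority group's representation from a random variable (often zero) into a guarantee, which is exactly why queries on minority groups have bounded error under stratification.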

A multi-column stratified sample is associated with a set of columns. However, since the power set of columns is exponential, BlinkDB must decide which samples to keep around. This is done using an optimization function that balances the importance of the sample (determined by how many queries in the workload use it), the number of minority groups that can be helped by the stratified sample, and the storage cost of the sample.
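The flavor of this trade-off can be sketched with a greedy benefit-per-byte heuristic. This is a simplification: the paper formulates and solves a formal optimization problem rather than using a greedy rule, and all names and numbers below are invented:

```python
def choose_samples(candidates, budget):
    """Greedy knapsack-style sketch of sample-set selection:
    each candidate QCS sample has a benefit score (workload
    frequency weighted by the minority groups it helps) and a
    storage cost; take the highest benefit-per-byte candidates
    first until the storage budget is exhausted."""
    chosen, used = [], 0
    for qcs, benefit, cost in sorted(candidates,
                                     key=lambda c: c[1] / c[2],
                                     reverse=True):
        if used + cost <= budget:
            chosen.append(qcs)
            used += cost
    return chosen

candidates = [
    # (QCS, benefit = frequency x groups helped, storage cost in GB)
    (("city",), 80.0, 10),
    (("city", "browser"), 60.0, 25),
    (("os",), 30.0, 5),
    (("date", "city"), 20.0, 40),
]
picked = choose_samples(candidates, budget=40)
# ("date", "city") is dropped: poor benefit per GB, and it no longer fits.
```

The sketch captures the three forces the review lists: benefit encodes workload frequency and minority-group coverage, cost encodes storage overhead, and the budget forces a selection among the exponentially many column sets.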

During runtime, BlinkDB can receive a requested error bound or time bound. Based on these parameters, BlinkDB can estimate the size of sample that it should process in order to achieve these bounds. Instead of taking a new sample each time (which would decrease performance), BlinkDB selects one of its available samples (or a portion of a sample), and runs the query on that sample and returns an error estimate.

Strengths:
Even though the paper has some mathematical portions, the authors helpfully explain the intuition behind most of the expressions. The paper also preempts some objections or questions that people might have, and explains how they get around potential problems.

Weaknesses/Open questions:
In class we talked about how sampling can result in very sparse results when running queries with joins; how does BlinkDB deal with this?

Is a 1-to-1 ratio of data to metadata really feasible?



Review 9

This paper presents BlinkDB, a parallel, sampling-based approximate query engine that can support ad hoc queries. The problem is that a query aggregating data, such as taking the average of a column, may take a long time to finish if the amount of data is large. However, an exact answer is not always necessary, and sometimes it is fine to return an approximate answer, especially if low latency is required. It can be difficult, though, to choose a number of samples that provides an accurate representation of the data within the expected response time. BlinkDB allows a query to specify a standard deviation or an expected response time, and BlinkDB will choose the minimum number of samples needed to satisfy these constraints. It is able to do so using a multi-dimensional sampling strategy and a run-time dynamic sample selection strategy. These strategies lead BlinkDB to create samples based on queries that share a QCS and store them for later use. Furthermore, the stratified samples created are based on multiple columns rather than a single column. To minimize the storage overhead of the samples, BlinkDB maximizes a goal function that favors sets of columns which appear often in recent queries or contain subsets that appear often.

BlinkDB seems like a simple idea, but the actual mechanisms confused me. I get that the observation is that when taking aggregates, I am not worried about having the exact answer down to a decimal point, but rather the general idea. The problem I have is understanding the actual methodology that allows BlinkDB to satisfy latency or standard deviation requirements. The paper introduced tons of terms and I got lost having to reread what each term meant, especially during the discussion of the goal function for minimizing storage overhead. Another thing that wasn't clear to me was whether BlinkDB could support an ad hoc query that specifies both a latency and a standard deviation requirement. In my mind this would be hard to do, because increasing accuracy requires increasing latency. One option might be to attempt to support it, but if the system is unable to, return several sets of samples: one that minimizes latency, one that minimizes error, and one that minimizes the "distance" from the original specification.


Review 10

In database query processing, there is a trade-off between query answer accuracy and response time (latency). In particular, in data warehouse environments, approximate but fast query answering is often more desirable than exact but slow query answering.

This paper introduces BlinkDB, an approximate query answering engine that addresses this problem. The distinctive characteristic of BlinkDB is that it is adaptive and dynamic: it samples data from the massive database based on the required quality of the query answer, covering both accuracy and response time. It focuses on aggregate queries.

The important parts of BlinkDB are sample creation and sample selection. The point of sample creation is to decide on the sets of columns on which to build samples; it is treated as an optimization problem in this study. Sample selection chooses an appropriate sample for a query, with the selection criterion based on the query's response time or error constraints.

The contribution of this paper includes:
1. It formulates the problem of approximate query processing.
2. It presents the design and implementation of BlinkDB in detail.
3. It provides experiment results on BlinkDB to show its performance with clear experiment settings descriptions.


Review 11

Motivation for BlinkDB:

Modern applications, like loading ads on Facebook or Twitter, or diagnosing the service provider or geographic location responsible for poor performance, need real-time response rates. BlinkDB, a parallel, approximate query engine, lets users run SQL aggregation queries over large volumes of data with response time or error bound constraints. Trading off accuracy for response time, BlinkDB runs queries over massive data by running them on data samples, and presents the results with meaningful error reports. Queries over multiple terabytes of data can be answered in seconds, with meaningful error bounds relative to the answer obtained by running the query on the full data. BlinkDB stands out because it supports more general queries, making no assumptions about the attribute values in WHERE, GROUP BY, and HAVING clauses, or about the distribution of values used by aggregation functions. BlinkDB does assume the sets of columns used by queries in WHERE, GROUP BY, and HAVING clauses to be stable over time; these sets of columns are called "query column sets" (QCSs).

Details of BlinkDB:

BlinkDB is implemented on Hive/Hadoop and Shark, with little change to the underlying query processing system. Its main ideas are a set of multi-dimensional stratified samples, built and maintained from the original data over time by an adaptive optimization framework, and the dynamic selection of an appropriately sized sample based on a query's accuracy or response time requirements. The first of BlinkDB's two main modules is sample creation, which creates stratified samples, in which rare subgroups are over-represented relative to a uniformly random sample, on the most frequently used QCSs to ensure efficient execution of queries on rare values. Sample creation is an optimization problem: given a collection of past QCSs and their historical frequencies, choose a collection of stratified samples with total storage cost below a user-configurable storage threshold. These samples efficiently answer queries with the same QCS as past queries, and cover future queries with similar QCSs well. For workloads with stable QCSs, this creates samples that are neither over- nor under-specialized. The second main module, sample selection, is driven by a query's error/response time constraints, based on which it dynamically picks a sample to run the query on. The query is run on multiple small sub-samples, stratified across a range of dimensions, to estimate query selectivity and choose the best sample to satisfy the specified response time and error bounds. The Error-Latency Profile heuristic efficiently chooses the sample that will best satisfy the user-specified error or time bounds. While other approaches compute a single sample per table, BlinkDB computes a set of stratified samples with a column-set based optimization framework, which considers the frequency of subgroups in the data, the column sets in past queries, and the storage overhead of each sample.
Error-latency profiles (ELPs) are created at runtime for each query to estimate its error or response time on each available sample; the most appropriate sample meeting the query's response time or accuracy requirements is then selected. In the experimental results, BlinkDB can answer a range of queries on 17 TB of data in 2 seconds with an accuracy of 90-98%, two orders of magnitude faster than Hive/Hadoop.

Strengths of the paper:

I liked that BlinkDB was experimented on with real-world production workloads from Facebook and Microsoft. I also liked the concept of BlinkDB as an approximate query engine, giving users the freedom to set their own response time and error bound constraints, allowing them to choose what to emphasize for each query: accuracy or speed.

Weaknesses of the paper:

I would have liked to see a discussion of how best to choose a balance between response time and error bound, and a quantitative evaluation of how negatively an increased error bound affects the results. I would also have liked to see how the performance of BlinkDB compares to other approximate query engines.



Review 12

The BlinkDB paper takes previous work in approximate computing and presents a database system that allows users to decide how to trade off computing time against accuracy. The paper looks at column sets in past queries and their overhead to determine how to most efficiently produce stratified samples that represent the subgroups used in grouping/filtering, as well as queries without grouping or filtering. Error-latency profiles are used to determine which samples to use to meet the user-specified criteria for response time and accuracy. The authors discuss their implementation and how it can be integrated with other database technologies. Query processing is evaluated with the TPC-H benchmark on an EC2 cluster.

This paper has a few main strengths. Aside from being a free database technology that can be integrated into other systems, this paper presents methods for solving sampling problems that existed in previous work on approximate computing, including papers like “Join synopses for approximate query answering” which do not sample appropriately for filtering or grouping queries. Most notably, the ability for users to specify the desired accuracy or computing time is not a feature that was previously seen in approximate computing technologies. This paper proves the usefulness of these types of queries by evaluating on the TPC-H benchmark. The authors show results that demonstrate BlinkDB’s effectiveness on a variety of queries with diverse time and error constraints.

This paper works well under the assumption that future queries will be similar to past queries, which probably holds for a large group of applications. Facebook, for instance, deciding to show different advertisements depending on the current state of social media, is very appropriate for this type of query processing. However, exploratory data analysis on large sets of business data would not work well with BlinkDB, since it would be difficult to predict the types of queries that will interest the user.



Review 13

Part 1: Overview

This paper presents a parallel, approximate, large-scale database for interactive SQL queries. To achieve better response times, BlinkDB uses an adaptive optimization framework and can dynamically select an appropriately sized sample. BlinkDB is designed for online analytical query workloads where people typically "roll up" web clicks and perform a variety of operations across different dimensions. In the approximate database area, people use either sampling- and sketch-based techniques, which have low algorithmic complexity but require strong assumptions about the workload, or online aggregation, which requires a lot of computation. BlinkDB strives to achieve a better balance between generality and efficiency.

BlinkDB consists of two major modules, sample creation and sample selection. Sample creation is an optimization problem in which cost constraints are applied and samples for QCSs with higher historical frequencies are preferred. Based on the query's error or response time requirements, the sample selection module runs the query on multiple smaller sub-samples; the error-latency profile is used as a heuristic criterion to choose the best sample satisfying the specified response time. Implementing such a query processor is hard: like other sample-based query processors, BlinkDB's sample creation module must tailor its samples to the workload while avoiding overfitting to it.

Part 2: Contributions

For interactive databases, response time is the crucial concern. BlinkDB allows users to trade off query accuracy for response time, and at terabyte data sizes such efficiency is hard to achieve.

Both TPC-H workloads and real-world analytical data are used in the experiments. The results show that BlinkDB can answer queries on up to 17 TB of data in less than 2 seconds within an error of 2-10%, outperforming Hive by being over 200 times faster.

Part 3: Drawbacks

BlinkDB is still sampling based and may suffer high error rates when workloads become atypical. It assumes predictable QCSs, but in practice queries may be hard to predict. BlinkDB also requires a slightly constrained version of SQL-style declarative queries.



Review 14

The paper discusses BlinkDB, an approximate query processing engine with bounded errors and bounded response times. Current interactive data analytics applications involve query processing over very massive sets of data. Such applications require near real-time response rates, which is becoming impossible when considering the full data set. BlinkDB allows fast response times for interactive analytical queries at the expense of some bounded error. To achieve this, it employs two key ideas. The first is an adaptive framework that builds and maintains a set of stratified samples from the original data over time. It chooses the best set of samples by taking into account the column sets in past queries (query column sets), the data distribution, and the storage costs of each sample.

The second idea is that BlinkDB implements a dynamic sample selection strategy that selects an appropriately sized sample based on a query's accuracy or response time requirements. To do this, BlinkDB creates error-latency profiles for each query at runtime to estimate its error or response time on each available sample. This information is then used to select the most appropriate sample to meet the query's response time or accuracy requirements.

The main strength of the paper is that it provides faster response times for interactive queries on very large data sets, a problem becoming important with the exponential growth of data on the web. Furthermore, the fact that it guarantees bounds on both the error and the response time makes it pragmatic from the user's perspective. The main problem with approximate query processing is providing a bound on the amount of error, and I like that BlinkDB addresses this issue in depth.

The main limitation of BlinkDB is that it assumes the sets of columns used by queries in WHERE, GROUP BY, and HAVING clauses to be stable over time. This might not be the case for some kinds of workloads, which limits BlinkDB's applicability. Furthermore, it supports only some aggregate operators: COUNT, AVG, SUM, and QUANTILE. For example, it doesn't support some joins and nested SQL queries, which are a major part of most SQL workloads.



The paper discusses about BlinkDB which is an approximate query processing engine with bounded errors and response time. Current interactive data analytics applications involve query processing over very massive sets of data. Such kind of applications requires near real-time response rates which is becoming impossible if considering all data sets. BlinkDB allows fast response time for interactive analytical queries at the expense of some bounded errors. In order to achieve this, it employs two key ideas. The first idea is that it provides an adaptive framework that builds and maintains a set of stratified samples from original data over time. It chooses the best set of samples by considering the column sets int eh past queries (query column sets), data distribution and storage costs of each sample into consideration.

The second idea is BlinkDB implements a dynamic sample selection strategy that selects an appropriately sized sample based on a query's accuracy or response time requirements. In order to do this BlinkDB creates error-latency profiles for each query at runtime to estimate its error or response timon on each available sample. This information is then used to select the most appropriate sample to meet the query’s response time or accuracy requirements.

The main strength of the paper is that it provides faster response time for interactive queries on very large data sets. This kind of problem is becoming important with the exponential growth of data in the web. Furthermore, the fact that it guaranty on the bound on both the errors and response time make it pragmatic from the user perspective. The main problem with approximate query processing is in providing bound on the amount of the error. I like the fact that BlinkDB address this issue in more depth.

Although I like the fact that approximate query is necessary for dealing with the current problems large data sets, approximate results might not be acceptable for some applications. Furthermore, different techniques including in-memory database are poised to provide faster exact query processing. With this platforms, exact query processing is becoming acceptable even for interactive queries. Consequently, it is important to evaluate different platforms for exact query processing and reevaluate if approximate query processing is still relevant. In addition, BlinkDB assumes the sets of columns used by queries in WHERE, GROUP BY, and HAVING clauses to be stable over time. This might not be the cases depending on the kind of workloads. In this sense, it limits the applicability of BlinkDB for different kind of workloads. This s Furthermore, it supports only some aggregate operators: COUNT, AVG, SUM and QUANTILE. For example, It doesn’t support some joins and nested SQL queries which are also main part of most SQL queries.


The paper discusses BlinkDB, an approximate query processing engine with bounded errors and response times. Current interactive data analytics applications involve query processing over massive data sets. Such applications require near real-time response rates, which becomes impossible if all the data has to be considered. BlinkDB allows fast response times for interactive analytical queries at the expense of some bounded error. To achieve this, it employs two key ideas. The first is an adaptive framework that builds and maintains a set of stratified samples from the original data over time. It chooses the best set of samples by taking into account the column sets in past queries (query column sets), the data distribution, and the storage cost of each sample.

The second idea is that BlinkDB implements a dynamic sample selection strategy that selects an appropriately sized sample based on a query's accuracy or response time requirements. To do this, BlinkDB creates error-latency profiles for each query at runtime to estimate its error or response time on each available sample. This information is then used to select the most appropriate sample to meet the query's response time or accuracy requirements.

The main strength of the paper is that it provides faster response times for interactive queries on very large data sets. This kind of problem is becoming important with the exponential growth of data on the web. Furthermore, the fact that it guarantees bounds on both error and response time makes it pragmatic from the user's perspective. The main problem with approximate query processing is providing a bound on the amount of error, and I like that BlinkDB addresses this issue in depth.

Although I agree that approximate querying is necessary for dealing with current query processing over large data sets, approximate results might not be acceptable for some applications. Furthermore, different techniques, including in-memory databases, are poised to provide faster exact query processing. With these platforms, exact query processing is becoming acceptable even for interactive queries. Consequently, it is important to evaluate different platforms for exact query processing and re-evaluate whether approximate query processing is still relevant. In addition, BlinkDB assumes that the sets of columns used by queries in WHERE, GROUP BY, and HAVING clauses are stable over time. This might not be the case depending on the workload, which limits BlinkDB's applicability to different kinds of workloads. Furthermore, it supports only some aggregate operators: COUNT, AVG, SUM, and QUANTILE. It also does not support some joins and nested SQL queries, which are a major part of most SQL workloads.


Review 15

===OVERVIEW===
BlinkDB is presented in this paper. BlinkDB is a massively parallel, approximate query engine for running interactive SQL queries on large volumes of data. BlinkDB uses two key ideas:
1. an adaptive optimization framework (sample creation)
2. a dynamic sample selection strategy (sample selection)
The authors evaluated the database against TPC-H and a real-world analytic workload from Conviva. According to their experiments, BlinkDB is able to answer queries on up to 17 TB of data in less than 2 seconds, more than 200 times faster than Hive, within an error of 2-10%.

===PROBLEM===
Existing techniques include: 1. sampling, 2. sketches, 3. online aggregation (OLA). The paper uses the following example query to show the advantages of BlinkDB: "SELECT AVG(SessionTime) FROM Sessions WHERE City = 'New York';" First, using sampling techniques on large data sets can provide confidence bounds on the accuracy of the answer, but sampling, like sketches, makes strong assumptions about the query workload. Second, even though OLA makes fewer assumptions, its performance varies widely and is usually poor on rare tuples.
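The example query lends itself to a small illustration. Below is a minimal sketch, on synthetic data, of how a uniform sample yields both an estimate and a confidence bound for AVG; the closed-form normal-approximation interval is the standard sampling-theory result, not BlinkDB's actual code.

```python
import random
import statistics

# Hypothetical data: session times with mean ~40 (synthetic, for illustration).
random.seed(0)
population = [random.expovariate(1 / 40.0) for _ in range(100_000)]

# Uniform sample of 1% of the rows.
sample = random.sample(population, 1_000)

n = len(sample)
mean = statistics.fmean(sample)
stdev = statistics.stdev(sample)

# Closed-form 95% confidence half-width for the sampled AVG
# (normal approximation, z = 1.96), as in sampling-based AQP.
half_width = 1.96 * stdev / n ** 0.5

true_mean = statistics.fmean(population)
print(f"estimate = {mean:.2f} +/- {half_width:.2f} (true mean {true_mean:.2f})")
```

Reading 1% of the rows shrinks I/O by ~100x while the interval still brackets the true mean, which is exactly the speed/accuracy trade these systems exploit.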

===STRENGTH===
BlinkDB makes no assumptions about the attribute values in the WHERE, GROUP BY, and HAVING clauses, or about the distribution of the values used by aggregation functions. It only assumes that the sets of columns used are stable over time. The two main modules are used to 1. ensure efficient execution for queries on rare values and 2. provide suitable samples for the most frequently used query column sets.
It dynamically decides which samples to create and how large each sample should be to satisfy response requirements and storage constraints, which is a large improvement over previous methods of approximation. These notions of adaptive optimization and dynamic sample selection are the key contributions of the paper.
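The stratified-sampling idea, capping how many rows each group contributes so rare groups stay fully represented, can be sketched as below. The data, QCS, and cap are hypothetical; the paper's actual algorithm additionally optimizes across QCSs under a storage budget.

```python
import random
from collections import defaultdict

def stratified_sample(rows, qcs, cap):
    """Keep at most `cap` rows per distinct value of the QCS columns, so
    rare groups are fully represented while common groups are capped."""
    by_group = defaultdict(list)
    for row in rows:
        by_group[tuple(row[c] for c in qcs)].append(row)
    sample = []
    for group_rows in by_group.values():
        if len(group_rows) <= cap:
            sample.extend(group_rows)        # rare group: keep everything
        else:
            sample.extend(random.sample(group_rows, cap))  # common group: cap
    return sample

# Synthetic table: "NYC" dominates; "Galena" and "Dunmore" are rare.
random.seed(1)
rows = [{"city": random.choice(["NYC"] * 98 + ["Galena", "Dunmore"]),
         "time": random.random()} for _ in range(10_000)]
sample = stratified_sample(rows, qcs=("city",), cap=500)
```

A query filtering on a rare city now sees all of that city's rows in the sample, instead of the handful a uniform sample would retain.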

===WEAKNESS===
I really like this paper: the purpose is well stated, and all the former methods are presented with reasoning and comparisons. More real-world examples and test cases would improve the quality of this paper.


Review 16

This paper describes BlinkDB, a database capable of computing approximate queries on large sets of data, assuming the workload has only predictable query column sets (QCSs). The key difference between BlinkDB and previous work on approximate databases is that it allows the query writer to give a desired bound on either error rate or response time. By providing this choice, BlinkDB gives the query writer freedom and confidence when doing approximate computing.
This database is built on top of Hive/Hadoop and has mainly two components: an offline sample creation module and an online sample selection module. The sample creation module takes a QCS as input and takes samples based on value group counts. Uniform sampling is not used because it does not work well for queries that filter the table. Then, during runtime, BlinkDB extrapolates response time and relative error to construct an Error Latency Profile (ELP) of the executing query. The ELP enables fast query plan evaluation and suggests the right sample and size. In this way, providing a bounded response time or error rate is possible.
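The ELP extrapolation can be sketched roughly as follows, assuming response time scales linearly with rows scanned and error shrinks as 1/sqrt(n); the constants, sample sizes, and function names here are made up for illustration, not taken from the paper.

```python
def error_latency_profile(sample_sizes, pilot_size, pilot_time, pilot_stdev):
    """Extrapolate response time (assumed linear in rows scanned) and
    error (assumed ~ 1/sqrt(n)) from one pilot run on a small subsample."""
    profiles = []
    for n in sample_sizes:
        est_time = pilot_time * n / pilot_size
        est_error = pilot_stdev / n ** 0.5
        profiles.append((n, est_time, est_error))
    return profiles

def pick_sample(profiles, max_time=None, max_error=None):
    """Smallest sample meeting an error bound, or the largest fitting a
    time bound; None if no sample qualifies."""
    if max_error is not None:
        ok = [p for p in profiles if p[2] <= max_error]
        return min(ok, key=lambda p: p[0]) if ok else None
    ok = [p for p in profiles if p[1] <= max_time]
    return max(ok, key=lambda p: p[0]) if ok else None

# Pilot run: 1,000 rows scanned in 0.02 s with sample stdev 50 (invented numbers).
profiles = error_latency_profile([10_000, 100_000, 1_000_000],
                                 pilot_size=1_000, pilot_time=0.02,
                                 pilot_stdev=50.0)
print(pick_sample(profiles, max_error=0.1))  # smallest n meeting the error bound
print(pick_sample(profiles, max_time=1.0))   # largest n scanned within the deadline
```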

Contribution:
As this paper summarizes, it has the following contributions:
1) It extracts frequently used column sets from past queries,
2) It creates error-latency profiles to estimate error rate and response time, and
3) It integrates this work into an existing framework with minimal change.
I think what makes this paper interesting and important is the second point. Enabling the tradeoff between time and error rate makes approximate querying more attractive, because people can make adjustments based on it.

Weakness:
One question arose while I was reading about the offline sample creation and runtime sample selection parts. Rows in a database may change quickly, so an important question is how often the samples need to be re-created.



Review 17

BlinkDB is a parallel approximate query engine for running SQL queries for data analytics on huge data sets. Compared to typical approximate DBs, BlinkDB uses the model of query column sets that are predictable over time. A query column set is a set of columns that typically appear in the WHERE, GROUP BY, and HAVING clauses of SQL queries and is assumed to be stable over time. Another distinctive quality of BlinkDB is the use of ELPs (error-latency profiles) to decide which sample to use to run the query. Since the user can specify the acceptable error percentage, the level of confidence, and the time within which they expect an answer, BlinkDB uses these parameters to drive its stratified sampling strategy.

One of the major advantages that came through was their use of stratified sample sets. Unlike their approximate-DB counterparts, by using stratified sample sets rather than uniformly distributed ones, they save a lot of time by avoiding complete table reads even for rare column sets. They have also shown how their column sets generalize well to the workloads over a period of time. At run time, they check whether the columns specified in the predicate are a subset of any of the already-built sample sets present in memory; if so, they run the query over that sample. If no such sample set is found, the query is run in parallel on in-memory subsets of all samples currently present. They also run a background task that periodically updates the sample sets, both uniform and stratified. This ensures that the user does not get stale results.
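The run-time subset check described above can be sketched in a few lines. The sample catalog here is hypothetical, and BlinkDB's real selection additionally weighs sample sizes and ELPs; this only shows the QCS-coverage test.

```python
def choose_sample(query_columns, available_samples):
    """Return a stratified sample whose QCS covers the query's columns,
    preferring the smallest covering QCS. None means: fall back to running
    on in-memory subsets of all samples."""
    covering = [qcs for qcs in available_samples if query_columns <= qcs]
    return min(covering, key=len) if covering else None

# Hypothetical catalog of pre-built sample QCSs.
samples = [frozenset({"city"}), frozenset({"city", "browser"}),
           frozenset({"date", "os"})]
print(choose_sample({"city"}, samples))             # covered by {city}
print(choose_sample({"city", "browser"}, samples))  # covered by {city, browser}
print(choose_sample({"country"}, samples))          # None -> fall back
```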

They have specified that they do not provide all possible join or nested query capabilities because such queries can be easily flattened. However, since they extended the HiveQL parser to enable the sample selection module, I think they could have easily added a component for flattening these queries as well, so the user would not have to simplify the query before executing it. One more thing that was not so clear is whether all possible columns are essentially represented in the form of samples. This DB also operates under the assumption that the query column sets are predictable. If a particular column or column subset is not present, or the user types an atypical query, it would involve a complete table read anyway, which reduces the effectiveness of this DB. Considering that the stratified sample sets take anywhere between 5 and 30 minutes to update, it might be too expensive to update them on a daily basis in the case of a dynamically changing workload (e.g., a stock market system).

Overall, BlinkDB definitely uses some of the best mechanisms to ensure accurate results as per the user's requirements.



Review 18

Review of ‘BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data’
More and more data analytics applications demand responsive aggregation results over large amounts of stored data. However, the default calculation method adopted by most traditional DBMSs is just a linear scan of all the records retrieved by the query, and newer database operator models like the data cube can only support distributive and algebraic functions. Increasingly, most interactive applications require near real-time response rates, for example, adjusting the network bandwidth according to the average packet round-trip time of all users in a certain area. In most cases the applications can tolerate a certain amount of error, but they require a fast response.

Unlike existing unstable sampling and sketch techniques, which make strong assumptions about future queries, BlinkDB as a new solution supports more general queries and does not rely on predicting the attribute values used in aggregation functions. However, as a tradeoff, BlinkDB requires tolerance for a certain level of inaccuracy and cares more about the response time over a massive number of data records. It contains two main modules:

a. Sample Creation
The sample creation module is an adaptive optimization framework that builds and maintains a set of multi-dimensional stratified samples. It is in charge of creating stratified samples on the most frequently used QCSs (query column sets) to ensure efficient execution for queries on rare values.
b. Sample Selection.
The sample selection module uses a dynamic sample selection strategy which selects an appropriately sized sample according to accuracy or response time requirements.

BlinkDB also extends the Hive framework by adding an offline sampling component that creates and maintains samples, and a runtime sample selection component that creates an Error-Latency Profile (ELP) for queries.

To sum up, this paper has strengths in both theoretical improvement and implementation achievement. BlinkDB provides a flexible mechanism to balance response time and error rate, which will play an important role in future interactive big-data systems. And by providing the offline sampling and runtime sample selection modules for the open-source project Hive, the authors prove the merits of the approximation strategies proposed in this paper.

Nevertheless, there are still some concerns about this paper:

1. Slow ramp-up for new query patterns: BlinkDB samples are only built over frequently accessed columns, so with constantly changing query patterns (over a wide range of columns), the system latency may still be considerably large.

2. Support for only a few aggregation operations: in BlinkDB, the error bounds and response time bounds are only guaranteed for certain aggregation operations like AVG() and COUNT(); for others, it may fail. For example, calculating the minimal value of one column in a joined table that contains hundreds of millions of records.

3. Failing high-accuracy jobs: in order to provide near real-time interactive queries on massive data, the system runs queries on sampled data and provides results within a certain confidence interval. But for applications that require high accuracy, BlinkDB cannot guarantee low latency, at least not near real-time.
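The second concern can be illustrated concretely: an average estimated from a sample is unbiased and admits a closed-form bound, while MIN (or MAX) from a sample is systematically biased, since the extreme row is almost never sampled. A small synthetic demonstration (toy data, not from the paper):

```python
import random
import statistics

random.seed(2)
population = [random.random() for _ in range(100_000)]
sample = random.sample(population, 100)  # 0.1% uniform sample

# AVG on a sample is unbiased: the sample mean lands close to the true mean.
print(statistics.fmean(sample), statistics.fmean(population))

# MIN has no such guarantee: the true minimum is almost never in a small
# sample, so the sample MIN can only overestimate it (never undershoot).
print(min(sample), min(population))
```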






Review 19

In this paper, the authors present BlinkDB, a massively parallel, approximate query engine for running interactive SQL queries on large volumes of data. Users can specify the accuracy of the query result and the response time, which enables interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars.

There are two key ideas in BlinkDB: sample creation and sample selection.
The sample creation module is based on historical frequencies and past QCSs; it chooses a set of stratified samples whose total storage cost is below some user-configurable storage threshold.
The sample selection module selects the sample for processing the query. It uses an ELP heuristic to efficiently choose the sample that will best satisfy the user-specified error or time bounds.

Advantage:
When data is extremely large, querying takes a long time to get a response because of disk access. An approximate database like BlinkDB is good for fast responses. It also lets the user trade off query accuracy against response time, and it is better than sampling- and sketch-based databases and OLA.

Disadvantage:
(1) BlinkDB is very useful when high accuracy is not demanded, but it may be worse than a traditional database when the user specifies a very high accuracy requirement.
(2) As the data sampling is offline, it is not suitable when the data changes very frequently.
(3) Because BlinkDB uses the previous workload to predict future queries and periodically re-checks and changes the set of samples, it is robust to variation in the query workload. But is the workload information from before the workload changed really not useful? Would it be better to have a strategy like a "forgetting factor" that also takes the older workload history into account with descending weight, instead of totally forgetting it?
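The "forgetting factor" suggested in point (3) could look roughly like the following: a hypothetical exponentially weighted scheme for QCS frequencies, sketched here for illustration and not anything from the paper.

```python
def update_qcs_weights(weights, observed_qcs, alpha=0.9):
    """Exponentially decay all historical QCS weights by `alpha`, then add
    one unit of weight per QCS seen in the latest workload window. Old
    patterns fade gradually instead of being dropped outright."""
    decayed = {qcs: w * alpha for qcs, w in weights.items()}
    for qcs in observed_qcs:
        decayed[qcs] = decayed.get(qcs, 0.0) + 1.0
    return decayed

w = {}
w = update_qcs_weights(w, [frozenset({"city"}), frozenset({"city"})])  # window 1
w = update_qcs_weights(w, [frozenset({"os"})])                         # window 2
# {city}: 2 * 0.9 = 1.8 still outweighs {os}: 1.0, but decays if unused.
print(w)
```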




Review 20

This paper presented the BlinkDB query engine for providing approximate results to interactive SQL queries. Systems like Hive provide users with a SQL interface for running queries on very large datasets. BlinkDB takes this work one step further by using approximation to provide users with an interactive interface. BlinkDB is able to provide results either by a latency deadline or to meet an error target.

BlinkDB identifies which column sets are used in queries most frequently and samples data for these column sets offline to speed up computation. By using two different sampling techniques, BlinkDB is able to take advantage of the speed of uniform sampling when it is appropriate and use more advanced techniques when it is insufficient.

By layering BlinkDB on top of Hive, the authors demonstrated that this work benefits a system that is in commercial use. BlinkDB also makes very few assumptions about the datasets and is therefore generally useful. However, because BlinkDB adds new keywords to the SQL language, adoption will be difficult. Additionally, Figure 10(c) (which shows query latency as a function of cluster size) appears to be flat for some of the configurations. This suggests that the system is being underutilized and it would be nice to see results where a larger load is placed on the system.


Review 21

This paper presents an approximate database system called BlinkDB that is a massively parallel system for running approximate queries on large volumes of data. A lot of the database research thus far has been on exact query responses, but as databases start to get bigger and we start to see queries that do not need exact answers, approximate databases are able to solve many of the current problems. Furthermore, approximate queries will also allow faster processing of large volumes of data.

What makes BlinkDB different from other approximate query databases are the following contributions:

1. It uses a column-based optimization framework instead of a table-based one, which allows it to take into account the frequency of rare subgroups, the column sets in past queries, and the storage overhead of each sample.
2. It uses error bounds and latency bounds for each query to select the best samples to use on the current query.
3. It is robust to many different types of workloads so that it can support general queries and make no assumptions about the workload.

Here are some positives that I see with BlinkDB and the paper:

1. The concepts of BlinkDB can be easily integrated with any existing parallel database. For example, we can develop a similar framework on top of Hive very easily.
2. The system allows users to set the error and confidence bounds or time bounds for any query. This gives the programmer more control over the time, accuracy tradeoff.
3. The authors gave detailed analyses of many parts of the system including storage overhead, optimization and bias correction.

There are also several weaknesses with the paper:

1. The majority of the paper was very theoretical, with proofs and in-depth explanations of the math behind the system. I would have liked to see some examples of the system running approximate queries on some sample data.



Review 22

This paper discusses BlinkDB, a parallel approximate query engine. One of the main focuses of BlinkDB is providing low-latency responses to queries over massive amounts of data. Many aggregate queries such as averages do not need to be exactly accurate. When some degree of error is permissible in query responses, response time becomes much more valuable than accuracy. This is especially true of the large distributed databases that BlinkDB is designed for, which often have many terabytes of data distributed across many networked nodes. BlinkDB allows users to specify a required response time or error bound so that the user can tune the balance of accuracy and speed desired in the query.

BlinkDB creates and maintains stratified samples of data that are designed to be representative of the distribution of values throughout the full database. The authors of this paper recognized that it is easy to "overfit" stratified samples to past workloads, creating samples that are not well adapted to ad hoc queries or workloads that vary even mildly from the queries used to create the samples. Therefore, BlinkDB uses a Predictable QCS model, in which it is assumed that the frequency of sets of columns used for grouping and filtering in queries will not change over time, but the values that they group and filter on will change. This model is flexible enough that it allows for efficient precomputation of samples and creates samples that are well suited to a broader range of queries than are supported by other models.
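Mining QCS frequencies from a query log, the input to this Predictable QCS model, can be sketched crudely as follows. The regex-based extraction is illustrative only; a real system would walk the SQL parser's AST.

```python
import re
from collections import Counter

def query_column_set(sql):
    """Rough QCS extraction: columns compared in WHERE plus GROUP BY
    columns. Illustrative only; not robust to arbitrary SQL."""
    cols = set(re.findall(r"(\w+)\s*[=<>]", sql))          # filter columns
    group = re.search(r"GROUP BY\s+([\w,\s]+)", sql, re.I)  # grouping columns
    if group:
        cols |= {c.strip() for c in group.group(1).split(",") if c.strip()}
    return frozenset(cols)

log = [
    "SELECT AVG(SessionTime) FROM Sessions WHERE City = 'New York'",
    "SELECT COUNT(*) FROM Sessions WHERE City = 'Boston' GROUP BY Browser",
    "SELECT SUM(Bytes) FROM Sessions WHERE City = 'Chicago'",
]
freq = Counter(query_column_set(q) for q in log)
print(freq.most_common())  # {City} appears twice, {City, Browser} once
```

The resulting frequencies are what the offline module would weigh against storage cost when deciding which stratified samples to build.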

BlinkDB stores many sets of precomputed samples that range in size and range of data. Keeping so many samples incurs a fairly high storage cost (the authors tested with sample storage ranging from 50% to 200% of the total database size), but doing so allows BlinkDB the flexibility to choose the most appropriate sample at runtime based on the user's speed and accuracy criteria. The authors showed that BlinkDB was able to answer queries with between 90 and 98% accuracy in almost all cases and could do so in a fraction of the time required to answer queries exactly on frameworks such as Hive with/without Spark.

My chief concern with this paper was that BlinkDB was not tested against existing approximate databases. While it was very impressive to see the performance increases it demonstrated over exact databases such as Hive, I would have liked to see a comparison with a more similar product. Additionally, I would have liked to see more discussion of how BlinkDB performs when space is more limited. All of the test cases assumed that the space allocated for samples was at least 50% of the total database size. Given that this system is designed for massive databases that store terabytes of data, such a large storage cost is a potentially expensive problem.



Review 23

This paper introduces BlinkDB, a massively parallel, approximate query engine for running SQL queries on large volumes of data. For queries on large data, users have to trade off query accuracy for response time. BlinkDB outperforms previous approaches to approximate querying, including sampling and online aggregation. Sampling makes too many assumptions about the data, while online aggregation makes fewer assumptions at the expense of highly variable performance. Therefore, this paper presents BlinkDB and implements it on several distributed database systems.

First, the paper talks about the architecture of BlinkDB. BlinkDB consists of an offline sampling module that creates samples, and a run-time sample selection module that creates an Error-Latency Profile (ELP) for queries. BlinkDB supports a slightly constrained set of SQL-style queries. For sample creation, BlinkDB creates a set of samples to accurately and quickly answer queries; it creates stratified samples on a given set of columns. The run-time sample selection module, on the other hand, tries to select one or more samples at run time that meet the specified time or error constraints.

Second, the paper talks about the implementation and results of BlinkDB. BlinkDB is built on top of the Hive Query Engine, supports both Hadoop MapReduce and Spark. The implementation adds a shim layer to the HiveQL parser to handle the BlinkDB Query Interface, which enables queries with response time and error bounds. As a result, BlinkDB is able to answer a range of queries within 2 seconds on 17 TB of data with 90-98% accuracy.

The strength of the paper is that it provides a complete description of BlinkDB in detail, including the motivation, architecture, implementation, and evaluation. The paper has strong motivation, giving a real-life example of using approximate queries and showing how BlinkDB solves the problems of previous approaches.

The weakness of the paper is that it does not provide many examples when introducing the components of BlinkDB. I think it would be better to provide some simple queries and datasets when talking about sample creation and run-time sample selection.

To sum up, this paper introduces BlinkDB, a parallel, sampling-based approximate query engine that provides support for queries with error and response time constraints.



Review 24

This paper is an introduction to BlinkDB, an approximate query processing engine that sacrifices accuracy for speed. It is capable of performing over 200x faster than Hive while giving you an error bar of 2-10%. This is very useful if you don't need to know exact answers and just want something close and fast.

BlinkDB is built up of two main parts: sample creation and sample selection. The sample creation part creates stratified samples on the most frequently queried columns. What this means is that rare subgroups in columns are overrepresented relative to what they would be in a random sample (by essentially capping common values from exceeding a given count within that sample). The sample selection process depends on the query's error or response time constraints (the two things you can constrain a query on in BlinkDB).

BlinkDB operates under the assumption (that they have backed up with evidence) that queries can be predictable, which allows them to trade off flexibility in the system for higher efficiency.

The syntax of BlinkDB has two main differences/additions from SQL. It has an "ERROR WITHIN x% AT CONFIDENCE y%" clause for a query, as well as a "WITHIN x SECONDS" clause. These are how the user specifies what constraints they want on the query. I think this is a very practical and clean addition to SQL and really liked this part.
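A toy parser for these two clauses gives a feel for the interface. The clause shapes follow the quoted examples; the exact grammar, and the function name here, are assumptions for illustration.

```python
import re

def parse_bounds(query):
    """Pull BlinkDB-style bound clauses off the end of a query string
    (hypothetical helper; clause shapes assumed from the quoted syntax)."""
    err = re.search(
        r"ERROR\s+WITHIN\s+([\d.]+)%\s+AT\s+CONFIDENCE\s+([\d.]+)%", query, re.I)
    time = re.search(r"WITHIN\s+([\d.]+)\s+SECONDS", query, re.I)
    return {
        "max_error": float(err.group(1)) / 100 if err else None,
        "confidence": float(err.group(2)) / 100 if err else None,
        "max_seconds": float(time.group(1)) if time else None,
    }

q1 = ("SELECT AVG(SessionTime) FROM Sessions WHERE City = 'New York' "
      "ERROR WITHIN 10% AT CONFIDENCE 95%")
q2 = ("SELECT AVG(SessionTime) FROM Sessions WHERE City = 'New York' "
      "WITHIN 2 SECONDS")
print(parse_bounds(q1))  # error-bounded form
print(parse_bounds(q2))  # time-bounded form
```

Either bound, once extracted, is exactly what the runtime feeds into its sample selection.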

BlinkDB is built on top of Hive, which shows that it is practical in industry, as it works well with the industry-standard engines. They added components for offline sample creation and online error-latency profiles for querying. There is an excellent diagram in the paper (Figure 7) that shows how BlinkDB is built on top of Hive and what features it utilizes. This was one of the better diagrams in the paper in my opinion. In summary: BlinkDB query interface -> Hive Query Engine -> (Uncertainty Propagation & Sample Selection) -> (Hadoop, Shark, BlinkDB) -> Sample Creation and Maintenance -> Hadoop.

One weakness I had with this paper was that the charts were hard to understand. The axes on the charts did not make sense, so it was hard to tell what was being portrayed. Especially for the part where they were trying to justify their assumption of predictable queries: I believe them, but the charts made no sense and provided no supporting evidence to me.

Overall I think this was a very strong paper. I think there is a definite use for this in industry, as more and more queries aren't looking for exact numbers but rather numbers in the general area to make their point. This paper has shown that there is a solution capable of providing extremely high performance. It was also honest with its results, showing things like its error rate relative to the error rate the user requested for the query (meta). I think this paper was definitely worth the read, and there is a definite use for it in industry in my opinion.



Review 25


Spark/MapReduce are excellent systems for aggregating commutative/distributive results over data spread over very large distributed systems. However, these systems are not all-inclusive regarding the types of queries they can cover with some degree of efficiency. BlinkDB aimed to tackle the problem of data retrieval over massive datasets through approximation; how can a SQL-like interface be provided to query massive datasets without necessarily having to make all of the promises that relational databases make (i.e. approximation is okay!)? These kinds of error tradeoffs are quite useful in high-latency approximate systems such as recommendation engines or help diagnostic systems.

BlinkDB is a data warehousing system built on top of Shark/SparkSQL, meant to provide fast approximate answers to SQL-like ad-hoc queries. By sampling small subsets of the data, it gains performance while bounding error at the same time. Error and confidence are also provided as metadata alongside query results, with the opportunity to tune parameters such as sample size to fit the needs of the DBA. Part of the optimality of the sampling comes from constructing sample sets of differing granularity offline from native tables and materialized views, based on query history and workload composition. The authors investigated techniques such as bootstrap/jackknife for estimating error on arbitrary user-defined functions. In addition, rather than sampling data uniformly at random, they gain another degree of accuracy by using stratified sampling procedures (e.g., capping the representation of common values so that rare values, those appearing fewer than K times, are still covered). Choosing which samples to build, however, is a complicated problem: how does one know which pieces of data are best? They start by building “query templates” based on past data, picking the columns best suited for these templates, and avoiding overfitting by weighting columns that can answer queries that are less common in the query history.
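The cap-based stratified sampling described above can be sketched in a few lines: each group (stratum) keeps at most K rows, and each kept row carries the inverse of its sampling fraction as a weight, so aggregates over the sample can be reweighted to unbiased estimates. This is an illustrative sketch, not BlinkDB's actual implementation; all names are made up.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, cap, seed=0):
    """Keep at most `cap` rows per distinct value of `key`;
    attach an inverse-sampling-fraction weight to each kept row."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for value, members in groups.items():
        # Rare groups (<= cap rows) are kept whole; common groups are capped.
        chosen = members if len(members) <= cap else rng.sample(members, cap)
        weight = len(members) / len(chosen)  # inverse sampling fraction
        for row in chosen:
            sample.append((row, weight))
    return sample

def weighted_sum(sample, col):
    """Unbiased (Horvitz-Thompson style) estimate of SUM(col) from the sample."""
    return sum(row[col] * w for row, w in sample)
```

Because small groups are kept whole, queries about rare strata (the "sales in small towns" case) still see all their rows, which is exactly the advantage over uniform sampling.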

One thing that would have made this paper more thorough is an explanation of failure modes (i.e., what kinds of workloads would this be bad for?). One candidate is high-entropy data, since the stratified sampling and analysis methods seem best suited for data that roughly follows specific distributions. Similarly, a description of the query workloads would give more insight into the upsides of BlinkDB. Aside from TPC-H, there is little reasoning about the specific composition of the workload query history, especially since BlinkDB's supported approximations are limited to COUNT, AVG, SUM, and QUANTILE. Lastly, it would be really interesting to see how well the system performs long-term, as it IS supposed to be based on historical query data. Perhaps the system behaves well on some short-term period of query information, but becomes less useful on constantly changing workloads.
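The bootstrap error estimation mentioned earlier in this review (for arbitrary user-defined aggregates with no closed-form error formula) can be sketched as follows: resample the sample with replacement, recompute the aggregate on each resample, and report the spread of the replicates as the error estimate. A minimal sketch with illustrative names, not BlinkDB's code.

```python
import random
import statistics

def bootstrap_error(sample, aggregate, iterations=200, seed=0):
    """Estimate the standard error of `aggregate(sample)` by resampling
    the sample with replacement and measuring the spread of the results."""
    rng = random.Random(seed)
    n = len(sample)
    replicates = []
    for _ in range(iterations):
        # Draw n items with replacement and recompute the aggregate.
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        replicates.append(aggregate(resample))
    return statistics.stdev(replicates)
```

The appeal is generality: `aggregate` can be any black-box user-defined function, which is why the paper considers bootstrap/jackknife alongside closed-form error bounds.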


Review 26

This paper describes BlinkDB, a distributed sampling-based approximate query processing system for analytics workloads on large volumes of data. It is based on two key ideas: (1) a multi-dimensional sampling strategy that builds and maintains a variety of samples, and (2) a run-time dynamic sample selection strategy that selects an appropriately sized sample based on a query’s accuracy or response time requirement. The motivation behind the paper is that more and more applications nowadays demand near real-time response rates over large data collections, making room for approximate queries, which allow fast processing of the data by trading result accuracy for response time and space. Many techniques and approaches with different tradeoffs have been proposed, but they do not seem to meet today’s requirements for analytics databases.

The paper starts by describing four possible models in a workload taxonomy: predictable queries, predictable query predicates, predictable query column sets (QCSs), and unpredictable queries. Among these four, a study of query patterns in a production cluster shows that QCSs are relatively stable over time, which suggests that past history is a good predictor of the workload; BlinkDB assumes predictable QCSs. The next section discusses the BlinkDB architecture, which extends the Apache Hive framework by adding an offline sampling module that creates and maintains samples over time, and a run-time sample selection module that creates an error latency profile (ELP) for each query. BlinkDB can be used with ordinary SQL queries augmented with a specified error bound or response time. The following sections elaborate on sample creation (stratified samples and the optimization framework), the BlinkDB runtime (selecting a sample, selecting a sample size, bias correction), and the BlinkDB implementation. The performance evaluation uses Conviva Inc.’s workload; the paper compares BlinkDB to query execution on full-sized datasets, evaluates its accuracy and convergence properties, evaluates the effectiveness of its cost models and error projections, and demonstrates BlinkDB’s ability to scale gracefully with increasing cluster size. Lastly, the paper covers related work on sampling approaches, online aggregation, and materialized views, data cubes, wavelets, synopses, sketches, and histograms.

The contributions of this paper are: (1) demonstrating the use of a column-based optimization framework to compute a set of stratified samples, (2) using the ELP to select the most appropriate sample that meets the response time/accuracy requirement, and (3) showing how to integrate BlinkDB into an existing parallel query processing framework (Hive) with minimal changes. I think the last item is quite significant: since an approximate query processing system is only used for analyzing existing data, ease of integration is a key aspect in choosing one.

However, the paper does not mention whether it is feasible to make BlinkDB work with other kinds of DBMSs. As I mentioned before, approximate query applications rely on the existence of previous data, which is usually generated and stored in a fixed DBMS. For example, I would like to see how it would work with an existing commercial relational DBMS like Oracle (whether it is possible and what the challenges are). Another question: how would BlinkDB handle data updates? How would it create and update its samples? I assume the paper targets OLAP workloads, but even in OLAP, data can change.



Review 27

Summary:
This paper introduces BlinkDB, an approximate DB built for massive-scale analytical problems. The main feature of this system is that it can retrieve results in a bounded time with a bounded error. Nowadays, large-scale analytical data processing workloads are becoming more and more important in the database area. Although parallel database techniques and large-scale parallel data processing models like MapReduce are becoming common, a single query processing several TB can still take hours to finish. To boost the performance of large-scale data processing, BlinkDB, instead of returning exact results, chooses to give approximate results with bounded error in bounded time. Moreover, in contrast to most of its predecessors, which make prior assumptions about the data or the aggregation type in order to provide approximate results, BlinkDB utilizes sampling to provide approximation with no prerequisites on the data. Also, since BlinkDB is built on top of Hive, it is highly parallel and suits multiple backend computing engines like Spark and Hadoop. In its experiments, BlinkDB returns results within 2 seconds, with 2%–10% error, for queries whose exact plans would need to scan 17 TB of data.
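The headline numbers above rest on a basic property of random sampling: for an average, the relative standard error is roughly s/(mean·√n), which depends on the sample size n but not on the size of the full dataset, so a fixed-size sample gives about the same error on 17 TB as on 17 GB. A minimal illustration (illustrative numbers, not from the paper):

```python
import math

def relative_standard_error(stddev, mean, n):
    """Relative standard error of a sample mean: (s / sqrt(n)) / mean."""
    return (stddev / math.sqrt(n)) / mean

# With per-row stddev equal to the mean (coefficient of variation = 1),
# a 1,000-row sample already gives ~3.2% expected error, and a
# 10,000-row sample ~1% -- regardless of how many rows the table holds.
for n in (1_000, 10_000):
    print(n, relative_standard_error(1.0, 1.0, n))
```

This is why a sample that fits in memory can answer within seconds while keeping the error in the low single-digit percent range.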

Strengths:
1. This paper introduces BlinkDB, an approximate DB which can retrieve results in bounded time with bounded error. The experimental results show impressive performance gains.
2. BlinkDB is built on top of Hive, which makes it highly parallel and compatible with a set of backend computing engines. This also makes it very easy to deploy for companies that already use Hive as their data warehousing technology.
3. One contribution of BlinkDB is that it innovatively utilizes sampling to provide approximate results, with no assumptions about the existing data. This makes BlinkDB more general across a variety of tasks.
Weaknesses:
1. BlinkDB currently doesn’t support complex joins, in order to meet its memory constraints. It also doesn’t support complex nested queries.
2. This paper presents a nice and impressive comparison of BlinkDB with a non-approximate DB. It would be more interesting if the paper gave more experiments comparing BlinkDB against existing approximate DB implementations.




Review 28

This paper presented BlinkDB, an approximate query engine for ad-hoc queries on large-scale data analysis systems. The main assumption of BlinkDB is that a typical industrial workload of aggregation queries is predictable, because the query column sets are stable; case studies on Conviva and Facebook support this claim. Based on this assumption, BlinkDB applies a multi-dimensional sampling strategy to build and maintain a variety of column-based samples. Prior research pointed out that when values in a particular range are rare, a uniform sample may not contain any members of the range at all, so uniform sampling can lead to very inaccurate results in such scenarios. Instead of uniform sampling, BlinkDB uses stratified sampling, which ensures rare values are sufficiently represented. To utilize the stratified samples, BlinkDB implements a run-time dynamic sample selection strategy. The BlinkDB runtime first generates a set of different query plans for the query, based on different samples. It then executes the query on subsamples of those samples and gathers statistics about the query’s selectivity. Lastly, it constructs an Error Latency Profile (ELP) for the query and evaluates the candidate plans against it. This strategy selects an appropriately sized sample to meet the query’s accuracy or response time constraint.
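The run-time strategy just described can be approximated with a simple model: measure error and latency on a small pilot subsample, extrapolate error as shrinking with 1/√n and latency as growing linearly in n, then pick the smallest sample meeting the error bound (or the largest fitting the time bound). This is a hedged sketch of the idea under those standard sampling assumptions, not BlinkDB's actual ELP code.

```python
import math

def build_elp(pilot_error, pilot_latency, pilot_n, candidate_sizes):
    """Extrapolate an error-latency profile from a pilot run on `pilot_n` rows,
    assuming error ~ 1/sqrt(n) and latency ~ n."""
    profile = {}
    for n in candidate_sizes:
        error = pilot_error * math.sqrt(pilot_n / n)
        latency = pilot_latency * (n / pilot_n)
        profile[n] = (error, latency)
    return profile

def pick_sample_size(profile, max_error=None, max_latency=None):
    """Smallest sample meeting the error bound, or the largest fitting the time bound."""
    sizes = sorted(profile)
    if max_error is not None:
        for n in sizes:
            if profile[n][0] <= max_error:
                return n
    if max_latency is not None:
        feasible = [n for n in sizes if profile[n][1] <= max_latency]
        return max(feasible) if feasible else None
    return None
```

In BlinkDB the pilot statistics additionally account for the query's selectivity on each stratified sample, but the shape of the tradeoff is the same.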

The main contribution of BlinkDB is to provide a framework that lets users specify query accuracy or response time requirements, and that applies sample-based methods to meet those constraints. It extends standard SQL with syntax that supports approximate queries. It also implements a run-time sample selection strategy that dynamically picks a sample that can satisfy the time or error constraint, and it solves the optimization problem of deciding appropriate sample sizes subject to a storage constraint.
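For concreteness, the SQL extension annotates ordinary aggregation queries with either an error bound or a time bound, roughly in the style shown below (clause wording approximated from the paper's examples; shown here as plain strings).

```python
# Error-bounded form: bound the relative error at a confidence level.
error_bounded = (
    "SELECT COUNT(*) FROM Sessions "
    "WHERE Genre = 'western' GROUP BY OS "
    "ERROR WITHIN 10% AT CONFIDENCE 95%"
)

# Time-bounded form: bound the response time instead.
time_bounded = (
    "SELECT COUNT(*) FROM Sessions "
    "WHERE Genre = 'western' GROUP BY OS "
    "WITHIN 5 SECONDS"
)
```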

Some weaknesses of this paper: it doesn’t support general joins or nested SQL. Though the authors claim that some such queries can be flattened into simple queries, doing so could be a huge pain in production. It also introduces considerable storage overhead, as it stores multiple sample sets.