Review for Paper: 22-BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data

Review 1

This paper presents BlinkDB, a parallel sampling-based approximate query engine that provides support for ad-hoc queries with error and response time constraints. It is based on two basic ideas: multi-dimensional sampling, and run-time dynamic sample selection.

The paper starts by classifying workload assumptions: predictable queries, predictable query predicates, predictable query column sets (QCSs), and unpredictable queries. BlinkDB assumes predictable QCSs.

BlinkDB has two major components: (1) an offline sampling module that creates and maintains samples, and (2) a run-time sample selection module that builds an Error-Latency Profile (ELP) for each query and selects the samples to use.

It is worth mentioning that sample selection is beneficial because uniform sampling fails when some data are rare and might be left out of the sample entirely.

Experimental results show that BlinkDB follows the given time/error constraints well, and that query execution is orders of magnitude faster than in the DBMSs it is compared against. It also shows good convergence properties, and response time scales linearly as more cluster resources and data are added.

The parts I like about this paper are: (1) it clearly defines its purpose and background, with a good introduction to related work and quantitative analysis of how prior approaches handle the problems dealt with in this paper; (2) it has a good evaluation setup, covering latency, scaling, constraint, and sampling-method tests.

One thing I have doubts about: for the evaluation of different sampling strategies, the paper says unique QCSs are tested, but it does not mention how these QCSs are selected. I also think too many mathematical symbols are used; this seems inevitable, but it makes the paper hard to follow.


Review 2

Modern data analytics applications need real-time response rates for computing aggregates over a large number of records. This problem is important because it is demanded by more and more use cases, including updating ads in social networks, determining the subset of users experiencing poor performance based on their service provider, and so on. However, previous solutions do not fit today's big-data analytics workloads.

This paper proposes BlinkDB, which aims at a better balance between efficiency and generality for analytics workloads. It allows users to pose SQL aggregation queries over stored data with response-time or error-bound constraints, trading off response time against error rate. The sample creation module ensures that BlinkDB can answer queries about any subgroup, regardless of its representation in the underlying data, by over-representing rare subgroups. The sample selection module runs a query on multiple smaller sub-samples to quickly estimate query selectivity and chooses the best sample to satisfy the specified response-time and error bounds.

Some of the strengths and contributions of this paper are:
1. BlinkDB uses a column-set-based optimization framework to compute a set of stratified samples, considering the frequency of rare subgroups, the column sets in past queries, and the storage overhead of each sample.
2. Error-latency profiles built for each query at runtime are used to select the most appropriate sample to meet the query's response-time or accuracy requirements.
3. BlinkDB can be integrated into existing parallel query processing frameworks such as Hive with minimal changes.

Some of the drawbacks of this paper are:
1. The future query workload is assumed to be similar to historical queries. Therefore, unpredictable queries, which may be common in practice, are hard to sample for and process.
2. BlinkDB does not support arbitrary joins and nested SQL queries.



Review 3

Modern data analytics applications increasingly demand near real-time responses when computing aggregates over large volumes of data. Existing techniques either provide relatively poor performance for queries on rare tuples (OLA), or make strong assumptions about the predictability of workloads that substantially limit the types of queries they can execute (sampling- and sketch-based solutions).

Therefore, this paper proposes BlinkDB, a distributed sampling-based approximate query processing system with a better balance between efficiency and generality for analytics workloads. BlinkDB is built on top of the Apache Hive framework and consists of two main modules: (i) Sample Creation and (ii) Sample Selection. The sample creation module creates stratified samples on the most frequently used QCSs to ensure efficient execution for queries on rare values. Based on a query's error/response-time constraints, the sample selection module dynamically picks a sample on which to run the query.

The contributions and advantages of the paper are as follows:
1. This paper uses a column-set-based optimization framework to compute a set of stratified samples, compared with other approaches that keep only a single sample per table, which makes the samples more representative.
2. The optimization takes into account (i) the frequency of rare subgroups in the data, (ii) the column sets in past queries, and (iii) the storage overhead of each sample, which significantly improves query speed.
3. This framework creates error-latency profiles (ELPs) for each query at runtime to estimate its error or response time on each available sample. This heuristic is then used to select the most appropriate sample to meet the query’s response time or accuracy requirements.
4. This paper shows how to integrate the proposed approach into an existing parallel query processing framework (Hive) with minimal changes.
5. The proposed BlinkDB provides bounded error and latency for a wide range of real-world SQL queries, and it is robust to variations in the query workload.

The main disadvantage of the paper is that BlinkDB focuses only on aggregate queries. However, non-aggregate queries are also important to enterprises; for example, personalized recommendations can be computed based on each user's view history. This model does not support non-aggregate queries, which could be explored further. Besides, experience with queries in large production big-data clusters suggests that a priori samples can be untenable: accesses to input files are heavy-tailed, and queries often join multiple inputs or have complex filters. Consequently, systems like BlinkDB, which store differently stratified samples of the inputs, may have small coverage and limited performance gains even when given several times the size of the inputs to store samples.


Review 4

Problems & Motivations
Traditionally, data analytics applications answer queries with sequential scans over a large fraction of the data. However, as more and more data is produced, this process becomes time-consuming, yet new applications demand real-time response rates. Existing approximate sampling algorithms either require strong assumptions about the query workload (e.g., predictable queries) or perform poorly on queries over rare data entries. Therefore, the authors propose BlinkDB.

Main Achievement:
The authors propose BlinkDB. According to the paper, BlinkDB requires a workload with predictable QCSs, and the authors argue this assumption is moderate and fits reality. In addition, the paper adopts stratified sampling, arguing that uniform sampling is not reasonable since each group's sample size is proportional to that group's size in the table, which leaves rare groups under-represented. Furthermore, it provides many optimizations for stratified samples, for example storing each sample sorted by the order of the columns in φ and compressing it to reduce the storage overhead.

Drawbacks:
It is not clear what some of the graphs are trying to represent.



Review 5

In recent years, companies and other organizations have been accumulating greater and greater amounts of data, which, combined with the increasing demand for near real-time response rates in certain applications, has made it harder for traditional database management methods to keep up while still maintaining acceptable performance. In response, approximation techniques such as sampling, sketches, and online aggregation have been proposed that allow users to trade off some accuracy for improved performance. These methods all have their drawbacks, though, such as limited adaptability to varying workloads. BlinkDB is a massively parallel approximate query processing system that aims to overcome many of these previous issues in order to achieve a good balance between raw performance and workload generality, while being able to make certain guarantees on the accuracy of the results presented (in the form of meaningful error bars).

The system is composed of two primary modules: Sample Creation and Sample Selection. The former creates stratified samples on the data so that queries can be performed efficiently (as opposed to running on the full dataset), while still being able to answer queries that request rare values, which can require larger samples to produce the same level of confidence for estimates on that data. This often happens when queries request filtered or grouped subsets of the table, and uniform sampling methods, in which each row is drawn with equal probability regardless of its group, typically do not handle it as well. Stratified sampling ensures that these rare values are sufficiently represented. Multiple samples over combinations of different columns are stored in order to improve efficiency, and rows in each stratified sample are stored sequentially based on the order of columns, again, to improve the efficiency of file access. The storage overhead incurred for doing so is only around 10% or less for heavily used columns. Even so, the system balances the trade-off between maintaining more samples for better coverage and the storage costs needed to maintain them. In any case, after samples are created, the sample selection module dynamically picks the sample on which to run the query, based on the one it predicts will best satisfy the user's requirements for error and/or time. One of its tasks is to pick an appropriately sized subsample based on time or error constraints, using the error-latency profile for queries, which characterizes how error and response time change as the sample size increases. The goal is to find the smallest sample that meets these requirements, after which the query is run.
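The "meaningful error bars" mentioned above can be illustrated with the textbook closed-form bound for an AVG computed over a uniform random sample. This is a minimal sketch of that standard statistic, not BlinkDB's actual estimator (which must also handle stratified samples and other aggregates):

```python
import math

def mean_with_error_bar(sample, z=1.96):
    """Estimate AVG from a random sample and report an error bar:
    mean +/- z * s / sqrt(n), i.e. a ~95% confidence interval
    for z = 1.96. An approximate query engine can return this
    half-width alongside the approximate answer."""
    n = len(sample)
    mean = sum(sample) / n
    # unbiased sample variance
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    half_width = z * math.sqrt(var / n)
    return mean, half_width
```

Larger samples shrink the half-width roughly as 1/sqrt(n), which is exactly the trade-off between response time and accuracy that the sample selection module navigates.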

The main strength of this paper is that it presents a new, working database system capable of running SQL queries in an approximate fashion. As their results show, they were able to achieve a 10 to 200x speedup for queries with a 1% error bound at 95% confidence, compared to standard Hive on Spark (both with and without caching), due to having to read far fewer data points. As the authors note, in some cases, BlinkDB returned results in a few seconds while it took thousands of seconds for the other systems. This was true for both the TPC-H benchmark and Conviva, which is more representative of a real-world workload. Additionally, the authors’ decision to utilize stratified sampling resulted in lower error compared to uniform sampling.

Besides the obvious drawbacks of using sampling rather than deterministic methods, which make BlinkDB less appropriate for certain applications, there are some other areas of improvement for this approximate query engine. One is how feasible it is to support a wide array of arbitrary queries, especially when error-latency profiles have to be built for each, which presupposes a known or relatively predictable workload. Also, quite a bit of precomputation is done in constructing the samples, which raises the question of when it is necessary to recalibrate the sizes of each group, since the data distribution may change over time as new records are written or old ones updated/deleted.


Review 6

The main contribution of this paper is BlinkDB, which greatly reduces the response time of queries on large data by using statistical methods. This is important, as the paper mentions because, within a certain error bound, you can find results in a fraction of the time which are close enough to the actual value for practical use. This is accomplished by selecting a sample of the full dataset which represents different “query column sets” (QCSs) well enough to query the sample and yield a relatively accurate approximation. The contributions of this paper were computing stratified samples, generating error and response-time profiles for queries on each of the samples they created, and finally integrating the previous contributions into the Hive parallel query processing framework.

As we look into the background that motivated these contributions, the paper defines four types of workloads on large data, in decreasing flexibility as far as potential to optimize. The first model is predictable queries, where future queries are known in advance, so you can explicitly optimize for them (but this cannot adapt to a different set of queries). The next model is predictable query predicates, where the values in WHERE, GROUP BY, and HAVING clauses do not change over time. The third model is predictable QCSs, where the groups of columns appear at a predictable frequency, but the values in them may change over time. The final model is completely unpredictable queries, which must be optimized at the level of a single query. This paper focuses on the third model, since this assumption allows samples of the large dataset to be pre-computed.

Next we see how BlinkDB supports queries using AVG, SUM, QUANTILE, and COUNT, together with a time or error constraint. With these parameters, samples can be created that are small enough to fit in memory, unlike the large datasets they come from. The paper indicates that, given a time and an error bound, we must find the optimal sample size: small enough not to take too long, but large enough to ensure the desired confidence level. The paper uses stratified samples, which over-represent rare groups in the data (so that they are not excluded from the sample). Storage for a family of stratified samples can be optimized by maintaining one sample for the whole family. This optimization yields a very small storage footprint relative to the original dataset.

To implement these contributions into Hive, the authors added a shim layer to the HiveQL parser and extended it to generate samples as well. They also implemented an uncertainty propagation module that would return the confidence intervals and error bounds to the user, along with the result. I liked how this paper presented its ideas in a logical flow. The paper did get a bit dense towards the middle when formulating the problem, but it was overall thorough.



Review 7

BlinkDB

This paper proposes BlinkDB, a distributed sampling-based approximate query processing system that achieves a better balance between efficiency and generality for analytics workloads. One of the ideas behind BlinkDB is adaptivity: it is an adaptive optimization framework. Another idea is a dynamic sample selection strategy that selects samples based on a query's accuracy and response-time requirements.

The motivation for BlinkDB is the requirement for near real-time response rates in applications that have emerged with the advent of big data. This demand arises in multiple scenarios, such as updating advertising events on social networks and customer feedback analysis systems. These higher demands on query processing gave rise to approximation techniques. As the paper states, however, previous approximation techniques are not a good fit for today's requirements: they either made strong assumptions about the query workload, or made fewer assumptions but suffered highly variable performance.

The BlinkDB system proposed in this paper consists of two modules: sample creation and sample selection. The first creates stratified samples on the most frequently used query column sets to ensure efficient execution for queries on rare values. The second, based on a query's error and time constraints, picks a sample on which to run the query. For the overall system, BlinkDB modifies the Apache Hive framework by adding these two components. The sample creation module builds a range of uniform and stratified samples across dimensions, relying on distributed reservoir sampling or binomial sampling techniques; the sample selection module then uses an error-latency profile at run time to decide which sample best satisfies the user's constraints. Beyond that, BlinkDB also augments the query parser, optimizer, and aggregation operators to bound the error and execution time.

The main contributions of this paper are the following. First, it uses a column-set-based optimization framework to compute stratified samples, considering the frequency of rare subgroups in the data, the column sets in past queries, and the storage overhead of each sample. Second, it builds error-latency profiles for each query at runtime to estimate the error and response time on each sample; these are then used to select the most appropriate sample to meet the response-time and accuracy demands. Third, the paper gives an example of integrating these algorithms into a parallel query processing framework with few changes. Unlike previous work, BlinkDB is robust across varied workloads and provides bounded error and low latency.

One thing this paper could improve: as the paper notes, it only allows users to pose aggregation queries over stored data. These queries are relatively easy ones, and BlinkDB handles them very well; however, the paper has no experiments on non-aggregation queries, which would truly evaluate how fast the system is. But overall, this is an excellent paper.



Review 8

This paper proposes BlinkDB, a parallel approximate query engine for huge amounts of data. When a very large database is used, people sometimes want queries to be answered very quickly, so this paper provides a method to trade accuracy for query time. Unlike previous approximate query engines that assume a completely known database workload, BlinkDB bases its method on query column sets. The basic idea is that the percentage of queries using a particular column to group or filter stays stable, but the values used in the grouping are unpredictable.

BlinkDB mainly implements two modules: an offline sampling module and a run-time sample selection module. The offline sampling module creates samples, and its key idea is stratification: for each column set, minority groups should also be well represented in the sample, so that when they are queried, BlinkDB can still give a fairly good response.
The sample selection module chooses a sample at run time to answer each query. A query arrives with an error limit or a time limit, and the runtime module decides which sample the query should run on under that limitation. The query is then processed on the sample, and an error estimate is returned with the result.

The main contribution of the paper is BlinkDB, an approximate query engine that does not assume the future workload is fully known. Second, an innovation of BlinkDB is that it allows the user to specify an error or time bound. BlinkDB formulates a way to characterize the features of queries and to figure out what stays stable. The experiments were run on real-world OLAP data, and BlinkDB performed well.

I think the limitation of BlinkDB lies in its assumption that the columns used for grouping and filtering are fairly stable. However, a database itself keeps evolving, and the needs of its users are constantly changing. If a new index is added, or the emphasis of queries shifts, BlinkDB will fail to perform well for lack of appropriate samples.




Review 9

The paper presents a system for approximate query processing. The general goal is to make queries a) correct within some bound and b) as fast as possible. This goal is brought on by many new applications, which require quickly calculated aggregations. The authors lay out the problem as finding a sample such that a query can run as quickly as possible while still falling within some user-defined accuracy bound.

The workload is modeled as predictable query column sets - the idea here is that future queries will have the same percentage of group bys and filters on sets of columns as previous queries. This is used to build samples, as opposed to other techniques on opposite ends of the scale, such as assuming that all queries will be exactly known from past queries or that nothing can be known from past queries. This allows the system to support ad-hoc queries.

Stratified samples are used to ensure that samples used in the evaluation of aggregation operators contain values from underrepresented groups. Finally, an error latency profile is used to find the samples that are appropriate for a given query. The system is integrated into HIVE. The primary contribution is the new ability for the user to easily trade latency for accuracy in their queries.

I found many pieces of this paper to be very well done and written in a way such that they were easy to understand. One thing that I really appreciated was that the authors gave a significant number of examples throughout. The first comes in the introduction: an application requiring fast response times for a huge aggregation query. These types of examples continue throughout the paper, and show the real-life need for the system. Additionally, I noted that the paper has links to all figures. This may simply be because the paper is more recent than others, but it really helped as figures are often located on different pages than their descriptions.

Despite the attempts of the authors to make them more clear, I still found many of the equations to be difficult to follow. I did appreciate the attempt to describe notation ahead of time in Table 1. However, there is a significant amount of notation described, and it was a pain to need to reference this table each time something was used again many pages later. I believe that there was an attempt to make these clear (many equations have text explanations) which was appreciated.



Review 10

This paper introduces BlinkDB, a large-scale parallel, approximate query engine built on top of the Hive query engine and the Hadoop distributed file system. The goal of BlinkDB is to allow users to trade off query accuracy for response time, and therefore to run interactive SQL queries on very large amounts of data and get an approximate result within seconds.

There are two key observations in this paper. The first one is that a given workload has only a small set of QCSs (query column set). This means QCSs are predictable and generalize well to future workloads. By using this workload model, BlinkDB also provides high query flexibility. The second observation is that traditional uniform sampling is unable to draw enough samples from rare groups and leads to large errors or even missing groups.

Based on these observations, the design of BlinkDB consists of two main modules: the sample creation module and the sample selection module. To create a sample, BlinkDB uses a stratified sampling technique: every group in a given QCS receives the same number of sampled rows, which prevents any group from being under-represented. For rare groups, all rows are sampled and queries on them are answered exactly, while for common groups enough samples are taken to give a well-approximated result. To select the set of QCSs on which stratified samples should be taken, BlinkDB solves an integer linear program that maximizes the coverage of these QCSs on a given workload subject to a storage constraint.
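The last step above, choosing which QCSs to stratify on under a storage budget, is formulated in the paper as an integer linear program. Purely as an illustration of the shape of the problem, a greedy coverage-per-byte heuristic for the same kind of budgeted selection might look like this (the function name and the candidate representation are assumptions, not BlinkDB's API):

```python
def choose_qcs_greedy(candidates, budget):
    """Greedily pick QCSs with the best coverage-per-byte ratio
    that still fit in the remaining storage budget.
    `candidates` maps a QCS (a tuple of column names) to a
    (workload_coverage, storage_cost) pair."""
    chosen, remaining = [], budget
    pool = dict(candidates)
    while pool:
        # among candidates that still fit, take the best ratio
        best = max(
            (q for q, (cov, cost) in pool.items() if cost <= remaining),
            key=lambda q: pool[q][0] / pool[q][1],
            default=None,
        )
        if best is None:
            break
        chosen.append(best)
        remaining -= pool.pop(best)[1]
    return chosen
```

An ILP solver would find the true optimum; the greedy version only approximates it, but it shows how coverage and storage cost trade off per candidate QCS.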

At runtime, BlinkDB tries to find the best stratified sample (or simply uses a uniform sample) for a given query. It does this by creating an error-latency profile over the candidate stratified samples: for each sample, it first runs the query on small subsets of the sample to estimate its error or latency, and then projects the results onto larger sample sizes. Based on the profile generated, BlinkDB selects the sample that satisfies the error or latency requirements.
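The projection step can be sketched under the common assumption that, for mean-like aggregates, error shrinks as 1/sqrt(sample size). The function names here are hypothetical illustrations, not taken from the paper:

```python
import math

def project_error(pilot_error, pilot_n, target_n):
    """Project an error measured on a small pilot subsample onto a
    larger sample, assuming error scales as 1/sqrt(sample size)."""
    return pilot_error * math.sqrt(pilot_n / target_n)

def smallest_sample_for_error(pilot_error, pilot_n, sample_sizes, error_bound):
    """Return the smallest candidate sample size whose projected
    error meets the bound, or None if none qualifies."""
    for n in sorted(sample_sizes):
        if project_error(pilot_error, pilot_n, n) <= error_bound:
            return n
    return None
```

An analogous profile for latency would project response time upward with sample size, and the runtime would pick the smallest sample meeting whichever bound the user specified.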

One shortcoming I found is that since workload may not be stationary, BlinkDB needs to periodically create new samples. This process is time-consuming and the performance during this process is unclear. It’s also not clear how to determine when new samples are needed and how to set the parameter for controlling the percentage of samples that can be changed at any single time.


Review 11

In the paper "BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data", Sameer Agarwal, Barzan Mozafari, and co-authors develop BlinkDB, a parallel, approximate query engine for running interactive SQL queries on large volumes of data. In the modern era, the amount of data being generated has been increasing exponentially, spanning fields such as health care, finance, and social media. Storing all of this data is one thing, but querying and working with it to generate meaningful insights on a timely basis is an area of active interest. One proposed solution to this challenge is approximate databases, where results are generated on the fly by sampling data rather than iterating sequentially through it. By trading off an acceptable amount of accuracy, tremendous performance gains can be achieved. This has broad applications in areas such as exploratory data analysis and information visualization, where researchers are more interested in overall trends than in extremely accurate results. Speaking to this, BlinkDB is able to process over 17TB of data in less than 2 seconds with an error of 2%-10%! Thus, it is clear that this is a problem that is interesting and important.

Agarwal alludes to previous works in the field of approximate query engines. With any approximate query engine, there is a trade-off between flexibility and efficiency. Unpredictable queries fall within the domain of On-Line Aggregation (OLA). Predictable queries fall within the domain of traditional databases that know their workload ahead of time. Materialized views sit near traditional databases; they operate under the assumption that they are able to predict query predicates. BlinkDB sits near OLA - rather than making no assumptions on the underlying data, it assumes it can make accurate predictions about query column sets (QCS). Thus, the approaches are similar, but the techniques employed are different - namely, the sampling method. The paper is divided into the following sections:
1) System Overview: BlinkDB extends the Apache Hive framework with two major components: an offline sampling module that maintains samples over time and a run-time selection module that creates error-latency profiles for queries. To decide on these samples, the QCSs that appear in queries are used and different sampling techniques are employed. Currently, there is support for many aggregate queries and GROUP BY, but limited support for joins.
2) Sample Creation: BlinkDB uses stratified samples in order to account for the cases when uniform sampling fails. In particular, uniform sampling may perform well on queries that do not filter or group data, but does poorly if the opposite is true. Thus, we need stratified sampling to sufficiently represent rare subgroups. Taking into account the sparsity of data, workload, and storage costs, this turns into an optimization problem that increases exponentially with the number of columns. Thus, we only need to consider column sets that appeared in past queries or eliminate column sets that are too large.
3) BlinkDB Runtime: Depending on the set of columns, the selectivity of selection predicates, and data placement/distribution, a stratified or uniform sample is selected. Furthermore, the size of the sample depends on time and accuracy constraints, computational complexity, the physical distribution of data in the cluster, and the cluster resources available at runtime. Afterwards, a latency profile is constructed for all queries with response-time constraints, and an attempt is made to eliminate bias on non-uniform samples.
4) Evaluation: Uses workloads from Conviva Inc. and the TPC-H benchmark. As expected, BlinkDB performs much faster than its non-sampling counterparts.

Even though BlinkDB solves a problem that many researchers working with OLAP data are facing, it still has many drawbacks. One drawback is the lack of exploration on other sampling techniques that could be used to operate BlinkDB. Stratified sampling is used to obtain a fair and representative sample of the groups, but what if you do not want to include outliers? Another drawback is the inclusion of a lot of technical details, specifically statistics. I understand that it is necessary to touch upon these topics, but I feel that it confused me even further than was intended. Lastly, the final drawback that stood out to me was in the "related work" section. When talking about different existing approximate query engines, the authors did not place them on the flexibility vs efficiency scale. Thus, it was hard to understand what setting and use cases these engines were used in relation to BlinkDB.


Review 12

This paper describes BlinkDB, a DBMS that allows the user to run much faster queries at the cost of some accuracy. Queries on very large tables can take inordinate amounts of time due to loading each part of the table into memory separately. If the query is performed on a small sample of the table, then the entire subset can fit in memory, while the result may still be very close to the true result. Thus, BlinkDB has two jobs: creating useful samples of tables, and picking the right sample for each query that comes in.

Samples should be created based on some measure of prediction for incoming queries. Queries can be predicted on one of several levels, with each subsequent level being more broad, but offering less predictive power:
1. The queries themselves are known
2. The predicates of the queries are known
3. The columns of the queries are known
4. The queries are entirely unpredictable
BlinkDB operates under the 3rd assumption.

When creating samples, uniform samples often aren't good enough. For each group of columns, larger groups get more representation in the sample, which gives them more accurate results than smaller groups; this isn't ideal. To correct this, BlinkDB uses stratified samples, which put a cap on the number of rows taken from each group, making group sizes in the sample more even. Each sample is built on a certain subset of columns, and for any table, BlinkDB selects a set of samples that has maximal column coverage while fitting in a predetermined space.
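The per-group cap described above can be sketched with one pass of reservoir sampling per group. This is a minimal single-machine illustration of the idea, not BlinkDB's distributed implementation:

```python
import random
from collections import defaultdict

def stratified_sample(rows, qcs, cap):
    """Keep at most `cap` rows per distinct value of the query
    column set `qcs`. Rare groups are retained in full; common
    groups are down-sampled with reservoir sampling, so every
    group is represented roughly evenly."""
    groups = defaultdict(list)   # group key -> sampled rows
    seen = defaultdict(int)      # group key -> rows seen so far
    for row in rows:
        key = tuple(row[c] for c in qcs)
        seen[key] += 1
        bucket = groups[key]
        if len(bucket) < cap:
            bucket.append(row)
        else:
            # reservoir step: each row survives with prob cap/seen
            j = random.randrange(seen[key])
            if j < cap:
                bucket[j] = row
    return [r for bucket in groups.values() for r in bucket]
```

A query filtering on a rare group then sees all of that group's rows, while a query on a common group sees a bounded, representative subset.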

Whenever a query is processed, BlinkDB must select an appropriate sample. The sample must be built on a superset of columns included in the query. There may be multiple appropriate samples, so for each one, a small subsample is selected, and the query is run on the subsample. Timing and error profiles can be determined from the subsample, so among all samples, BlinkDB picks the one with timing and error profile matching any timing and error bounds that may be in place, and then runs the query on the full sample, reporting this as the final result.

BlinkDB is able to find a way to make large speed gains at the cost of some accuracy in aggregate operations, where a loss of accuracy may be acceptable. Since BlinkDB knows the sampling rate, it can even estimate the degree of accuracy lost. Stratified sampling is extremely important to the paper, and it is covered in exacting detail, but some of the computations on how sample selection is optimized can be hard to follow. The experimental results are effective at showing how smaller or larger samples trade off between estimation accuracy and query runtime, as well as demonstrating how stratified samples tend to have less estimation error than uniform samples, showing their usefulness.



Review 13

This paper presents BlinkDB, which is an approximate query engine for running interactive SQL queries on large volumes of data. BlinkDB uses two key ideas: (1) an adaptive
optimization framework that builds and maintains a set of multi-dimensional stratified samples from original data over time, and (2) a dynamic sample selection strategy that selects
an appropriately sized sample based on a query’s accuracy or response time requirements.

The paper first presents a taxonomy of workloads: Predictable Queries, Predictable Query Predicates, Predictable QCSs, and Unpredictable Queries. BlinkDB adopts the model of predictable QCSs. As the paper shows, this model provides enough information to enable efficient pre-computation of samples, and it leads to samples that generalize well to future workloads in the paper's experiments. To test the validity of the predictable-QCS model, the paper analyzes a trace of 18,096 queries from 30 days at Conviva and a trace of 69,438 queries constituting a random fraction of 7 days' workload at Facebook to determine the frequency of QCSs. The analysis shows that QCSs are relatively stable over time, which suggests that past history is a good predictor of future workload.

BlinkDB has an architecture similar to Apache Hive's, with two additional components: an offline sampling module that creates and maintains samples over time, and a run-time sample selection module that creates an ELP for queries. For sample creation, a standard approach is uniform sampling: drawing n rows from T with equal probability. This approach is imperfect, however. Since error decreases at a decreasing rate as sample size grows, a better approach simply assigns an equal sample size to each group, and the assignment of sample sizes is deterministic. A few problems remain with BlinkDB: sparsity of the data, cases where samples do not benefit the actual queries, and large storage costs from too many stratified samples. Finally, the paper explains the implementation and the experiments used to evaluate BlinkDB, concluding that BlinkDB is effective at handling a variety of queries with diverse error and time constraints.
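The "equal sample size per group" rule mentioned above could be sketched as a water-filling allocation (an illustrative sketch only; the function name and tie-breaking order are invented, not the paper's algorithm):

```python
def allocate(group_sizes, budget):
    """Split a row budget equally across groups, capping each group at its
    actual size and redistributing leftover capacity to groups that can
    still absorb rows."""
    alloc = {g: 0 for g in group_sizes}
    active = set(group_sizes)
    left = budget
    while left > 0 and active:
        share = max(1, left // len(active))
        for g in sorted(active):  # deterministic visiting order
            take = min(share, group_sizes[g] - alloc[g], left)
            alloc[g] += take
            left -= take
            if alloc[g] == group_sizes[g]:
                active.discard(g)
            if left == 0:
                break
    return alloc
```

With `allocate({"a": 100, "b": 5, "c": 50}, 60)`, the tiny group "b" is taken in full while the larger groups split the remaining budget roughly evenly, mirroring how equal allocation protects rare groups.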

The advantage of this paper is that it covers both the advantages of and potential issues with BlinkDB. For each potential problem, the paper also explains approaches to solve it in detail.

The disadvantage of this paper is that it gives too little background before diving into technical details. Since BlinkDB is an extension of Apache Hive, a high-level overview of Hive would help readers understand the paper.


Review 14

This paper proposes BlinkDB, a massively parallel, approximate query engine for running queries on large-scale data. By "approximate", it means trading error bounds against running time, using partial data to approximate the precise query answer.

It is therefore important for this kind of query engine to decide what types of samples to create for running partial queries. BlinkDB assumes that the frequency of columns used in grouping and filtering remains stable over time, while the exact values being queried are unpredictable. The authors introduce the term query column set (QCS) to describe the set of columns used in a grouping or filtering query. Under this assumption, BlinkDB implements two main components: an offline sampling module and a run-time selection module. BlinkDB treats the creation of samples as an optimization problem over QCSs, and the sampling algorithm defines a "sparsity" function that accounts for the rarity of items in the data.
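To make the QCS idea concrete, the relative frequency of column sets in a past query log could be tallied like this (a sketch; the function name and log format are invented for illustration):

```python
from collections import Counter

def qcs_frequencies(query_log):
    """Given one set of grouping/filtering columns per past query, return the
    relative frequency of each query column set (QCS). Frequent QCSs are the
    natural candidates for building stratified samples."""
    counts = Counter(frozenset(columns) for columns in query_log)
    total = sum(counts.values())
    return {qcs: count / total for qcs, count in counts.items()}
```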

To me, the strong part of BlinkDB is that its key observation about trading accuracy for speed really helps in practice. It is often desirable to get an answer very quickly, even if it differs slightly from the precise answer; such error is acceptable in production.

The paper is well written, and charts and figures illustrate many aspects. I see no obvious problems.


Review 15

In this paper, the authors introduce a novel approximate DBMS called BlinkDB, which supports queries with bounded errors and response times on very large data. BlinkDB allows users to trade query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars. The design of an approximate DBMS remains a significant problem: with the rapid growth in data volume, it is very expensive to scan every entry to answer a query. One clever idea is to approximate using a sampled subset of the data; though it sacrifices some accuracy, this method yields extreme speed-ups for OLAP workloads. As reported in the paper, BlinkDB can answer queries on up to 17 TB of data in less than 2 seconds with an error of 2-10%. In some scenarios it is definitely worthwhile to sacrifice some accuracy for a more than 200x speed-up, which is very impressive! Next, I summarize the crux of BlinkDB in my own understanding.

There are existing approximation methods for DBMSs, such as sampling, sketches, and online aggregation (OLA); however, none of them is a good fit for today's big data analytics workloads. OLA performs relatively poorly on queries over rare tuples, while sampling and sketches make strong assumptions about the predictability of workloads or substantially limit the types of queries they can execute. BlinkDB comes to the rescue: it is a distributed, sampling-based approximate query processing system that strives for a better balance between efficiency and generality for analytics workloads. BlinkDB consists of two main modules: sample creation and sample selection. The sample creation module builds stratified samples on the most frequently used QCSs to ensure efficient execution of queries on rare values. These samples are designed to efficiently answer queries with the same QCSs as past queries and to provide good coverage for future queries over similar QCSs.

I think BlinkDB has many advantages compared to other methods. First, it adopts an adaptive optimization framework that builds and maintains a set of multi-dimensional stratified samples from the original data over time. This makes fewer assumptions than traditional sampling methods and extracts more valuable information by solving an optimization problem. Second, it uses a dynamic sample selection strategy that selects an appropriately sized sample based on a query's accuracy or response time requirements. Third, BlinkDB supports more general queries, as it makes no assumptions about the attribute values in the WHERE, GROUP BY, and HAVING clauses, or about the distribution of the values used by aggregation functions. From an engineering perspective, BlinkDB provides a flexible query interface that lets end users configure the system for their current workloads and adapt it dynamically; it is a really cool idea and definitely useful in industry.

Generally, it is a nice paper describing a novel approximation method for handling large data, and its downsides are minor. Although the approach makes fewer assumptions than requiring complete knowledge of the tuples, the authors note that their samples are neither over- nor under-specialized only if the distribution of QCSs is stable over time; if the QCS distribution shifts, this prediction may not hold well. For the same reason, BlinkDB is less suitable for databases whose data change very frequently.



Review 16

This paper introduces BlinkDB, which is a parallel database that introduces a new way to do approximate queries that solves problems that previous approximate query engines had. In particular, previous approximate query engines either made overly powerful assumptions about workloads or made no assumptions at all, which resulted in overly inaccurate results or results with overly poor performance. BlinkDB solves this with a 2-module approach. The first module is a sample-builder that uses stratified samples that over-represent (comparatively) smaller groups. This module works offline and works on query column sets that have been shown (by testing on Facebook & Conviva Inc workloads) to not change too much over time. By stratifying samples like this, more generality is provided.
The second module is an online sample selection module that computes error & latency measurements for sub-samples using heuristics in order to select which sample works best given the *user-specified* time & accuracy requirements. It does this by creating separate “profiles” for error and latency and then correcting bias by weighting subgroups depending on the effective sample weight.
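The bias correction by subgroup weighting could look like the following (an illustrative sketch; the `_weight` field holding each row's inverse sampling rate is an invented convention, not BlinkDB's actual representation):

```python
def weighted_avg(rows, column):
    """Bias-corrected AVG over a stratified sample: weight each sampled row
    by the inverse of its group's sampling rate, so heavily capped large
    groups are not under-counted relative to fully kept small groups."""
    numerator = sum(row[column] * row["_weight"] for row in rows)
    denominator = sum(row["_weight"] for row in rows)
    return numerator / denominator
```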
The primary contribution of the paper is a system that seems to bridge the gap between overly general query engines and overly assumptive query engines. Allowing user-specified constraints is a major advantage as it means that workload variance is less of an issue while still providing substantial performance benefits over more “general” query engines. With respect to the paper itself, I think the introduction and motivation was extremely well-written—the problem was very clear within the first few minutes I spent reading it, and I felt like I had a very clear high-level picture of BlinkDB’s solution after finishing the first section.
One weakness I think the paper has is that it makes overly conclusive statements about the stability of QCSs. Real-life workloads from Facebook and Conviva are analyzed and it was determined that QCSs were stable in those workloads, which I think definitely means that QCSs *can* be stable, but I don’t think this necessarily means we can assume it is true for all or even most workloads; this is a topic that might need its own research paper to prove. Another minor weakness is that any system allowing more user-specification (time & accuracy requirements) is inherently more complicated to use than one that does not; this is by nature and often is not an issue, but it is still a trade-off.

