Review for Paper: 28-A Comparison of Approaches to Large-Scale Data Analysis

Review 1

Recent academic papers have claimed that MapReduce is an unprecedentedly fast system for performing simple queries on large, distributed data sets. But distributed databases are a pre-existing technology that can perform similar work to MapReduce while also providing the features of a relational database. For example, distributed RDBMSs such as DBMS-X and Vertica offer indexing, automatic data parsing at load time, and enforcement of schemas. So the question arises: what are the trade-offs in practice between MapReduce and conventional distributed DBMSs?

Andrew Pavlo et al. compare the performance of Hadoop, the open source MapReduce implementation, with two commercial parallel DBMSs, DBMS-X and Vertica. They find that the DBMSs run queries faster, while MapReduce is easier to set up and tune for performance. MapReduce imports data faster in some cases, because it simply loads files from disk without parsing the data to fit a schema. Yet even on simple queries like the "original MapReduce task," grepping a large distributed file for a target string, MapReduce runs queries much slower than a distributed DBMS. This is because MapReduce must spend considerable time loading data into memory or shipping it to the various map nodes before it begins processing the query, while the DBMSs parse and organize the data at load time and can begin processing more quickly.
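To make that concrete, here is a minimal Python sketch (not the authors' code) of what the "original MapReduce task" amounts to: a map-only pass that emits every record containing a target substring. A parallel DBMS would instead run something like SELECT * FROM data WHERE field LIKE '%XYZ%' over data it already parsed at load time. The record format and pattern below are illustrative only.

    def grep_map(record, pattern="XYZ"):
        # Map function: emit the record if it contains the pattern; no reduce step needed.
        if pattern in record:
            yield record, 1

    def run_map_only_job(splits, pattern="XYZ"):
        # Each map worker would scan its own file split; here we simulate them in one process.
        return [kv for split in splits for rec in split for kv in grep_map(rec, pattern)]

    if __name__ == "__main__":
        splits = [["aaaXYZbbb", "no match here"], ["XYZ at the start"]]
        print(run_map_only_job(splits))  # [('aaaXYZbbb', 1), ('XYZ at the start', 1)]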

One important insight from this paper is that MapReduce jobs have large start-up costs, as data must be shipped to the nodes that perform the work, written to disk, and read from disk again at each stage of a computation. In addition, this performance cost may increase as nodes are added to the worker set. This is because as more nodes are added to, for example, the reduce stage, each map worker must produce more output partitions, which are fetched by more recipients, making delays likely.
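A small back-of-the-envelope sketch of that point (the numbers are hypothetical): the shuffle creates one intermediate partition per (map task, reduce task) pair, so adding reducers multiplies the number of small, seek-heavy files each map worker writes and each reducer must fetch.

    def intermediate_partitions(num_map_tasks, num_reduce_tasks):
        # One bucket of map output per reducer, per map task.
        return num_map_tasks * num_reduce_tasks

    for maps, reduces in [(100, 10), (100, 100), (1000, 100)]:
        print(f"{maps} maps x {reduces} reduces -> {intermediate_partitions(maps, reduces)} partitions")
    # 100 x 10 -> 1,000; 100 x 100 -> 10,000; 1000 x 100 -> 100,000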

Similar to the Spark paper, which criticized MapReduce as too limited a framework for many workloads, this paper suggests that traditional distributed DBMSs are preferable to MapReduce in some use cases, without addressing in depth how MapReduce could be altered to address the problems raised. For example, the authors note that MapReduce start-up time could likely be reduced with small changes to the framework. If such changes were made, the paper's main performance results would likely tell a story more favorable to MapReduce.


Review 2

This paper focuses on a comparison of approaches to large-scale data analysis, featuring MapReduce and parallel DBMSs. The main motivation of the paper was to revisit the major design choices of distributed data-processing systems, since many attempts have been made in this field and a comparison was needed to set them apart for a general audience.

The two approaches discussed are MapReduce, which uses a distributed file system, functions such as map, split, copy, and reduce, and a MapReduce scheduler to handle tasks efficiently, and parallel DBMSs, which use standard relational tables partitioned over cluster nodes and are queried with SQL. MapReduce is flexible in terms of schema support, whereas parallel DBMSs require a relational schema. MR is good for single applications, while a parallel DBMS is good for multiple applications. MR uses a low-level programming model, while a parallel DBMS uses SQL. MR doesn't provide any indexing support; parallel DBMSs, on the other hand, use hash or B-tree indexes. MR has better fault tolerance, and MR is better for on-demand computations while a parallel DBMS is better for repeated computation.

The paper is successful in providing an in-depth comparison between the two major large-scale data analysis approaches. It provides a side-by-side evaluation of both on various factors and a thorough performance evaluation with relevant graphs.

The paper doesn't provide insight into the complexity of the two approaches. The installation process is highlighted but not expanded upon. A market survey of the same would have helped build the case a little more strongly.



Review 3

Recently the trade press has been filled with news of the revolution of "cluster computing". This paradigm entails harnessing large numbers of (low-end) processors working in parallel to solve a computing problem. In effect, this suggests constructing a data center by lining up a large number of low-end servers instead of deploying a smaller set of high-end servers. With this rise of interest in clusters has come a proliferation of tools for programming them. One of the earliest and best known such tools is MapReduce (MR). MapReduce is attractive because it provides a simple model through which users can express relatively sophisticated distributed programs, leading to significant interest in the educational community. For example, IBM and Google have announced plans to make a 1,000-processor MapReduce cluster available to teach students distributed programming. Given this interest in MapReduce, it is natural to ask "Why not use a parallel DBMS instead?" Parallel database systems (which all share a common architectural design) have been commercially available for nearly two decades, and there are now about a dozen in the marketplace.

The purpose of this paper is to consider these choices, and the trade-offs that they entail. It begins with a brief review of the two alternative classes of systems, followed by a discussion of the architectural trade-offs. Then it presents a benchmark consisting of a variety of tasks, one taken from the MR paper and the rest a collection of more demanding tasks. In addition, it presents the results of running the benchmark on a 100-node cluster to execute each task. It tests the publicly available open-source version of MapReduce, Hadoop, against two parallel SQL DBMSs. It also presents results on the time each system took to load the test data and reports informally on the procedures needed to set up and tune the software for each task.

Instead of a paper, I would take it as an article by a blogger. The paper spends a large amount of time discussing the differences between MapReduce and DBMSs, which I don't think are comparable, because the former can be applied in more general settings and supports customization, while the latter is already tuned for typical queries and provides simplicity for that type of work.



Review 4

This paper provides a comparison between two different approaches to large-scale data analysis: the MapReduce paradigm and the parallel SQL DBMS paradigm. Both systems run on a "shared nothing" collection of computers and achieve parallelism by dividing any data set to be utilized into partitions. MapReduce consists of only two functions, Map and Reduce, which are written by the user to process key/value pairs, while parallel DBMSs store data on multiple machines and enable parallel execution. The results show some trade-offs between the two approaches: in short, the parallel DBMSs provide better performance, while MapReduce is easier to use and faster at loading and tuning. The paper first gives an overview of the two approaches and their architectural trade-offs. Then it provides a performance evaluation of both systems on a benchmark. Finally, it presents some discussion of the test results.

There is no general problem here, since this is not a new design but a comparison between two existing approaches to large-scale data analysis. The need for a comparison comes from the fact that new paradigms like MapReduce have dramatically surfaced in the market, bringing new innovations, while existing parallel SQL DBMSs can also handle these problems. Thus this paper provides a more detailed analysis of both approaches and tries to find the trade-offs between them.

The major contribution of the paper is that it provides a detailed comparison between MapReduce and parallel DBMSs, including examples and benchmarks. It also brings up some interesting discussion of the trade-offs between the two approaches. The key trade-offs are summarized below:

1. Both parallel DBMSs have a significant performance advantage over Hadoop MR
2. Parallel DBMSs provide the same response time with fewer processors that use less energy
3. Hadoop is easier to set up and use
4. Parallel DBMSs are difficult to configure and required repeated assistance
5. Hadoop does better in extensibility
6. Hadoop does a superior job of minimizing the work that is lost when a hardware failure occurs

One interesting observation: this paper does a generally good job of comparing both approaches, and it provides detailed experiments and analysis. However, when comparing the two, the paper makes some assumptions that might not be true, such as that the systems would have the same relative performance on 1,000 nodes. Also, this paper does not use Google's version of MR.


Review 5

This paper was trying to address the question of how traditional parallel databases differ from the relatively new MapReduce paradigm. This question is important because, even with parallel DBMSs around for more than 20 years, MapReduce as a relatively new computing model has still gained considerable interest. It is therefore important to understand the advantages and disadvantages of both systems, so that future developers can pick the most appropriate system and/or improve the current designs accordingly.

The main approach proposed by the paper is to compare the design and performance of parallel DBMSs and a MapReduce implementation. Specifically, the paper first gave a high-level workflow introduction to the design of both systems, focusing on how they address large-scale data analysis. It also discussed the major components of each system's architecture, comparing and contrasting the parallel DBMSs and MapReduce side by side. The paper also designed a handful of benchmarks targeting various data-loading and execution workloads to compare the performance of the parallel DBMSs and the MapReduce design from multiple perspectives.


The strengths of this paper include:
- The paper constructed its analysis scenarios in a well-rounded manner and used an easy-to-understand flow. It first presented a clear description of the similarities and differences in how MapReduce and parallel DBMSs handle large-scale data processing at a conceptual level. It then discussed the architectural differences and lastly presented benchmarks covering different large-scale data execution characteristics.
- The benchmarks the paper chose also show its careful design. They capture the differences starting from system setup/deployment, then move on to data loading and different types of query execution.
The main technical contribution of the paper is that it ran a relatively full-scope comparison of parallel DBMSs and a MapReduce implementation. It also presented a detailed analysis of how each system performed, with discussion of what a user should consider when deciding which system to adopt, as well as what could be done to improve each system.

The main drawbacks of this paper are:
- The paper seems biased toward the parallel DBMS side. Most of its analysis, especially the benchmarks, is devoted to showing how parallel DBMSs outperform MapReduce, both in design and in actual performance results. However, as mentioned at the beginning of the paper, MapReduce has gained a significant amount of interest, so it must have some advantage or potential that draws people's attention to researching it. It would be helpful to also spend some analysis on why MapReduce has been so successful at attracting research interest, even if the paper's results seem to indicate that MapReduce is mostly at a disadvantage compared to parallel DBMSs.
- As mentioned briefly in the paper, more than 20 years of fine-tuning has gone into parallel DBMS performance. Since the benchmarks are run with commercially available parallel DBMSs, the paper compares heavily tuned parallel DBMSs against the MapReduce implementation written by the authors. It would be fairer to also discuss how much effort the authors put into implementing relatively efficient Map and Reduce functions.


Review 6

This paper addresses two different models of computation for large-scale data analysis - MapReduce, and parallel DBMSs using SQL. This is a raging debate currently, similar to the great debate of the 70's: is a schema-based model better, or are structure-less MapReduce programs, where the user creates the data-access algorithm, the way to go?

A large part of this debate comes down to schema vs. no schema. MapReduce programmers must write the program so that it can extract and convert the data, and store it in memory for computation. While this may seem more flexible, it also involves quite a bit more work from the programmer. It is also worth considering the difficulty this presents if more than one person is working on the code. While SQL statements are easy to understand - SELECT, FROM, WHERE - MapReduce code requires that the developers essentially agree on a "schema" for the data, even though it is not programmatically defined or enforced.
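A short sketch of that implicit-schema point (the field layout below is invented, not taken from the paper): every MR program reading this data must embed the same delimiter, field order, and types in its parsing code, while SQL declares them once.

    from collections import namedtuple

    # The de-facto "schema": nothing in the system defines or enforces it.
    Ranking = namedtuple("Ranking", ["page_url", "page_rank", "avg_duration"])

    def parse_record(line):
        url, rank, duration = line.rstrip("\n").split("|")
        return Ranking(url, int(rank), int(duration))

    def high_rank_map(line, min_rank=10):
        rec = parse_record(line)
        if rec.page_rank > min_rank:
            yield rec.page_url, rec.page_rank

    print(list(high_rank_map("example.com|42|7")))  # [('example.com', 42)]

    # A DBMS states the same thing declaratively, once:
    #   CREATE TABLE rankings (page_url VARCHAR, page_rank INT, avg_duration INT);
    #   SELECT page_url, page_rank FROM rankings WHERE page_rank > 10;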

The way in which MapReduce does distributed computation can also lead to serious bottlenecks. MapReduce uses a "pull" model to get data into the reduce function, which can lead to heavy disk accesses and seek times. Parallel databases generally try to use a "push" model, so that each node can minimize disk seeks. Parallel DBMSs can also take advantage of indexing, while MapReduce uses brute-force scans for every computation.
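A toy simulation of that pull model, assuming nothing beyond the standard MR shuffle: each map task materializes one bucket per reducer, and every reducer then fetches its bucket from every map task, so an M-map, R-reduce job performs M x R mostly small reads. The partitioner and data are illustrative.

    def map_outputs(records, num_reducers):
        # Each map task writes one bucket per reducer (materialized to local disk in Hadoop).
        buckets = [[] for _ in range(num_reducers)]
        for key, value in records:
            bucket = sum(ord(c) for c in key) % num_reducers  # deterministic toy partitioner
            buckets[bucket].append((key, value))
        return buckets

    def pull_shuffle(all_map_outputs, reducer_id):
        # Reducer r "pulls" bucket r from every map task's output: one fetch per map task.
        fetched = []
        for one_map_output in all_map_outputs:
            fetched.extend(one_map_output[reducer_id])
        return fetched

    maps = [map_outputs([("a", 1), ("b", 2)], 2), map_outputs([("a", 3), ("c", 4)], 2)]
    print(pull_shuffle(maps, 0), pull_shuffle(maps, 1))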

In the realm of fault tolerance, the MapReduce system is more robust. Since it breaks every computation into a set of map and reduce functions, it can simply restart any jobs that fail - the data is already replicated in HDFS/GFS, so the system doesn't have to worry about that. A parallel DBMS, where one node is waiting on output from another node, will have more difficulty dealing with faulty hardware.

The problem really seems to be the legacy architecture in most commercial DBMSs, not the relational model. Vertica maintains the relational model while using a column store, compression, and cluster computing - and this allows it to perform dramatically better than Hadoop/MapReduce. As we've seen from many of the papers that we've read, it is time for a full rewrite of database systems to adapt them to modern hardware. However, we should understand that there are serious benefits to the relational model, and not quickly abandon it to use the newest, "coolest" software. This paper presents a very well thought out response to the MapReduce fad that's been sweeping the industry.



Review 7

Problem and Solution:
The paper was written because the MapReduce model for large-scale data analysis is currently quite popular even though parallel database management systems have existed for a long time. It compares the performance of MapReduce and parallel DBMSs and evaluates their development complexity to find the trade-offs between these different systems. The reason for doing so is to give the database community an objective analysis and some education. What MapReduce and parallel DBMSs have in common is that they are both built for large-scale data processing, which means they can handle large amounts of data and heavy loads. The differences are numerous. One is that MapReduce does not need data transformation or loading, indexing, or a database schema, whereas a parallel DBMS needs all of them, since they are integral to a database. From this aspect, it is much easier to write MapReduce programs. When solving complex problems, developers tend to write multiple MapReduce programs, while joins and nested queries are used in a DBMS. A parallel DBMS can also use user-defined functions to do specific work like MapReduce, but they are complex and hard to use. One shortcoming of MapReduce is that its intermediate results are stored on disk for later use; performance can degrade if there are multiple rounds of such disk I/O. Also, it does not have transactions, or copies of the data for disaster recovery.
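As a small illustration of the intermediate-result point (the stage structure and file paths are invented, not the paper's), a multi-step MR computation is expressed as separate jobs, and each stage writes its full output to the file system before the next stage reads it back:

    import json, os, tempfile

    def run_stage(input_path, output_path, map_fn):
        # Simulates one MR stage: read records, apply the function, materialize results to disk.
        with open(input_path) as fin, open(output_path, "w") as fout:
            for line in fin:
                for result in map_fn(json.loads(line)):
                    fout.write(json.dumps(result) + "\n")

    workdir = tempfile.mkdtemp()
    paths = [os.path.join(workdir, name) for name in ("input.jsonl", "stage1.jsonl", "stage2.jsonl")]

    with open(paths[0], "w") as f:
        for n in range(5):
            f.write(json.dumps({"n": n}) + "\n")

    run_stage(paths[0], paths[1], lambda rec: [{"n": rec["n"], "square": rec["n"] ** 2}])
    run_stage(paths[1], paths[2], lambda rec: [rec] if rec["square"] % 2 == 0 else [])
    print(open(paths[2]).read())  # only the even squares survive the second stage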

Contributions:
The main contribution of this paper is its detailed comparison of MapReduce and parallel DBMSs to find the trade-offs, so that designers can think about them objectively instead of being swept up by the popularity of MapReduce. It is good practice to look back to learn from experience and gain lessons.

Weakness:
The paper is clear and good. My only concern is that the performance depends greatly on the way the queries are processed. Since MapReduce and parallel DBMSs use different languages, bias may exist in the implementations as well as in the experimental results.


Review 8

This paper compares the performance and implementation benefits of the MapReduce framework and parallel database systems. The MapReduce paradigm has become a hot topic in large-scale data processing, with great effort put into the system, but its advantage over a more traditional parallel database system is unclear. If anything, a parallel database system can provide the same level of functionality as MapReduce, and actually showed better performance results in the authors' analysis.

In a parallel DBMS, most tables are partitioned over the nodes in the cluster, and the system uses an optimizer to break down a SQL command into a query plan whose execution is divided among multiple nodes. Unlike MapReduce, the programmer does not need to know the underlying storage details of the system, but the data have to fit the relational paradigm of rows and columns. MapReduce is better when the data processing requires minimal sharing, but when sharing is involved, a parallel DBMS does a much better job of managing the data.

When comparing performance, we see that the parallel DBMSs perform significantly better than the MapReduce system in most cases. The biggest detriment to MapReduce's execution time is its start-up cost, which can be quite significant when the data is small. While MapReduce performed poorly compared to a parallel DBMS, it was significantly easier to implement and set up than the other systems. This paper does a good job of analyzing the two approaches and assessing the benefits of each. Given that many of the performance advantages of parallel DBMSs came from the various optimization techniques developed for the model over the years, it seems like the downside of MapReduce is simply the lack of maturity of the technique, not necessarily some inherent downside of the method.



Review 9

This paper conducts several experiments on Hadoop, DBMS-X, and Vertica and analyzes the performance of MapReduce versus parallel DBMSs in large-scale data analysis. The performance is tested on workloads such as the original MR task and data analysis tasks. The results of those experiments are interesting:

Based on the experiments, the performance of Hadoop is far worse than that of the parallel RDBMSs. In the data analysis tasks, such as selection, aggregation, and especially join, it takes Hadoop much more time to finish. The underlying reasons might be: 1. Hadoop lacks data compression, which makes it use disk bandwidth inefficiently. 2. As the number of nodes scales out, there is much more communication between the master and the workers. 3. For operations like join, Hadoop does not have algorithms as efficient as the operators in an RDBMS.
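To illustrate point 3, here is a minimal sketch of the reduce-side join an MR programmer typically hand-codes (tag each record with its source table, group by the join key, then pair the sides in the reducer); the table names and fields are made up, and a DBMS would simply run a JOIN operator.

    from collections import defaultdict

    def map_users(row):            # row: (user_id, name)
        yield row[0], ("user", row[1])

    def map_visits(row):           # row: (user_id, url)
        yield row[0], ("visit", row[1])

    def reduce_join(key, tagged_values):
        names = [v for tag, v in tagged_values if tag == "user"]
        urls = [v for tag, v in tagged_values if tag == "visit"]
        for name in names:
            for url in urls:
                yield key, name, url

    # Simulated shuffle: group all tagged records by join key, then reduce each group.
    groups = defaultdict(list)
    mapped = list(map_users((1, "alice"))) + list(map_users((2, "bob"))) \
           + list(map_visits((1, "a.com"))) + list(map_visits((1, "b.com")))
    for uid, tagged in mapped:
        groups[uid].append(tagged)
    for uid, tagged_values in groups.items():
        print(list(reduce_join(uid, tagged_values)))
    # DBMS equivalent: SELECT u.name, v.url FROM users u JOIN visits v ON u.user_id = v.user_id;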

Hadoop is easier to configure, while configuration takes longer for the parallel RDBMSs. Since programmers are familiar with object-oriented programming languages, it is much easier for them to get a Hadoop task working. For a parallel DBMS, there are more knobs to turn, with more complexity and flexibility.

This paper gives insight into the performance of MapReduce, and conducts many interesting experiments to compare the performance of different data analysis approaches: MapReduce and parallel DBMSs. However, a few points remain uncertain:
1. MapReduce is a newly developed approach. It cannot be expected to have a relatively mature toolbox for many different tasks, like join in an RDBMS. It is possible that more useful and efficient functional modules will be added to Hadoop.
2. It is possible that, in the experiments, the code for MapReduce was not fully optimized for the task and that parameters like the number of output files could be tuned for better performance. Though the results might not be fully accurate for this reason, this may not change the comparison with the other approaches.
3. MapReduce is good at fault tolerance. This is stated in the paper, but I think it would be more convincing to run large-scale tasks with random failures among the nodes (if feasible).


Review 10

Problem/Summary:

MapReduce is very popular, but much of its functionality is similar to that of parallel databases: input data is filtered (selected), collected by key (joined), and reduced (aggregated). This paper discusses some of the differences between MapReduce and parallel databases, tests the performance of both, and discusses how those differences contribute to the performance gap.

MapReduce stores its input as schema-less text files on a file system. This means that it must parse the data when the job starts. Additionally, the MapReduce job controller doesn't know what is in the input files and cannot make scheduling decisions based on this information. In contrast, parallel databases store data in tables before any queries are run. They also create indexes and partition the data based on attributes. This means that when queries are run, the query planner can use its knowledge of the partitions to construct an efficient query plan. This optimization is also possible because SQL is a set-at-a-time language, while MapReduce uses record-at-a-time manipulation, which doesn't allow much automatic optimization. For these reasons, parallel databases performed much better than MapReduce on most of the tests run by the authors of this paper.
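A toy example of the partition-awareness point (the partitioning scheme and data are invented): if the table is hash-partitioned on the filter column at load time, the planner can touch a single partition, whereas an MR job with no such metadata scans every split.

    NUM_PARTITIONS = 4
    partitions = [[] for _ in range(NUM_PARTITIONS)]

    def load(rows):
        # Done once at load time, like a parallel DBMS partitioning on the key.
        for key, value in rows:
            partitions[key % NUM_PARTITIONS].append((key, value))

    def dbms_lookup(key):
        # The "planner" prunes to the one partition that can contain the key.
        return [v for k, v in partitions[key % NUM_PARTITIONS] if k == key]

    def mr_style_scan(key):
        # No knowledge of the layout: scan every split.
        return [v for part in partitions for k, v in part if k == key]

    load((i, f"row-{i}") for i in range(1000))
    assert dbms_lookup(42) == mr_style_scan(42)
    print(dbms_lookup(42), "- touched 1 of", NUM_PARTITIONS, "partitions")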

The authors also considered some non-performance differences between the two systems. Parallel databases require more effort to set up a job, since data must first be loaded, and schemas have to be defined. In MapReduce, data loading is fast and no schemas are required. MapReduce also provides more fault tolerance than parallel databases, since MapReduce jobs only need to restart a Map or Reduce task when a machine fails, while a parallel database must restart the entire query. Parallel database technology is more mature, which means that many useful operators have been defined and implemented that save development time.

Strengths:

This paper presents many insightful differences between parallel databases and MapReduce, and suggests how both technologies can learn from each other to further develop.

Weaknesses:

The UDF test was convoluted by the hack-ish implementation they had to go with to get all three systems to perform the same functionality.



Review 11

The paper addresses questions that many readers who are familiar with databases might already have in their minds: How do existing parallel databases and the MapReduce programming model line up? What are the pros and cons of each approach? The paper attempts to answer these questions by comparing the performance of Hadoop and row- and column-store parallel databases in a number of scenarios.

The experiments are conducted using a cluster of 100 nodes. The paper argues that this experimental setup is sufficient, as it is not likely that many Hadoop MapReduce users require 1,000 nodes, and it reasons that many parallel databases handle petabytes of data with fewer than 100 nodes. I think this argument may or may not hold. It is true that not many companies currently handle data so large that it requires thousands of nodes. However, we must not forget that the amount of data generated on the Internet is constantly growing. While this does not hurt the current validity of the paper's argument, it may be different in the future.

The various experiments performed in the paper demonstrate that Hadoop loads data much faster than the parallel databases, as it does not perform extra processing such as index creation, but the parallel databases outperform Hadoop by a significant margin when processing various queries, thanks to numerous built-in optimizations such as reducing data transfer whenever possible, data compression, and so on. The authors also comment that it is much easier to configure and run Hadoop than the parallel databases. The paper does a good job of comparing MapReduce and parallel databases from different perspectives, clearly demonstrating the advantages and disadvantages of each data processing model.

The paper basically suggests that MapReduce is good for ad-hoc parallel jobs, where a user does not need to define schemas for the data and there is no sharing with other programs. Whenever performance matters and sharing of data between programs is required, parallel databases become a better choice than MapReduce. From the experimental results of the paper, I think this argument is fairly reasonable. I am just curious whether it is possible to combine the two data processing models, taking the advantages of each while reducing their disadvantages.

To sum it up, the paper demonstrates advantages and disadvantages of MapReduce and parallel databases from an extensive set of experiments. Readers should be able to learn from the paper that there is no single best approach even if most metrics in the experiment demonstrate that parallel databases are better. There are many other aspects to consider as the paper mentions from users’ point of view, things like ease of configuration and other tools available for the system.



Review 12

This paper compares different aspects of parallel and MapReduce databases and evaluates them on a variety of parameters: performance, fault tolerance, ease of use, start-up time, etc. The paper concludes that parallel databases outperform MapReduce by a factor of 2 or 3. Parallel databases are also easy to use, since a user only has to specify the results they want in SQL and the database takes care of the rest automatically. Furthermore, parallel databases have many properties that facilitate easy access to or sharing of data, such as fixed schemas and built-in indexing. MapReduce, on the other hand, requires a programmer to have an understanding of the underlying data in order to create the Map and Reduce functions. However, MapReduce was found to be much easier to set up. One reason for this is its flexibility: MapReduce does not need the data to be in a fixed format. Another advantage of MapReduce is that its fault-tolerance mechanism results in significantly less work lost due to hardware failure, since it simply re-executes the work that was lost. However, the paper notes that this tolerance comes at the cost of storage space.

I like the fact that the paper does not simply compare performance between the two architectures, but also looks at other details, such as ease of use, that may not usually be considered. However, it seems a little unfair to compare an open-source implementation of MapReduce to commercial implementations of parallel databases. The paper states that it doesn't think commercial implementations, such as the one Google is using, would significantly change the results, but to me this assumption doesn't validate the performance results. I would have also liked a section in the conclusion summarizing the advantages and disadvantages of each architecture. This paper was written with the intention of comparing the two, and it would have been useful to have all the points summarized at the end in a clear, well-defined manner.


Review 13

This paper focuses on the comparison between MapReduce (Hadoop) and parallel DBMSs (DBMS-X and Vertica) to see how they perform differently on large-scale data analysis. These two approaches have some common elements, but there are also some differences; for example, they both distribute data into several partitions but DBMSs require the data to have more well-defined schema.

MapReduce consists of two user-defined functions: a "map" function that generates intermediate key/value pairs and a "reduce" function that merges values with the same key. The input data and the intermediate data are distributed over several splits (M map tasks and R reduce tasks), so the map workers and the reduce workers are able to work in parallel.
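A minimal, self-contained sketch of that model (a simple group-by-sum; the field names are illustrative, not the paper's benchmark code): map emits intermediate key/value pairs, the framework groups them by key, and reduce merges the values for each key.

    from collections import defaultdict

    def map_fn(record):
        source_ip, revenue = record          # illustrative fields
        yield source_ip, revenue

    def reduce_fn(key, values):
        return key, sum(values)

    def run_job(records):
        grouped = defaultdict(list)          # stands in for the shuffle/sort between the M maps and R reduces
        for rec in records:
            for k, v in map_fn(rec):
                grouped[k].append(v)
        return [reduce_fn(k, vs) for k, vs in grouped.items()]

    print(run_job([("10.0.0.1", 5), ("10.0.0.2", 3), ("10.0.0.1", 7)]))
    # [('10.0.0.1', 12), ('10.0.0.2', 3)]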

In parallel DBMSs, the query optimizer transforms queries into query plans whose execution is distributed across multiple nodes. This paper states that the DBMSs are much faster and simpler to use than MapReduce, but they need more time to tune and load the data. This is because parallel DBMSs reorganize the input data at load time; they can thus perform optimizations on the data to speed up operations in subsequent tasks.

Based on the experiments performed in this paper, both of the parallel DBMSs outperform MapReduce. The major reason for the performance difference between MapReduce and the parallel DBMSs is that MapReduce needs to scan the entire input data.

Contribution:
1. This paper gives a concise introduction to MapReduce and parallel DBMSs, and shows the commonalities as well as the differences between them.
2. They perform five different tasks over 100 nodes to show that parallel DBMSs outperform MapReduce.

Drawback:
1. The experiments in this paper do not report the memory usage of each task. It would be better if we could see whether there is a trade-off between memory and run time, since parallel DBMSs can use redundancy and materialized views to speed up their query processing.


Review 14

Purpose of the Paper

This paper was written to compare the complexity and performance of MapReduce and parallel SQL databases. The paper concludes that loading and tuning took longer for the parallel DBMSs than for the MapReduce system, but that, overall, the performance of the parallel DBMSs was better. The paper suggests that future systems can take the advantageous characteristics of both kinds of architectures.


Details of the Comparison of MapReduce and Parallel SQL

Both parallel DBMSs and MR provide high-level programming environments and parallelize readily. SQL DBMSs require that data conform to a well-defined schema, while MR permits data to be in any arbitrary format. Although MR users can structure the data in any way they choose, they have to put in the extra work of writing a custom parser, and they must manually write support for complex data structures such as compound keys. Different users of MR also have to agree on a schema and on integrity constraints. SQL, in contrast, separates the schema from the application and defines the schema and complex data types in one place, saving users the extra work of agreeing on integrity constraints.

There are also differences in how each system provides indexing and compression optimizations, programming models, data distribution, and query execution strategies. In MR, the input data is stored in a collection of partitions in a distributed file system deployed on each node in the cluster. The Map function reads a set of records from the input file, filters the data, and outputs a set of intermediate records, and a split function partitions those records into disjoint buckets by applying a function to the key of each output record; a Reduce function then processes each bucket. In parallel SQL systems, tables are partitioned over the nodes in a cluster, the system uses an optimizer that translates SQL commands into a query plan, and execution is divided among multiple nodes.

MR is suited for environments with a small number of programmers, but is not as fitting for larger projects. MapReduce does not provide built-in indexes; programmers must implement any indexes they need to speed up access to the data inside their applications, and the specifications of those indexes must be communicated between programmers. It is thus more desirable to store indexes in a standard format in system catalogs, as SQL systems do. MR is reminiscent of CODASYL programming because users have to write algorithms in a low-level language to perform record-level manipulation. A parallel SQL DBMS has a parallel query optimizer that balances computational workloads and minimizes the data transmitted over the cluster network. MR, on the other hand, distributes work manually, and filtration is done after the statistics for all documents are computed and shipped to the reduce workers; SQL applies filters on each node first and then groups the documents from the same sites together, saving I/O. The systems also differ in execution: MR reduce workers try to read the input files from the same map node simultaneously, inducing large numbers of disk seeks and slowing the disk transfer rate, while a parallel SQL DBMS does not materialize the split files and pushes rather than pulls the transferred data. MR is easier to set up and handles failures better, but incurs a performance penalty on tasks that need to manually split each value by a delimiter character into an array of strings.


Strengths of the Paper

I enjoyed the paper because it provides a comprehensive set of tests to compare MR and parallel SQL, unlike the MapReduce paper, which just ran the grep test and did not consider how MR performs when there is a need to manually split each value by a delimiter character.


Limitations of the Paper

I would've liked to see the paper clarify how it knew it had written a good parser for the MapReduce experiments. I would've also liked to see more explanation of how the authors know that Google's version of MapReduce, which they acknowledge performs better than Hadoop's, would still underperform parallel SQL.



Review 15

This paper is titled "A Comparison of Approaches to Large-Scale Data Analysis," which is a very appropriate one-line summary of what it accomplishes. The paper makes the point that the basic paradigm of MapReduce has existed for a long time and that parallel SQL databases implement a similar model. It looks at the differences between these paradigms and compares them in terms of performance and complexity, taking a set of benchmark tasks and running them on a 100-node cluster. The five benchmark tasks consisted of the Grep task, data loading, selection, aggregation, and the join task. The authors discuss the outcomes of these tests and conclude that the parallel systems showed a significant performance advantage over Hadoop, and additionally that Hadoop provides a system that is easier to set up and more flexible in terms of fault tolerance.

The strength of this paper comes from its recognition of the similarity between parallel databases and the MapReduce paradigm, and its ability to use previously published tasks and datasets to compare these two types of systems on similar work. The authors also make a good point of tackling common objections to their course of action. On page 2 they state that even though they are using 100 nodes, they should obtain results representative of real-world deployments and be able to estimate the kind of results they would obtain using 1,000 nodes, even though they admit it "is not at all clear how many MR users really need 1000 nodes".

This paper was published in 2009, which means it came out five years after the first MapReduce paper. This concerns me because, in justifying their arguments, the authors state that they are using the original MapReduce task. A lot can change in the five years after a popular idea is released to the public. They do examine a few other tasks, but I am not convinced that parallel databases are better in performance than Hadoop for all somewhat real-world tasks.



Review 16

Part 1: Overview

This paper presents a survey of popular large-scale data analysis methods. Database systems came into the industry when data was treated as having defined schemas; MapReduce appeared because Google needed a new programming model for complex problems over large-scale data. MapReduce borrows many crucial ideas from parallel database systems, including partitioning data sets and hashing data to redistribute records with identical key values to the same node for subsequent processing. However, parallel databases perform poorly in fault tolerance, as queries fail if some nodes fail. Parallel databases also suffer in scalability; it is hard for parallel databases to be elastic.

There is a trend of moving from Hadoop to Hive, which is closer to a parallel database, and parallel databases are moving toward MapReduce as their elasticity improves. Streaming database systems are becoming a hotspot, as standard repeating queries are becoming important in practice. To support large amounts of data in a distributed environment, several elements should be included: schema support, indexing, the programming model, data distribution, and execution strategies should all be taken into consideration.

Part 2: Contributions

Experiments are done to show that parallel DBMSs outperform Hadoop on the selection and join tasks. Both the Grep task and the analysis tasks are run, and we can see the comparison between the popular large-scale data processing mechanisms.

Databases nowadays all take advantage of B-tree indices to speed up the execution of selection problems. Aggressive compression techniques, with the ability to operate directly on compressed data, are also shared by large-scale databases. In addition, both MapReduce and parallel databases utilize parallel algorithms for querying large amounts of relational data. This paper summarizes these similarities.

Part 3: Drawbacks
As described in the paper, the platform is a 100-node Hadoop cluster running Java 1.6, compared against shared-nothing DBMSs. Hadoop 0.19.0 is rather out of date by now; things change fast in this area.



Review 17

===overview===
In this paper, a read-optimized relational DBMS called C-Store is presented. It contrasts sharply with most current systems, which are write-optimized. There are many other differences, and the most important ones include:
1. data is stored by column rather than by row
2. careful coding and packing of objects into storage while processing queries
3. the same column can be stored in multiple collections of column-oriented projections
4. wide use of different data structures to improve availability, such as bitmap indexes to complement B-tree structures
5. a unique combination of different techniques

The design focuses on reducing disk accesses for reads while still supporting general updates. The paper also presents the data model implemented by C-Store and explains the two major parts of the design: the RS and the WS, which stand for the read-optimized store and the writable store respectively, connected by a high-performance tuple mover. The new storage and partitioning scheme makes query optimization and handling locality more difficult, and the paper addresses those problems as well.

In summary, C-Store provides a whole new view of how to optimize a database for read queries. At the end, the paper also shows that on TPC-H-style queries, C-Store outperforms alternative systems. The authors also mention that the potential overhead of the WS might be large. Even though the complete form of C-Store was not finished, the new perspective is inspiring.

===strength===
The paper provides some strong arguments for where column partitioning might be better than a traditional relational DBMS.

===weakness===
However, not many experiments are provided to show the potential. As stated in the paper, the project is not complete yet. Maybe we should wait for a more complete project and more experimental results.


Review 18

This paper makes a comparison between MapReduce (Hadoop) and parallel DBMSs (Vertica and DBMS-X). The authors claim that this comparison makes sense because both can do many similar things and handle similar data sizes, on the order of petabytes.
The paper starts by introducing MapReduce and parallel DBMSs, then discusses and compares the two on some important aspects:
1) Schema support is necessary from the authors' point of view because it is needed eventually; if there is no schema in the data file for a MapReduce job, the programmer still needs to write code to shape the data set.
2) Indexing is well supported in a DBMS as a way to accelerate computation, but is lacking in MR.
3) The authors claim that a DBMS has a better programming model because it is declarative.
4) Data distribution and execution strategy are better in a DBMS, and flexibility is equal, according to the authors.

After this analysis, the paper presents an actual performance comparison using benchmarks, with Hadoop as the representative of MapReduce and Vertica and DBMS-X as representatives of parallel DBMSs.
The results show that MapReduce has very short data loading times but relatively longer execution times.

Contribution:
This paper presented an interesting comparison between two very different systems. It makes people think about what exactly we are doing when we write MapReduce jobs or SQL queries.

Weakness:
I don't think this is a fair comparison. The tasks used are all common operations in a DBMS, such as selection, aggregation, and join. After years of optimization, of course the DBMS is faster. Also, the MapReduce programs used in this experiment might not be optimized. This paper is quite biased toward the DBMS in my opinion.



Review 19

This paper aims to evaluate the performance claims made for MapReduce in comparison to established parallel database systems. The authors conduct a thorough analysis, explaining each of the experiments in a detailed manner.

These are a few of the observations made by the authors:
1. MapReduce is built with scalability up to 1,000 nodes in mind. However, the authors believe that that amount of hardware is simply wasteful and that parallel database systems optimized for performance can do much better without it.
2. If there is no sharing involved and the processing is a one-time process, MapReduce can be a lot faster and advantageous. However, if there is sharing involved, there may be too much overhead in trying to understand the schema of the data, in comparison to a parallel DBMS, where you can query for the schema just as you can query for the data, so there is no such overhead.
3. MapReduce is a plain brute-force method of implementation: it has many servers it can use, but there is no way to tune the performance of a server according to the data it is processing.
4. MapReduce definitely has a more mature failure model than the typical parallel database systems and can handle and recover from a failure much more gracefully.
5. It is definitely easier to begin writing an MR program from a developer's point of view; however, it would be very difficult to maintain or reuse the code for two different deployments.

The experiments, as I mentioned earlier, were very clear and detailed, and the authors have done a tremendous job of describing them, not omitting any details and therefore giving the reader a chance to understand the complete theory behind each experiment. It is impossible not to see how the parallel DBMSs fare far better than Hadoop, except on the UDF aggregation task. However, as the authors themselves mention, since they used Hadoop, which is not the purest form of MapReduce, the results are not necessarily exact representations of the paradigm. The authors claim that this system has failed to learn any of the lessons the DBMS community has learned over the years. But I honestly think it is a matter of implementation, and accordingly you would choose either MapReduce or a parallel DBMS.

Overall, the authors have definitely pointed out specific improvements that MapReduce could incorporate in order to be as efficient and competent as the well-established parallel database systems. Whether that would sacrifice the simplicity of implementing it remains to be seen.



Review 20

A Comparison of Approaches to Large-Scale Data Analysis

In this paper, the authors discuss the two approaches to large-scale data analysis. To begin with, they propose the idea that the current enthusiasm for the MapReduce paradigm is a major step backwards, because they believe its basic control flow has existed in parallel SQL databases for over 20 years. The authors provide both theoretical reasoning and real-world test results to show that large-scale parallel databases outperform MapReduce dramatically.

In detail, the paper systematically lists the differences across many design considerations:
1. Schema support: a parallel database needs data to conform to the relational paradigm, but MapReduce is schema-free. MR needs a custom parser in order to derive the appropriate semantics for its input records; when no sharing is anticipated, the MR paradigm is quite flexible.
2. Indexing: a parallel DBMS uses hash or B-tree indexing, which reduces the scope of the search dramatically, and most database systems also support multiple indexes per table, while MR doesn't support any indexes (see the sketch after this list).
3. Programming model: a parallel DBMS uses a high-level relational model where the user can "state what you want", but MR forces the user back to the CODASYL era, writing algorithms in a low-level language to perform record-level manipulation.
4. Data distribution: a parallel DBMS sends the computation to the data, whereas the MapReduce framework needs to pass intermediate data between its two stages of computation.
5. Execution strategy: a parallel DBMS uses a push mechanism to transfer data (no materialization of the split files), while MR uses a pull mechanism to draw in input files, which induces large numbers of disk seeks.
6. Flexibility: a parallel DBMS has less flexibility, yet is more robust and its queries are less complex.
7. Fault tolerance: MR loses less work when a node fails, since only the affected tasks are re-run, while a parallel DBMS typically has to restart the whole query.
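For the indexing point in item 2, here is a small sketch (the data and keys are invented) using a sorted array with binary search as a stand-in for a B-tree: the DBMS-style lookup descends the index in logarithmic time, while the MR-style path scans every record.

    import bisect

    rows = sorted((k, f"payload-{k}") for k in range(0, 1_000_000, 3))
    keys = [k for k, _ in rows]               # the "index" over the sort key

    def indexed_lookup(key):
        i = bisect.bisect_left(keys, key)     # O(log n), like a B-tree descent
        if i < len(rows) and keys[i] == key:
            return rows[i][1]
        return None

    def full_scan(key):
        for k, payload in rows:               # O(n), what a brute-force map task effectively does
            if k == key:
                return payload
        return None

    assert indexed_lookup(300_000) == full_scan(300_000) == "payload-300000"
    print(indexed_lookup(300_000))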

To sum up, based on the above inherent advantages, the parallel DBMSs, represented by Vertica, outperform Hadoop in various benchmarks, including selection, aggregation, join, and even the MR grep benchmark.

However, as the authors admit, the major drawback of traditional parallel DBMSs is that they operate on well-defined schemas and only support limited input data types, whereas in today's web environment MR is more convenient for processing "schema later" or "no schema" data. Also, given the size of today's online data sets, choosing a one-hundred-node cluster is not a brilliant idea. Maybe because the authors are mostly from traditional DBMS vendors, they regard a DBMS as a system to be sold; but the truth is that the MapReduce clusters used by Google consist of far more than 100 nodes, and they are not for sale. The last weakness of this paper is the performance test criteria: since most MapReduce jobs only need one run to produce their results, we don't need to consider the case of running the same job many times on the same data set. More importantly, if the MR job runs only once, the loading overhead should be added to each execution of the benchmark, which would show that the traditional DBMS is slower.



Review 21

This paper talks about two large-scale data analysis paradigms: MapReduce and parallel databases. The paper describes and compares the performance and architecture of both and defines a benchmark for evaluating the two paradigms. The results show some trade-offs and highlight the advantages and disadvantages of MR and parallel databases.

MapReduce is well suited to development environments with a small number of programmers and a limited application domain. MR doesn't require a schema and has no logical data independence, and the MR model does not provide built-in indexes. A programmer may need to write multiple MapReduce jobs to perform a complex task and do the optimization, and MR doesn't provide transactions.
Parallel databases have schemas, and data must fit the schema when loaded; parallel databases also have built-in indexes. The parallel database has a less sophisticated failure model than MR.
The paper then presents its benchmark experiments on MR and two parallel databases: DBMS-X, a parallel shared-nothing row store from a major vendor, and Vertica, a parallel shared-nothing column-oriented database.

In conclusion, MR is much easier to set up and its load times are faster, but its query times are a lot slower than the parallel databases'. MapReduce is thus suited for one-off processing tasks where fast load times are important and there is no repeated access, while a parallel database is for data that may be re-analyzed again and again.

Strength:
The paper analyzes the two different models for large-scale data processing and compares them. It lays out the advantages and disadvantages of each paradigm and shows the situations in which one is better than the other.

Weakness:
When comparing parallel databases and MR, the paper should mention some potential improvements for both. For example, it is possible to build a better optimizer for MR that does indexing and stores some metadata, and it is possible to have simpler tuning configurations for parallel databases. Also, I think the heavy start-up overhead of MR tasks would be very bad in some situations (e.g., high-frequency stock trading).



Review 22

This paper presents a comparison of MapReduce to more “traditional” relational database systems. The authors discuss high-level architectural similarities and differences between the two approaches to large-scale data analysis. They deploy Hadoop, a parallel DBMS from a relational DBMS vendor, and Vertica on a 100-node cluster. The experiments they run show that, except for the initial loading of the data, the DBMSs run multiple times faster than Hadoop MapReduce.

The main contribution of this paper is that it compares MapReduce with parallel relational database systems for data analytics. It notes the strengths of each approach and conducts a performance comparison.

This paper conducted a fairly comprehensive set of experiments on a large dataset. It also gave a thorough discussion of the advantages of each approach. However, this paper demonstrated several weaknesses:

1. It conflated the use of MapReduce with GFS/HDFS. MapReduce can use a variety of data sources, e.g., BigTable/HBase, which store sparse tables.

2. The experiments ran on dedicated hardware. MapReduce is designed to run on shared clusters where stragglers are common and failures must be dealt with regularly. This brings overhead that clearly biases the results toward the DBMSs, which improve performance by sacrificing reliability (e.g., not materializing intermediate data to disk and restarting entire queries on node failure).

3. Some of the arguments presented are subjective or only apply to specific environments. For example, the authors mention that schemas are important for development. However, MapReduce was developed for an environment where all developers share a single repository and data is structured (eventually as protocol buffers) in a way that allows developers to use the same code to read and write the data, effectively providing a data schema.


Review 23

This paper compares MapReduce to parallel databases and explains the pros and the cons of each for different types of workloads. When MapReduce was first introduced, it was seen as a completely new programming model. However, parallel databases have been around for much longer and we want to know which would work better in what circumstances.

First of all, there are some differences between MapReduce and parallel databases that we should address. Parallel databases require the schema of the database to be relational, while MapReduce does not enforce a schema. Parallel databases also have indexes built from hash tables or B-trees, while MapReduce does not have any built-in indexing. MapReduce does not have a smart way to distribute data, as the programmer needs to handle this manually, whereas parallel databases use data distribution to their advantage. MapReduce has a more sophisticated failure model that can recover from node crashes, while parallel databases usually restart the entire query when there is a failure.

With these differences in mind, we can look at certain workloads and compare the runtimes of Hadoop, Vertica, and DBMS-X. For the traditional grep problem, loading the data into Hadoop runs significantly faster than loading into either Vertica or DBMS-X. When actually executing the grep task, though, Vertica and DBMS-X beat Hadoop in runtime. On all of the other tasks, Hadoop still had the poorest runtime.
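
To make the contrast concrete, here is a minimal, hypothetical sketch (my own toy code, not the paper's benchmark implementation; the table and field names are invented) of how the grep task might be expressed in each system: a single declarative statement for the DBMSs versus a hand-written, map-only filter for MR.

```python
# Parallel DBMS: one declarative statement; the engine plans the parallel scan itself.
GREP_SQL = "SELECT * FROM grep_data WHERE field LIKE '%XYZ%';"

# MapReduce: the programmer writes the scan logic; this task needs no reduce phase.
def grep_map(record: str):
    """Emit the record unchanged if it contains the target substring."""
    if "XYZ" in record:
        yield record

def run_map_only(records, map_fn):
    """Local stand-in for the framework driving the map phase over an input split."""
    for record in records:
        yield from map_fn(record)

if __name__ == "__main__":
    sample = ["abcXYZdef", "no match here", "XYZ near the start"]
    print(list(run_map_only(sample, grep_map)))  # the two matching records
```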

Overall, I think that this paper does a decent job at comparing MapReduce with parallel databases. However, there are some concerns that I have with the approach:

1. Hadoop implements a bare-minimum version of MapReduce, so there isn’t much optimization, whereas Vertica heavily optimizes how jobs are parallelized. Would a better implementation of MapReduce make it faster than the parallel databases?
2. The study only goes up to 100 nodes, and we know that MapReduce works a lot better on larger clusters. How does MapReduce scale in comparison to parallel databases up to thousands of nodes?


Review 24

This paper gives a comparison of the merits of MapReduce versus parallel database management systems. Many proponents of MapReduce argue that it offers high performance with an interface that is easy for programmers who don't have experience writing code for distributed systems. Proponents of parallel DBMSs argue that most of the applications in which MapReduce supposedly excels can be handled with SQL and that parallel DBMS offer higher overall performance.

The authors compare many aspects of the two systems. One of the arguments for parallel DBMSs is that declarative languages such as SQL provide a reusable high-level syntax that can be adapted to many disparate situations. The API for MapReduce, on the other hand, provides only a few functions, chiefly Map() and Reduce(). This leaves users having to specify how data should be structured, how indexes should be created and maintained, and so on, and it makes code difficult to share between applications that use MapReduce, since the interface for each application may be slightly different. SQL, by contrast, provides a standard syntax so that code can be easily shared and understood by programmers other than the original author. The authors make an excellent point in comparing the arguments for MapReduce to the arguments given for Codasyl in the 70s; namely, MapReduce provides more flexibility while parallel DBMSs offer uniformity. In their performance comparisons, the authors found that parallel DBMSs such as Vertica and DBMS-X outperformed MapReduce by up to an order of magnitude, depending on the application.
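
As a rough illustration of that point (a hedged sketch modeled loosely on the paper's aggregation task; the record layout and names here are my own), the declarative version is one reusable statement, while the MR version bakes the data layout and the aggregation logic into application code.

```python
# Declarative version: one statement any SQL programmer can read and reuse;
# the DBMS decides how to partition, index, and parallelize it.
AGG_SQL = "SELECT source_ip, SUM(ad_revenue) FROM user_visits GROUP BY source_ip;"

# MapReduce version: the data layout and the aggregation both live in user code,
# so another application needs to know these exact conventions to reuse them.
def visit_map(record: dict):
    # The programmer chooses the key; other jobs may key the same data differently.
    yield record["source_ip"], record["ad_revenue"]

def revenue_reduce(source_ip: str, revenues):
    # The programmer also writes the aggregation that SQL expresses as SUM(...).
    yield source_ip, sum(revenues)

if __name__ == "__main__":
    print(list(visit_map({"source_ip": "1.2.3.4", "ad_revenue": 0.5})))
    print(list(revenue_reduce("1.2.3.4", [0.5, 1.0])))  # [('1.2.3.4', 1.5)]
```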

My chief complaint with this paper is that the authors seem slightly biased in favor of parallel DBMSs. While the many reasons they listed supporting parallel DBMSs over MapReduce were valid, and while they did offer some evidence in favor of MapReduce, I felt that they did not emphasize the importance of application and environment specificity quite enough. Many of their arguments in favor of MapReduce were nearly afterthoughts, tacked on to the end of a paragraph, while the rest of the section delved more deeply into aspects in which parallel DBMSs excelled.


Review 25

This paper compares various aspects of the MapReduce paradigm and parallel SQL database management systems. Since MapReduce has become more and more popular, it is natural to ask why one would not simply use a parallel DBMS. The paper therefore investigates the differences between the two paradigms in many aspects, including schemas, indexing, programming model, execution strategy, flexibility, and fault tolerance. In addition, the paper conducts experiments for performance comparison and discusses the reasons for the differences.

First, the paper describes some differences between MapReduce and parallel SQL DBMSs. For example, all DBMSs require a well-defined schema, while MR permits data to be in arbitrary formats. For indexing, modern DBMSs use hash or B-tree indexes to accelerate access to data. The MR model, on the other hand, is so simple that it does not provide built-in indexes, so programmers have to implement indexes themselves to speed up data access in their applications. For fault tolerance, the MR framework provides a more sophisticated failure model than parallel DBMSs.
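
A small, hedged illustration of the indexing point (my own example, not from the paper; the names are invented): in a DBMS the index is declared once and used by the optimizer automatically, whereas an MR programmer who wants index-like lookups has to build and maintain the structure in application code.

```python
from collections import defaultdict

# DBMS: declare the index once; selective queries can then use it automatically.
CREATE_INDEX_SQL = "CREATE INDEX idx_page_rank ON rankings (page_rank);"

# MR-style application code: the programmer builds an in-memory index by hand.
def build_rank_index(records):
    """Group records by page_rank so later lookups avoid a full scan."""
    index = defaultdict(list)
    for record in records:
        index[record["page_rank"]].append(record)
    return index

def lookup_by_rank(index, rank):
    """Return all records with the given page_rank (empty list if none)."""
    return index.get(rank, [])

if __name__ == "__main__":
    data = [{"page_url": "a.com", "page_rank": 1},
            {"page_url": "b.com", "page_rank": 2},
            {"page_url": "c.com", "page_rank": 1}]
    idx = build_rank_index(data)
    print(lookup_by_rank(idx, 1))  # the two rank-1 records
```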

After describing these differences, the paper compares the systems' performance and the reasons for the performance differences. Broadly, the parallel databases have the performance advantage: B-tree indexes speed up selection operations, and sophisticated parallel algorithms query large amounts of relational data efficiently. On the other hand, MR does a superior job of minimizing the amount of work that is lost when a hardware failure occurs. However, this comes at the cost of materializing the intermediate files between the map and reduce phases.
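
The sketch below (my own simplification, not Hadoop's actual shuffle code) shows why that trade-off exists: map output is partitioned and written to local files, so a failed reduce task can simply re-read its partition, but every job pays for the extra disk writes and reads.

```python
import json
import os
import tempfile

def materialize_map_output(pairs, num_reducers, out_dir):
    """Write each (key, value) pair to a partition file chosen by hashing the key."""
    files = [open(os.path.join(out_dir, f"part-{r}.jsonl"), "w")
             for r in range(num_reducers)]
    try:
        for key, value in pairs:
            files[hash(key) % num_reducers].write(json.dumps([key, value]) + "\n")
    finally:
        for f in files:
            f.close()

def read_partition(out_dir, reducer_id):
    """A reduce task (or its restarted replacement) re-reads its partition from disk."""
    with open(os.path.join(out_dir, f"part-{reducer_id}.jsonl")) as f:
        return [tuple(json.loads(line)) for line in f]

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        materialize_map_output([("a", 1), ("b", 2), ("a", 3)], num_reducers=2, out_dir=d)
        print(read_partition(d, 0), read_partition(d, 1))
```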

The strength of the paper is that it provides a clear comparison between MapReduce and parallel DBMSs in various aspects, including schemas, indexing, programming models, data distribution, flexibility, and fault tolerance. It gives readers an idea of the differences between the two paradigms, and therefore of the design considerations they face when they have to implement such systems.

The weakness of the paper is that, when describing the differences between MapReduce and parallel databases, it does not provide many examples to illustrate the ideas. I think it would be better to illustrate the differences with concrete examples.

To sum up, this paper compares various aspects between MapReduce paradigm and parallel SQL database management systems. It also compares their performance and discusses the reasons for the performance difference.



Review 26

This paper was a comparison of MapReduce and parallel DBMSs for large-scale data analysis, and of the differences and benefits between the two. It compared three systems for performing large-scale tasks: Hadoop, DBMS-X, and Vertica. Hadoop is the leading open-source MapReduce implementation, DBMS-X is a leading parallel DBMS, and Vertica is a leading columnar parallel DBMS.

This paper highlighted what Hadoop was by far the best at, what Vertica was by far the best at, and what DBMS-X was the best at (which wasn’t really anything). It had a good discussion of what Hadoop severely lacks and showed some of the potential of using MapReduce instead of the usual parallel DBMS architecture.

I think this paper was really good for building understanding and broadening horizons, showing that there are multiple ways to solve a problem. The success of MapReduce led to considerable overuse of it for things that could be done better without MapReduce, and this paper does a good job of showcasing that.

I think a weakness of this paper was that it went into more depth than its goal required. As I see it, the goal was to show that MapReduce isn’t right for every type of problem and that sometimes just using a parallel DBMS is a better solution. That could have been demonstrated with either fewer examples or a less in-depth analysis of the given ones.

Overall, I think this was a solid paper and it did a good job of making me (and probably many in industry more importantly) reevaluate usage of MapReduce and show that it’s not always the best thing.



Review 27

This paper, as the title indicates, is not really an introduction of a novel system but a comprehensive review of large-scale big data processing systems and their architectures. While there are several different approaches to analytical workloads, the systems described may not exactly be transactional (e.g. MapReduce), despite the fact that they perform many bulk operations and queries over large portions of data. Much of this paper describes parallels and differences between the MapReduce architecture and other distributed/parallel DBMS architectures.

In terms of programming, although MapReduce does not provide native indexing support or optimization for MR query jobs, it is usually easy to express complex tasks as a set of simpler MR jobs, tasks that would normally require more complex joins/subqueries/user-defined functions in a traditional DBMS. This makes MapReduce roughly as expressive as SQL, though without a high-level abstraction layer. In addition, if queries are interrupted by failures in the middle of a job, this must be handled explicitly in traditional DBMSs, whereas MapReduce took such failures into account during its design. The paper then runs performance benchmarks on three systems (Hadoop, Vertica, DBMS-X) across different tasks. The conclusions they draw indicate that MapReduce and Hadoop are nice for out-of-the-box processing and ease of use in situations where load time needs to be fast for singleton-esque queries without repeated access. However, the DB systems still seem very competitive, especially in terms of scalability, despite MapReduce’s claims to scale much better.

It is very reminiscent of the earlier Stonebraker papers on why DB systems get picked up (i.e., because of popularity). That is, since Hadoop is a massive open-source Apache project that can be used for large-scale parallel processing, it has been widely adopted as the go-to architecture. In addition, it is very intuitive and easy to set up (having worked with it myself), but queries are massively slower due to parsing/indexing/execution overhead. As this is more of a review, I thought the paper gave a good summation of the situations in which each architecture is optimal (scalability, parallelism, etc.); what would have made it better is a more thorough analysis, comparing more systems or executing more complex benchmarks (something more than a simple Grep).


Review 28

The paper explains the performance comparison of two approaches to large-scale data analysis: MapReduce and the parallel DBMS. It briefly explains the concept of each approach and then compares MapReduce (implemented in Hadoop) with two parallel DBMSs (DBMS-X and Vertica), with the overall result leaning toward the parallel DBMSs, although not without significant trade-offs. The paper then discusses the factors that contribute to performance for each system.

First, the paper explains the concepts of MapReduce and the parallel DBMS. Second, it elaborates on the architectural elements where the systems differ (schema support, indexing, programming model, data distribution, execution strategy, flexibility, fault tolerance). After that, it shows the benchmark results for the three systems. There are the original MapReduce task (the Grep task) and analytical tasks (a selection task, an aggregation task, a join task, and a UDF aggregation task). For each task, aside from comparing the performance of the main task, the paper also compares the data loading process (where MapReduce beats the other two). The paper then continues with a more detailed discussion of the factors that contribute to each system’s performance advantages and what the trade-offs are. The paper’s concluding remarks are: (1) at the scale of the experiments conducted, both parallel DBMSs displayed a significant performance advantage over Hadoop MR in executing a variety of tasks, because the two DBMSs are more mature and thus have more established features such as B-tree indexes, efficient storage mechanisms, aggressive compression techniques, and sophisticated parallel algorithms for querying large amounts of relational data; (2) Hadoop is much easier to install than the two DBMSs; (3) Hadoop is more extensible, especially in handling the UDF aggregation task; (4) MR has better fault tolerance, although it comes with a potentially large performance penalty; (5) the “schema later” concept in MR means that parsing records at run time is inevitable, and the user must write a custom parser for every query. Finally, the paper states that despite their differences, each system has its own strengths and weaknesses, and both will continue to be widely used and developed.
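
To illustrate point (5), the “schema later” concept, here is a hedged sketch (the field layout and delimiter are my own invention, loosely based on the paper's Rankings table) of the per-query parsing an MR programmer ends up writing, versus declaring the schema once at load time in a DBMS.

```python
# DBMS: the schema is declared once at load time; every later query reuses it.
CREATE_TABLE_SQL = """
CREATE TABLE rankings (page_url VARCHAR(100), page_rank INT, avg_duration INT);
"""

# MapReduce: each job re-parses the raw text at run time with its own parser.
def parse_ranking(line: str):
    """Custom parser for one hypothetical pipe-delimited record layout."""
    url, rank, duration = line.rstrip("\n").split("|")
    return {"page_url": url, "page_rank": int(rank), "avg_duration": int(duration)}

def high_rank_map(line: str, threshold: int = 10):
    record = parse_ranking(line)  # parsing cost is paid again in every query
    if record["page_rank"] > threshold:
        yield record["page_url"], record["page_rank"]

if __name__ == "__main__":
    lines = ["example.com|42|7\n", "other.org|3|12\n"]
    for line in lines:
        print(list(high_rank_map(line)))  # only the rank-42 record is emitted
```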

One major contribution of this paper is that it makes an almost exhaustive comparison between MapReduce and parallel DBMSs. Not only does it compare performance in each phase of a task (data loading and execution), it also explains the reason behind the result for each factor that could affect performance, so that readers understand what the real trade-offs are between these systems. Large-scale data analysis has boomed recently, and there are lots of new approaches to handling it more efficiently (compared with older or current methods), but not many of them compare the whole workings of two systems.

However, while this paper keeps pointing out that the parallel DBMS is an older and more established concept, maturity alone does not explain why it outperforms MapReduce. As a matter of fact, Vertica uses the C-Store architecture, which could be considered quite new, not to mention a radical departure from the traditional DBMS. The storage and logical model of C-Store differ significantly from a row-oriented DBMS (in this case, DBMS-X), and I think the C-Store architecture plays a big role in the performance results. Another point is that the performance of MapReduce itself also depends on the implementation it is built on (Google’s MR may well be better than Hadoop).



Review 29

The purpose of this paper is to compare MapReduce platforms to parallel DBMS platforms because, as in the “What Goes Around Comes Around” Michael Stonebraker paper we read early in the semester, he believes history is repeating itself again in this scenario.

The technical contributions of this paper come mostly from its comparing and contrasting of MapReduce platforms and parallel DBMSs. The paper provides specific examples of differences in implementation between the two systems. It also steps through the specific aspects of implementation and design decisions made in each system to compare and contrast those as well. I think one of the main disadvantages of MapReduce pointed out in these sections is its inability to handle a system with multiple users: to share resources between multiple users or multiple applications, the programmer must do a significant amount of additional work in both generating and maintaining shared structures. MapReduce also lacks the ability to process complex data types without significant programmer assistance, and this is what leads the authors to compare MapReduce to the CODASYL databases of the 1970s. While I think this paper points out important weaknesses in the MapReduce framework, I do not think it makes adequate concessions to MR given that it does have specific use cases in which it performs well. The authors present empirical results indicating that, though data loading tends to take longer, their traditional parallel RDBMS and a column-store DBMS both outperform MapReduce on their benchmark queries.

As far as strengths go, I do think this is a thorough comparison of the two systems. I also think that the authors do a good job justifying all of the design decisions they make in this setup. They do an especially good job of justifying their decisions that the reader might initially think are nonsensical, such as their choice of using only 100 nodes in their test system.

I think this paper has one main weakness: I wish the MapReduce camp had been given some space for a “rebuttal.” It is clear from the abstract that the authors are looking to pick a fight with the MapReduce community, and I honestly think this would be a stronger paper if the language didn’t make it seem like the one goal was to prove that MapReduce wasn’t that good. A more impartial approach would have produced stronger results.



Review 30

Review: A Comparison of Approaches to Large-Scale Data Analysis
Paper Summary:
This paper provides a thorough comparison between the then recently proposed MapReduce model and traditional parallel SQL DBMSs by assembling a benchmark test set and testing three candidate systems. From the results, it discusses the comparison, the factors behind the differences, and potential future work.
The motivation and background of this paper is the rise of the idea of “cluster computing” and the proposal of MapReduce, which provides a computational model for achieving this goal. The paper attempts to answer the question of whether MapReduce is actually better by testing and comparing the two types of models on a proposed benchmark. The comparison is done in multiple aspects, including performance, implementation difficulty, and system cost.

Paper Review:
In Section 2 the paper gives a brief review of both models. Each review is quite informative and clear; however, it could be even better with a direct side-by-side comparison between the two models.
The strength of the paper is that the comparison experiments are very thorough. It provides detailed results via diagrams and plots on many different tasks. It is interesting to see that the two parallel DBMSs can outperform MR so significantly, given that MR was then the superstar of the field. According to the paper, the strength of the MR model comes from its easy set-up, but it is questionable whether that is the only factor behind MR becoming so welcomed in the literature.
Meanwhile, the authors make some judgments against MR that seem rather unfair. For example, they mention that MR doesn’t provide a hashing/indexing mechanism, so programmers have to implement their own, and they argue that this can be inconvenient when the indexing code needs to be shared, because many programmers end up implementing the same algorithm redundantly. I really cannot see the force of this “inconvenience,” given that sharing an implementation is a matter of cloning a repository or simply copying and pasting code, unless I have missed some important factor behind this argument.



Review 31

In this paper the authors compare two popular approaches to large-scale data analysis: MapReduce (MR) and the parallel DBMS. In this paper, both architectures run on shared-nothing clusters. MR consists of two stages: the Map stage, which reads data from a distributed file system and performs filtering or transformation, and the Reduce stage, which aggregates the shuffled output of the Map stage. The parallel DBMS is a very different solution: it usually supports a strict relational schema and SQL, and it comes with standard database features such as indexing and transaction support. The authors also design a set of benchmarks to evaluate both approaches.
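
A minimal single-process sketch of that two-stage structure (my own toy stand-in for a real MR framework, using a word count rather than any task from the paper): the map stage turns input records into key/value pairs, a shuffle groups them by key, and the reduce stage aggregates each group.

```python
from collections import defaultdict

def word_map(line: str):
    # Map stage: read one record and emit intermediate (key, value) pairs.
    for word in line.split():
        yield word.lower(), 1

def count_reduce(word: str, counts):
    # Reduce stage: aggregate all values shuffled to this key.
    yield word, sum(counts)

def run_mapreduce(lines, map_fn, reduce_fn):
    shuffled = defaultdict(list)  # stand-in for the shuffle between the two stages
    for line in lines:
        for key, value in map_fn(line):
            shuffled[key].append(value)
    results = []
    for key, values in shuffled.items():
        results.extend(reduce_fn(key, values))
    return dict(results)

if __name__ == "__main__":
    print(run_mapreduce(["the quick fox", "the lazy dog"], word_map, count_reduce))
    # {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```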

The main differences between MR and the parallel DBMS include:
1. Parallel DBMSs require data to fit into the relational paradigm of rows and columns, while the MR model does not require a strict schema for the data.
2. Hash or B-tree indexes are supported in the parallel DBMS, whereas the MR model doesn’t provide built-in indexes; shared indexes are especially hard to support in MR.
3. The parallel DBMS uses a declarative programming model. MR is more similar to Codasyl in that it requires the programmer to provide low-level algorithms for record manipulation.
4. The parallel DBMS automatically handles data distribution and replication. MR relies on the programmer to define the data distribution strategy.
5. MR is good at supporting sophisticated workloads, while SQL is restricted by its syntax. MR also handles fault tolerance at a finer granularity than the parallel DBMS.

The main strength of this paper is that it systematically compares MR and the parallel DBMS, giving us an overview of the advantages and disadvantages of both systems. Another contribution is a set of benchmarks for evaluating large-scale data analysis tools.

One weakness of this paper is the lack of a breakdown of where time is spent. For example, the authors mention that MR spends a large amount of time retrieving data from disk, while the parallel DBMS uses indexes to save time. To better understand this effect, I would like to know how much time is spent on this part in each benchmark.