Many data analysts need to run data mining or machine learning algorithms on big datasets, which require a distributed DBMS to manage. MapReduce provides a useful framework for such tasks, with built-in tolerance for server failures, automatic load balancing, and partition management. MapReduce made it easier for data analysts to develop distributed jobs without worrying about the underlying machinery that ensures their jobs complete.
Each MapReduce job is limited to a task that can be expressed as an acyclic data flow graph. Specifically, a MapReduce job comprises a map step that produces a set of tuples, followed by a reduce step that derives the answer from those tuples. The Spark cluster computing system provides a distributed abstraction like MapReduce for programmers, but it allows jobs to operate repeatedly on a working set of data. This makes it more efficient to perform iterative tasks like gradient descent, a key part of machine learning algorithms such as a standard logistic regression implementation. Admittedly, MapReduce is capable of running iterative algorithms like gradient descent for logistic regression, but it requires repeatedly loading data from disk, because each iteration is a separate MapReduce job with its own load stage.
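To make the access pattern concrete, here is a toy, single-machine sketch of logistic regression by gradient descent (hypothetical data and step size, not the paper's code). The point is that every iteration rescans the same dataset: in MapReduce each scan would be a fresh job that reloads the data from disk, while Spark keeps the points cached in memory across iterations.

```python
import math

# Toy working set: (features, label) pairs kept in memory across iterations.
points = [((0.0, 1.0), 1), ((1.0, 0.5), 1), ((-1.0, -1.0), -1), ((-0.5, -2.0), -1)]
w = [0.0, 0.0]  # model weights

for _ in range(100):  # each iteration reuses the same cached working set
    grad = [0.0, 0.0]
    for (x, y) in points:
        # gradient of the logistic loss for one point
        scale = (1.0 / (1.0 + math.exp(-y * (w[0] * x[0] + w[1] * x[1]))) - 1.0) * y
        grad[0] += scale * x[0]
        grad[1] += scale * x[1]
    w = [w[0] - 0.1 * grad[0], w[1] - 0.1 * grad[1]]

# w now separates the two toy classes: positive points score > 0
assert w[0] * 1.0 + w[1] * 0.5 > 0
assert w[0] * (-1.0) + w[1] * (-1.0) < 0
```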
The Spark system allows programmers to request that data be cached in memory, so that it is quickly available for a series of operations, such as iterations of gradient descent or repeated grep operations. Spark shines in comparison to MapReduce when data can be reused from memory. For example, grep-ing a large distributed file in MapReduce takes tens of seconds because the file must be loaded on each machine, but a second pass of grep over the file in Spark returns within a second, as the data remains in memory on each machine. The key feature of Spark is the resilient distributed dataset (RDD), a partitioned dataset stored as the chain of transformations that derived it from some underlying store, allowing the data to be recreated on demand, such as when the server holding some partition crashes.
One limitation of the paper is that it does not explore whether MapReduce clusters could be altered slightly to allow caching of intermediate results, and whether this would largely negate the benefits of Spark. If MapReduce users could provide a cache flag for map or reduce outputs, then iterative algorithms could possibly run efficiently on a MapReduce cluster.
This paper focuses on Spark, a framework for cluster computing with working sets. MapReduce and its variants have been successful in implementing large-scale data-intensive applications on commodity clusters, but they are built around an acyclic data flow model, which is not suitable for iterative algorithms. Many common machine learning algorithms apply a function repeatedly to the same dataset to optimize a parameter. Interactive analytics is also impractical, since it requires loading a dataset of interest into memory across a number of machines and querying it repeatedly. MapReduce additionally requires an external stable storage system. All this, combined with substantial overheads from data replication and disk I/O, motivated the work on Spark.
The programming model consists of a driver program that implements the high-level control flow of the application and launches various operations in parallel. Parallel operations pass a function to apply to a dataset, and shared variables are used in functions running on the cluster. Resilient Distributed Datasets (RDDs), collections of objects partitioned across a set of machines, can be rebuilt even if a partition is lost. Users can alter the persistence of an RDD through the cache action (keeping it in memory because it will be used again) and the save action (evaluating the dataset and writing it to a distributed filesystem).
The paper is successful in highlighting the concepts behind Spark. It provides a simple and efficient programming model for a broad range of applications and leverages the coarse-grained nature of many parallel algorithms for low-overhead recovery. The argument is supported by relevant experiments pitting Spark against Hadoop.
The paper provides only a limited set of fault-tolerant distributed memory abstractions. It also does not discuss data sharing between distributed sites.
A new model of cluster computing has become widely popular, in which data-parallel computations are executed on clusters of unreliable machines by systems that automatically provide locality-aware scheduling, fault tolerance, and load balancing. MapReduce pioneered this model. These systems achieve their scalability and fault tolerance by providing a programming model where the user creates acyclic data flow graphs to pass input data through a set of operators. This allows the underlying system to manage scheduling and to react to faults without user intervention. While this data flow programming model is useful for a large class of applications, there are applications that cannot be expressed efficiently as acyclic data flows. The paper focuses on one such class of applications: those that reuse a working set of data across multiple parallel operations.
The main abstraction in Spark is the resilient distributed dataset (RDD), which represents a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Users can explicitly cache an RDD in memory across machines and reuse it in multiple MapReduce-like parallel operations. RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to rebuild just that partition.
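The lineage idea can be illustrated with a small toy sketch (this is not Spark's implementation; all class and method names here are hypothetical): each derived dataset records its parent and the transformation that produced it, so a lost partition is recomputed rather than restored from a replica.

```python
# Toy lineage-based recovery: a derived dataset remembers how it was made.
class ToyRDD:
    def __init__(self, parent=None, transform=None, source=None):
        self.parent = parent        # lineage: the RDD this one was derived from
        self.transform = transform  # the function used to derive it
        self.source = source        # base data, standing in for stable storage
        self.partitions = {}        # partition id -> materialized data

    def compute(self, pid):
        if pid in self.partitions:            # cached copy is available
            return self.partitions[pid]
        if self.source is not None:           # base RDD: read "stable storage"
            data = self.source[pid]
        else:                                 # derived RDD: recompute via lineage
            data = [self.transform(x) for x in self.parent.compute(pid)]
        self.partitions[pid] = data
        return data

base = ToyRDD(source={0: [1, 2, 3], 1: [4, 5, 6]})
doubled = ToyRDD(parent=base, transform=lambda x: 2 * x)

assert doubled.compute(1) == [8, 10, 12]
del doubled.partitions[1]                     # simulate losing that partition
assert doubled.compute(1) == [8, 10, 12]      # rebuilt from lineage, not replicas
```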
When a parallel operation is invoked on a dataset, Spark creates a task to process each partition of the dataset and sends these tasks to worker nodes. It tries to send each task to one of its preferred locations using a technique called delay scheduling. Once launched on a worker, each task calls getIterator to start reading its partition. Although the implementation of Spark was still at an early stage, the authors relate the results of three experiments that show its promise as a cluster computing framework.
This paper addresses the central question I had about MapReduce: it provides "iteration," which is commonly used in programming. However, this returns to the same concern: is it really worth proposing a new programming language for parallel computing?
This paper presents a new programming framework called Spark, which can support applications MapReduce could not while retaining the scalability and fault tolerance MapReduce provides. In short, Spark targets workloads that do not fit acyclic data flows (for example, reusing a working set of data across multiple parallel operations) while still achieving what MapReduce could. Spark introduces an abstraction called Resilient Distributed Datasets (RDDs) that helps achieve this goal. Spark provides two main abstractions for parallel programming: RDDs and parallel operations on these datasets. It also supports two restricted types of shared variables (broadcast variables and accumulators). The paper first provides an overview of Spark and an introduction to RDDs, along with some examples. It then describes the implementation and how Spark is integrated into Scala, as well as test results. Finally, it provides a discussion of related and future work.
The problem here is that frameworks like MapReduce support many applications, but not all. There are applications that cannot be expressed efficiently as acyclic data flows, for example those that reuse a working set of data across multiple parallel operations. Many machine learning algorithms apply a function repeatedly to the same dataset to optimize a parameter (through gradient descent), so a new framework is needed to handle these kinds of applications while retaining the features MapReduce provides.
The major contribution of the paper is a good framework for applications that cannot be expressed as acyclic data flows, retaining most of the features MapReduce provides. The most important component is the Resilient Distributed Dataset (RDD): a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. RDDs represent a sweet spot between expressivity on the one hand and scalability and reliability on the other.
One interesting observation: this paper is very intuitive, and the goal of the design is very straightforward. It is interesting that Spark allows programmers to create two restricted types of shared variables, broadcast variables and accumulators. These patterns are very common, and supporting them directly improves performance and convenience for programmers and applications. Sometimes it is good to consider programmers' needs in software design.
In this paper, the authors introduce Spark, a framework for creating applications based on the MapReduce model. The authors claim that previous MapReduce frameworks were not suitable for applications that reuse a set of data across multiple parallel operations, such as many machine learning algorithms and data analysis tools. They address this issue in Spark by introducing resilient distributed datasets (RDDs), which are sets of data objects partitioned across a set of machines.
I've used Spark in my projects, and what I like about it is its nice API design and its use of Scala as the main language; these two make creating apps really fun and easy. But in contrast with the authors' claim that Spark is usually faster than competitors like Hadoop, I've found Spark to be relatively slower most of the time.
This paper is something of a response to the MapReduce paper: it highlights some problems with that framework and proposes a different system to deal with them. The authors point out that there are many applications that involve making multiple passes over the data, such as machine learning algorithms. With MapReduce, this involves a lot of reading from and writing to disk. MapReduce is also not helpful for interactive queries: you have to program a job, have it scheduled, and wait for the results. Even using systems such as Pig and Hive to convert SQL into MapReduce jobs, there is significant overhead from reading everything from disk.
Spark seeks to handle this by taking advantage of memory. Since it does not rely on GFS or another fault-tolerant file system underneath it the way MapReduce does, Spark must handle faults using RDDs (resilient distributed datasets), which are read-only collections of objects partitioned across a set of machines, whose partitions can be rebuilt if one is lost. The system is built on top of Mesos, a so-called "cluster operating system".
This system is dramatically different from others we have seen in that it allows access to cluster computing from within a general-purpose programming language. Users write their programs in Scala and then use certain functions that take advantage of the cluster, such as reduce, collect, and foreach, as well as shuffle (a planned feature at the time of publication). Developers can also use broadcast variables and accumulators to perform their computation in parallel.
The authors claim a 10x speedup; however, it should be noted that this is on iterative machine learning workloads, something Spark was designed for and MapReduce is exceptionally bad at. The system does seem to perform very well compared to MapReduce, especially as there are more and more passes over the data. I do wonder how helpful Spark is for general-purpose computation, or whether the system really only performs well on programs similar to the examples they present. I also wonder about the cost of the JVM: the system is written in Scala, which runs on the JVM across a distributed cluster, and the overhead at each level could add up to something significant. It is (clearly) less overhead than reading and writing from disk, but still worth noting.
Problem and Solution:
Spark is proposed because MapReduce and its variants only provide acyclic data flow, which is quite limited. Also, when processing multiple queries, MapReduce can only do the work individually, even when the queries are based on similar data. The solution is to cache the data for reuse across iterations. Spark can support applications with non-acyclic data flows, like iterative jobs and interactive analytics in machine learning and data mining, while keeping the good properties of MapReduce such as fault tolerance, load balancing, and high scalability. In Spark, the user writes a driver program, which implements the control flow and runs operations in parallel. The key part of Spark is the resilient distributed dataset (RDD), one way to partition the dataset. Parallel operations work well on such datasets: Spark creates tasks to process the partitions and runs them on the worker machines. Delay scheduling is used to improve performance.
Spark provides a limited but efficient set of fault-tolerant distributed memory abstractions: resilient distributed datasets and restricted shared variables. An RDD is a read-only structure partitioned over worker machines. The dataset can be transformed with map and filter, and it can be cached across parallel operations. Parallel operations like reduce, collect, and foreach can be applied to RDDs, along with restricted shared variables such as add-only accumulators and broadcast variables. RDDs and parallel operations on them are the key abstractions of Spark.
Though Spark successfully addresses the problem of non-acyclic data flow, it still has some weaknesses, which lie mainly in the RDD. In the current version of Spark, the transformations and persistence options for RDDs are limited, which makes the model not flexible enough. It is also a significant shortcoming that RDDs cannot be updated.
This paper introduces Spark, a new cluster computing framework that supports applications with working sets. Spark focuses on applications that repeatedly reuse a working set of data across multiple parallel operations, by introducing the resilient distributed dataset (RDD).
An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Spark allows users to construct RDDs and provides fault tolerance through a "lineage" that allows the system to reconstruct data even if part of it is lost. Spark supports reduce, collect, and foreach operations on RDDs, along with shared variables such as broadcast variables and accumulators. The performance benefits of RDDs allow Spark to outperform MapReduce by 10x.
This paper introduces an interesting system, but it is a bit weak on the details of how RDDs are implemented, what exactly their benefits are, and what the downsides are. It shows an evaluation against MapReduce, but does not analyze the factors that allow Spark to perform better, or how well its performance scales as the data gets larger.
This paper introduces Spark, a variant of the MapReduce programming model that focuses on the performance of applications that reuse a working set of data across multiple parallel operations. Spark provides two main abstractions: resilient distributed datasets (RDDs) and parallel operations. An RDD is a chain of objects partitioned across machines, and each partition can be rebuilt from the bottom up by reading the raw data from disk (the input file). To construct such an object, Spark does not eagerly distribute data across machines but instead assigns a handle to the RDD, from which all its contents can be computed from scratch. Parallel operations can be performed on RDDs: reduce, collect, and foreach. Spark also contains shared variables: broadcast variables and accumulators. In addition, the paper introduces an implementation of Spark in Scala. The tests show that it can outperform Hadoop by an order of magnitude.
Spark exploits main memory efficiently: it can explicitly cache an RDD in main memory across machines. If the data is to be reused in the future, this approach saves much disk I/O and network shipping cost and achieves high processing speed, since main memory is much faster than disk. Spark is especially suitable for workloads that reuse a set of data, such as pattern search. Main memory keeps becoming larger and faster; as a result, Spark, which tries hard to utilize main memory, might become the trend.
However, Spark suffers from several drawbacks. Spark may consume large amounts of resources: because the user can explicitly make an RDD resident in memory, a single task may use a great deal of memory if it grows large and the user wants to keep everything in memory. On the other hand, Spark tends to consume extra CPU time: since Spark distributes handles rather than data across machines, RDD objects have to be recomputed whenever they are not in main memory, and this recomputation leads to more CPU usage.
MapReduce introduced a simple programming model that could be easily and automatically run on a distributed system, with automatic scheduling and fault tolerance. However, MapReduce jobs are one-and-done pipelines, which do not allow for efficient iterative processing or interaction from the user. Spark addresses these problems, providing the automation and fault tolerance benefits of MapReduce while supporting job types that would be hard to write in MapReduce.
Spark introduces a variable type called a Resilient Distributed Dataset (RDD), which is a variable representing a collection of data objects that may be distributed across many machines. RDDs can be defined (initialized) from files or other sources. Like regular variables, the RDD type has operators that can be used to manipulate it. RDD operators are parallel operations like foreach() (which applies a function to each object in the RDD) or reduce() (which calls an associative function to combine the elements of the RDD into some result). MapReduce's operations can be simulated using these operators. However, unlike in MapReduce, these RDDs act like variables and persist; therefore, they can be operated on in an iterative manner, as in a for loop. The user can give hints to the system that certain RDDs will be used again, prompting Spark to cache the RDD in memory.
Since RDDs are often created by manipulating other RDDs, each RDD can keep track of its lineage, or the “ancestor” RDDs that were involved in creating this RDD, and the source file that the data was first read from. This way, when a node fails, the RDDs know how to rebuild the data that was lost.
On iterative jobs, Spark performs much better than MapReduce because it can cache its results as an RDD and does not have to start a new job for each iteration. In interactive usage, Spark also achieves good performance, since it can cache its RDDs so that future queries execute quickly.
This paper clearly points out flaws in MapReduce's model. It explains the main concept of RDDs well, along with why they enable new kinds of jobs compared to MR.
The implementation section was surprisingly short; I feel like the authors glossed over a lot of implementation details that could have been interesting.
The paper introduces a new parallel programming framework called Spark. The framework is designed to support applications that reuse a working set of data across multiple parallel operations. These applications can also be processed with Hadoop MapReduce, but Hadoop was designed around an acyclic workflow, making it inefficient to process the same set of data in an iterative fashion.
To achieve the goal of supporting iterative parallel operations, the authors define a data abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects that can be reused through caching and that achieves fault tolerance by implementing a notion of lineage. The lineage of an RDD contains the information about how it was derived from other RDDs or from data on disk. This makes it possible to reconstruct an RDD on other nodes whenever the node that originally held it fails. The paper claims that Spark achieves a 10x speedup compared to Hadoop on iterative machine learning jobs.
The main contribution of this paper is a clear identification of a set of problems that cannot be solved efficiently using existing solutions (i.e., Hadoop) and a proposal of a new solution that works well for the defined problem. The authors identified the need for running iterative machine learning jobs in parallel very well, and Spark has become really popular.
The paper was written when Spark was still in an early stage of development. This led the paper to focus on examples of its applications and a high-level overview of Spark's architecture. As a result, I had the impression that the paper lacked detail in explaining how the architecture works. For example, I would like to know how a node determines whether a cached dataset exists for a given line, but I could not find a detailed explanation in the paper. The paper does a good job of helping readers understand Spark conceptually, but in my opinion it could have done better for readers interested in the actual implementation of the system.
To sum up, the paper defines its problem domain well and provides an efficient solution to the problem. Its data abstraction and underlying architecture are explained well conceptually, but readers who are interested in how the system is implemented in detail may not be satisfied.
This paper introduces the Spark framework as an improvement on MapReduce. MapReduce assumes that data flow is acyclic. However, this assumption does not hold in applications where iterative computations are performed on a dataset, or where a user wishes to query a dataset repeatedly. In both situations, MapReduce would re-perform its operations, and this results in significant overhead.
Spark introduces two key ideas to support these situations: resilient distributed datasets (RDDs) and shared variables. An RDD is a collection of read-only objects partitioned across machines that can be rebuilt if a partition is lost. The unique idea is that an RDD can be cached in memory to be reused in later MapReduce-like operations. Shared variables come in two types. Broadcast variables consist of data used in multiple parallel operations; when marked as broadcast, the driver that manages the computations copies the data only once to each worker, rather than copying it every time a task is assigned to a worker. An accumulator is a variable that workers can only add to and that can only be read by the driver. Accumulators can be used as global counters and are fault tolerant.
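The two shared-variable types described above can be sketched with a toy, single-process example (this is not Spark's API; the class and function names here are hypothetical): a broadcast value is a read-only copy shipped to each worker once, and an accumulator is write-only for workers and readable only by the driver.

```python
class Broadcast:
    """A read-only value that would be copied to each worker only once."""
    def __init__(self, value):
        self.value = value

class Accumulator:
    """An add-only counter: workers add, only the driver reads the total."""
    def __init__(self):
        self._total = 0
    def add(self, amount):      # workers may only add
        self._total += amount
    def value(self):            # read back at the driver
        return self._total

lookup = Broadcast({"a": 1, "b": 2})   # large read-only table, sent once
misses = Accumulator()                 # global counter of failed lookups

def worker_task(records):
    out = []
    for r in records:
        if r in lookup.value:
            out.append(lookup.value[r])
        else:
            misses.add(1)              # count records with no match
    return out

results = worker_task(["a", "x", "b", "y"])
assert results == [1, 2]
assert misses.value() == 2
```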
The combination of RDDs and shared variables gives Spark the ability to support iterative computations and multiple queries without having to reload the raw data from disk as would be the case in MapReduce. I think this paper is a good patch to the MapReduce framework, which was designed to only look at dataflow in one direction. However, it seems as if it should have also considered cyclic data flows. An interesting idea would be to combine the ideas in both frameworks to create a more general purpose programming interface that allows users to support any type of data flow. There may be situations where it would be beneficial to first use MapReduce to get the initial data partitions and then apply Spark to the output of MapReduce or vice versa.
This paper proposes Spark, a new framework that is flexible enough to support other popular applications while retaining the features of MapReduce. The paper notes that Spark focuses on efficiently handling applications that reuse a working set of data across multiple parallel operations, whereas the MapReduce programming model is specialized for acyclic data flow graphs in which input data passes through a set of operators.
The Spark programming model includes three data abstractions: "resilient distributed datasets (RDDs)" and two shared variables, the "broadcast variable" and the "accumulator"; some additional parallel operations, such as "foreach," are also proposed.
The experiment performed using an iterative task (logistic regression) shows that although Spark is slightly slower than Hadoop in the first iteration, it is dramatically faster than Hadoop in the following iterations. This is because Spark reuses cached data while Hadoop does not.
1. This paper gives an overview of Spark and provides several example applications to demonstrate how Spark would be useful.
2. This work integrates Spark into the Scala interpreter, allowing future users to interactively process large datasets on clusters using an efficient, general-purpose programming language.
1. Based on this paper, Spark seems to consume a lot of memory compared to Hadoop. The paper does not consider memory usage in its experiments and only compares running time.
Purpose of Spark
The motivation behind Spark is to allow the reuse of a working set of data across multiple parallel operations, such as in machine learning algorithms and data analysis tools, while preserving the scalability and fault tolerance of MapReduce. Spark does this using resilient distributed datasets (RDDs), which are read-only collections of objects partitioned across a set of machines that can be rebuilt if a partition is lost. This is important because MapReduce, while successful in implementing large-scale data-intensive applications on commodity clusters, is not as suitable for applications that do not fit an acyclic data flow model. This paper describes the implementation of Spark along with examples.
Details of Spark
Spark lets programmers construct RDDs in four ways: from a file in a shared filesystem, such as the Hadoop Distributed File System (HDFS); by parallelizing a Scala collection in the driver program; by transforming an existing RDD from one type to another; or by changing the persistence of an existing RDD. By default an RDD is materialized on demand in a parallel operation and discarded from memory after use; the cache action keeps the dataset in memory after it is first computed, and the save action evaluates the dataset and writes it to a distributed filesystem.

The parallel operations that can be performed on RDDs are reduce, which combines dataset elements using an associative function to produce a result at the driver program; collect, which sends all elements of the dataset to the driver program; and foreach, which passes each element through a user-provided function. Spark also provides the user with shared variables: broadcast variables, which wrap large read-only data and ensure it is copied to each worker only once, and accumulators, which workers can only add to using an associative operation. One example of a Spark program is text search, which counts the lines containing errors in a large log file stored in HDFS by creating a dataset of the lines containing errors, mapping each line to a 1, and then adding up the ones using reduce. Spark stands out because it can make intermediate datasets persist across operations.

Spark is built on Mesos, a cluster operating system that allows fine-grained sharing among many parallel applications and provides an API for applications to invoke tasks on the cluster. Each RDD object implements getPartitions, which returns a list of partition IDs; getIterator(partition), which iterates over a partition; and getPreferredLocations(partition), which is used to schedule tasks for data locality.
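The text search example described above can be paraphrased in a few lines of plain Python (a toy sketch over an in-memory list, not Spark's Scala API or a real HDFS file): filter the lines containing errors, map each surviving line to a 1, and sum with an associative reduce.

```python
from functools import reduce

# Hypothetical stand-in for a log file loaded from a distributed filesystem.
log_lines = [
    "INFO starting up",
    "ERROR disk failure",
    "INFO heartbeat",
    "ERROR timeout",
]

errs = [line for line in log_lines if "ERROR" in line]   # filter transformation
ones = [1 for _ in errs]                                 # map each line to a 1
count = reduce(lambda a, b: a + b, ones, 0)              # associative reduce

assert count == 2
```

In Spark, each intermediate dataset (the filtered lines, the ones) would be an RDD, and caching the filtered lines would let later queries skip re-reading the log from disk.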
Spark is integrated into the Scala interpreter by making the interpreter output the classes it defines to a shared filesystem, from which they can be loaded by the workers using a custom Java class loader, and by changing the generated code so that the singleton object for each line references previous lines directly, bypassing the getInstance methods. In the experimental results, Spark outperforms Hadoop by 10x on iterative machine learning jobs and can query a 39 GB dataset with sub-second response time.
Strengths of the paper:
I liked that the paper was concise while still providing a comprehensive understanding of the advantages of Spark, its implementation, and its applications. I also liked that the application examples, logistic regression and alternating least squares, are prominent real-world machine learning applications.
Limitations of the paper:
I would've liked to see an implementation of a grouped reduce operation, so that reduce results could be collected at more than one process. It would have been interesting to see whether more complex machine learning applications, such as letter recognition, could be implemented with grouped reduce functionality.
This paper is about Apache Spark, which was inspired by MapReduce. MapReduce provides scalability and fault tolerance but works best for acyclic jobs. Spark shows that cyclic jobs can be implemented more effectively through resilient distributed datasets, which change how data is stored and cached within nodes of the cluster to reduce runtimes. The paper describes iterative jobs, for example machine learning applications, and interactive queries, such as those sent from an interpreter. The paper examines how Spark is implemented using Scala and Mesos and looks at a few example programs. The programs are shown to run much faster than on Hadoop, and the paper concludes by summarizing and explaining future work, most of which has been implemented today.
This paper has some clear strengths. It presents the paradigms of MapReduce and where they fall short. This motivates the paper and leads to clear reasoning about the solution: implementing abstractions to support cyclic data flows on a distributed framework while maintaining scalability and fault tolerance. The improvements over Hadoop are in ML applications and interactive jobs, and the paper concludes with remarks about future work to provide more control and functionality to users.
There are not a lot of drawbacks to this paper. The paper is clear about the contribution it makes to cluster computing jobs that use working sets. The authors are aware that although they give some great examples of jobs appropriate for Spark, they don't do a great job of characterizing the types of workloads that work better with RDDs; some of the original authors have published a paper this year that explores this question in more detail. The authors provide only the Scala interface with this paper, but later produced Spark SQL and interfaces for languages such as R and Python to address this issue. Furthermore, the paper doesn't allow reduce results to be easily regrouped using group-by and join operations; this was also later implemented as a "shuffle" operation. All of these limitations in the original paper have been addressed by later publications.
Part 1: Overview
MapReduce and Dryad, the well-known distributed data-flow computation frameworks, lack support for iterative applications: there is increased overhead in iterating over intermediate results, so they are not a good fit for applications that do not match the acyclic model. To solve this problem, the paper presents a distributed memory abstraction called resilient distributed datasets (RDDs), through which users can explicitly cache data in memory across machines and reuse those caches in multiple MapReduce-like parallel operations, avoiding the performance bottleneck when multiple iterations are needed. An RDD is a read-only collection of objects partitioned across machines.
Spark is the major result of this paper. It is an open-source cluster computing system that boosts the performance of multi-iteration, MapReduce-like programs by adopting an in-memory programming model, while also providing a Scala API to help programmers develop their applications. The authors claim that Spark can outperform Hadoop by 10x on iterative machine learning jobs.
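As a concrete illustration of the iterative workload Spark targets, here is a toy logistic-regression-style gradient descent in plain Python (not Spark code; the data points, step size, and iteration count are all invented for illustration). The point is structural: every iteration rescans the same training points, so keeping them in memory, rather than reloading them as a fresh MapReduce job would, removes the dominant cost.

```python
import math

# Toy 1-D logistic regression by gradient descent. Each iteration
# rescans `points`; the data is loaded only once but reused 100 times.
points = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]  # (feature, label)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w = 0.0
for _ in range(100):                       # every pass reuses the in-memory data
    grad = sum((sigmoid(w * x) - y) * x for x, y in points)
    w -= 0.5 * grad                        # fixed step size

# The learned weight separates the two classes (positive w here).
```

In MapReduce, each of the 100 passes would be a separate job with its own load stage; in Spark the cached dataset is scanned directly.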
Part 2: Contributions
Spark is easy to adopt, and many developers can keep their investment in existing big-data mining tools without touching the hardware layer. Instead of purchasing new machines, they can simply run their applications on Spark. Spark can easily be used for text search, logistic regression, and alternating least squares at large scale.
The authors provide an experimental evaluation comparing themselves with Hadoop. This shows that their caching scheme and the absence of Hadoop/HDFS overhead result in a much more efficient implementation. Microbenchmarks show the ability to degrade gracefully when there is insufficient memory for caching, and that fault tolerance really works for a simple fault.
Part 3: Drawbacks
Although some details and test results are provided, the authors do not give many details about the RDD system itself. It would be easier for readers to understand the system if they showed some figures of the anatomy of RDDs and Spark. The Spark API is in Scala; providing it in Java as well, just like HDFS, would make it easier and more general to use.
The paper discusses a cluster-based parallel programming model called Spark. Spark is optimized for applications that reuse a working set of data across multiple parallel operations. Spark allows users to construct read-only collections of objects called resilient distributed datasets (RDDs). Users can explicitly cache an RDD in memory across machines and reuse it in multiple parallel operations, such as reducing dataset elements using an associative function. For iterative and interactive applications, caching the RDD and accessing it from memory instead of disk makes execution faster than MapReduce-style execution, where all intermediate data is materialized to disk. If there is not enough memory in the cluster to cache all partitions of a dataset, Spark will recompute them when they are used. This allows Spark to continue working, at reduced performance, if a dataset is too big to fit in the aggregate memory of the cluster.
In addition, Spark provides fault tolerance by maintaining enough information about how each RDD was derived from other RDDs. Based on this information, Spark can recompute any lost RDD, guaranteeing fault tolerance like MapReduce. Furthermore, Spark provides two specific kinds of shared variables, broadcast variables and accumulators, which capture usage patterns common to different applications. A broadcast variable distributes a large piece of read-only data (e.g. a lookup table) to workers only once instead of sending it with every function, whereas accumulators let workers apply an associative operation (e.g. add, to implement a counter) to a variable that can then be read by the driver/master.
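The behavior of the two shared-variable types can be mimicked in a few lines of plain Python (the class names and the worker function here are invented for illustration; this is not Spark's API):

```python
# Minimal simulation of Spark's two shared-variable types: a broadcast
# value is shipped to each worker once and read many times; an
# accumulator is add-only on workers and read back on the driver.
class Broadcast:
    def __init__(self, value):
        self.value = value          # read-only payload, sent once

class Accumulator:
    def __init__(self, zero):
        self._value = zero
    def add(self, delta):           # the only operation workers may use
        self._value += delta
    @property
    def value(self):                # read back on the driver
        return self._value

lookup = Broadcast({"a": 1, "b": 2})
errors = Accumulator(0)

def worker(records):
    for r in records:
        if r not in lookup.value:   # cheap check against broadcast data
            errors.add(1)

worker(["a", "b", "x"])
worker(["y", "a"])
# The driver now reads errors.value to see how many records were unknown.
```

Because addition is associative, the order in which workers' updates arrive does not affect the final value the driver reads.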
The main strength of the paper is that Spark addresses application types that are not effectively supported by the MapReduce programming model. MapReduce is inefficient for iterative and interactive applications: it has to materialize its intermediate data to disk, whereas Spark can cache that data in memory and access it faster in the next iteration. In addition, I found the technique used to make Spark fault tolerant interesting: Spark records how each RDD was constructed so that lost data can be rebuilt.
The main limitation of Spark is that it is applicable only to a certain range of applications (iterative and interactive ones); applications must fit Spark's restricted parallel programming model. This limitation is similar to MapReduce's, though Spark adds some classes of applications that MapReduce does not support directly. In addition, Spark does not support grouped reduce operations as MapReduce does: reduce results are aggregated only at one process, i.e. the driver/master. For workloads with compute-intensive reduce stages, this single reduce operation would become a bottleneck and increase total execution time.
This paper presents a new cluster computing framework, Spark, which enables reusing a working set of data across multiple parallel operations. While retaining the scalability and fault tolerance of MapReduce, it efficiently supports cyclic data flows, which MapReduce and its variants cannot.
The key notion of the paper is the RDD, or resilient distributed dataset:
1. read only
2. collection of objects
3. partitioned across machines
4. can be rebuilt if a partition is lost
5. not a general shared memory abstraction
6. sweet spot between expressivity & scalability
7. lazy, built on demand
Beyond that, the following concepts are key to the design. The driver program:
1. implements the high-level control flow
2. launches operations in parallel
2 main abstractions:
1. resilient distributed datasets
2. parallel operations
2 restricted types of shared variables:
1. broadcast variables, passed to each worker only once
2. accumulators, which workers can only add to
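The listed RDD properties can be made concrete with a toy sketch in plain Python (illustrative class and method names, not Spark's real interface): transformations only record lineage, partitions are computed lazily on demand, and a lost partition is rebuilt by replaying the lineage.

```python
# Toy RDD capturing the listed properties: read-only, partitioned,
# lazy, and rebuildable from lineage.
class ToyRDD:
    def __init__(self, compute):
        self._compute = compute     # lineage: how to (re)build each partition
        self._cache = {}
    def map(self, f):               # transformation: records lineage, no work yet
        return ToyRDD(lambda p: [f(x) for x in self._compute(p)])
    def partition(self, p):         # computed (or recomputed) on demand
        if p not in self._cache:
            self._cache[p] = self._compute(p)
        return self._cache[p]
    def evict(self, p):             # simulate losing a cached partition
        self._cache.pop(p, None)

base = ToyRDD(lambda p: list(range(p * 3, p * 3 + 3)))  # partitions 0, 1, ...
doubled = base.map(lambda x: x * 2)

assert doubled.partition(1) == [6, 8, 10]
doubled.evict(1)                    # "lost" partition...
assert doubled.partition(1) == [6, 8, 10]   # ...rebuilt from lineage
```

Nothing is stored persistently here; resilience comes entirely from being able to replay the recorded computation, which is the "sweet spot" the list refers to.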
It explains where Spark can be more useful than MapReduce, and there are 3 examples that demonstrate the advantages well.
Even though the definition of RDDs is clear, the paper is poorly organized: the purpose of RDDs is not mentioned at the beginning, which confused me for a while. The problem is not stated clearly either. It seems at the beginning that the only drawback of MapReduce is the disk-read penalty between iterations, and that one task of Spark is simply to cache the intermediate data to avoid such reads.
This paper describes a new parallel data processing programming model called Spark. Spark is inspired by MapReduce and solves some deficiencies that MapReduce bears.
Problems of MapReduce:
MapReduce is a great programming model, but its design causes problems for certain types of jobs. MapReduce has only two stages and requires programmers to run multiple MapReduce jobs if they need more. This causes problems because between each job the whole dataset is written to disk and read back again. Because of this property, two scenarios in which MapReduce is deficient are iterative jobs and interactive analytics.
Spark extends the idea of functional programming further. The central concept of Spark is the resilient distributed dataset (RDD). Data files are read into memory by creating an RDD. The programmer can then apply transformations such as map, flatMap, or groupByKey, which solves the problem of limited transformations. Each of these transformations creates a new RDD, and one can keep an RDD in memory, which avoids the repeated disk writes and reads of MapReduce.
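A plain-Python sketch of this chain (not Spark's API; the "file" and its contents are invented) shows why caching the loaded data matters: several derived passes run over it, but the expensive load happens only once.

```python
# Count how often the "file" is read to show the effect of caching.
reads = {"count": 0}

def load_file():                       # stands in for reading a big HDFS file
    reads["count"] += 1
    return ["to be or", "not to be"]

class CachedDataset:
    def __init__(self, load):
        self._load, self._data = load, None
    def rows(self):
        if self._data is None:         # first access loads; later ones hit memory
            self._data = self._load()
        return self._data

lines = CachedDataset(load_file)
words = [w for line in lines.rows() for w in line.split()]   # flatMap-like pass
hits  = [line for line in lines.rows() if "be" in line]      # second pass, no reload
```

In a chained-MapReduce version of the same two passes, each job would reload the dataset from disk, which is exactly the cost the review says Spark avoids.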
Spark is implemented in Scala, a statically typed language that runs on the JVM. Scala has good support for functional programming, which makes Spark programming much easier.
This paper extends the idea of MapReduce further and can be seen as a generalization of it. It defines a set of operations commonly seen in functional programming, thus harnessing the power of that style. Another contribution is that it uses memory to store intermediate computation results, making a sequence of operations much faster.
Although the idea presented in this paper is brilliant, it could be hard to grasp this idea simply by reading this paper. A lot of useful details are missing.
Spark specifically targets the observation that processed data in machine learning is often reused, and that vanilla MapReduce has no way to use that data repeatedly without loading it again. The downside is especially visible when running multiple kinds of queries across the same data, for which Hadoop must reload it every time, incurring a non-trivial startup cost. Spark introduces the resilient distributed dataset, essentially a read-only collection of objects partitioned across multiple nodes that can be rebuilt.
RDD implementations differ in how their partitions and preferred locations are determined. Spark has two kinds of shared variables: broadcast variables and accumulators. A broadcast variable and its value are stored in the shared file system; when a broadcast variable is referenced, the system first checks whether the value exists locally and otherwise reads it from the shared location. Accumulators, on the other hand, are given a unique id on each of the workers where they are used. At the end of a task, the workers send a message to the driver program with their updates to the accumulator variable, and the driver applies each update only once.
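The apply-once rule for accumulator updates might be sketched like this (a guess at the mechanism's shape in plain Python; `DriverAccumulator` and the task ids are invented names): each update carries a task id, and the driver ignores duplicate deliveries, as happen when a failed task is re-executed.

```python
# Driver-side accumulator that applies each task's update exactly once.
class DriverAccumulator:
    def __init__(self):
        self.total = 0
        self._seen = set()
    def apply(self, task_id, delta):
        if task_id in self._seen:   # duplicate delivery from a re-run task
            return
        self._seen.add(task_id)
        self.total += delta

acc = DriverAccumulator()
acc.apply("task-1", 5)
acc.apply("task-2", 7)
acc.apply("task-1", 5)              # re-executed task: ignored by the driver
```

Without this deduplication, a task that is speculatively or fault-tolerantly re-run would double-count its contribution.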
One of the strongest contributions of Spark is that it can rebuild the partitions of an RDD using lineage, i.e. it keeps enough information about how an RDD was derived from other RDDs to rebuild it. Also, the lazy computation of RDDs means that not all data need be loaded right at the beginning; Spark computes an RDD only when required. Since Spark exposes familiar functional operators through its Scala API, it would be easy for someone to transition to writing Spark code. Also, each RDD carries metadata describing how it was derived, which is clearly a difference in comparison to typical MapReduce.
One disadvantage is that Spark may prove wasteful of resources for one-time processing of data. Also, since it operates specifically in memory, if the data is too large to fit in memory, Spark cannot provide the advantages it otherwise would.
Overall, Spark would be a great choice for computationally intensive and iterative experiments.
Spark: Cluster Computing with Working Sets
This paper addresses a real-world problem with the existing tools for parallelizing large-scale data processing. As introduced in the MapReduce paper, the previous framework supports only one-shot runs of map and reduce stages; there is no way to efficiently run a machine learning algorithm that accesses shared data multiple times under the MR framework. Because the data must be reloaded every time a new iteration starts, the latency of disk reads almost dominates the run time.
The paper's solution to this problem is a new cluster computing framework called Spark, which dramatically improves the runtime of algorithms with cyclic data flow by keeping a working set of data in memory across multiple parallel operations. Two concepts, resilient distributed datasets (RDDs) and shared variables, are introduced to achieve this. RDDs can be loaded and computed a single time and then accessed by worker machines several times. Spark also introduces accumulators, shared variables that allow add-only data aggregation across multiple worker nodes, and broadcast values, which provide a shared, read-only, persistent piece of data to every closure.
The major strengths of this paper are, firstly, that it allows efficient execution of algorithms such as logistic regression that repeatedly use a working data set, since subsequent iterations over the data set need not read from disk. Secondly, it supports interactive queries on the data, since the data can be kept in memory. For example, the authors cache a 39 GB Wikipedia snapshot and can query it in less than a second after the first query, since it is already in memory.
However, there are still some weaknesses in this paper. The most important is that it is incomplete. The only information provided about fault tolerance is that RDDs can be reconstructed when they are lost; the paper never says how this mechanism is implemented, let alone evaluates its performance, so users cannot understand the cost of a system failure. The other question is that the logic of concurrent execution of many tasks is never mentioned, and readers will care about this issue because the RAM can be consumed by one single big task while other tasks starve.
MapReduce is built around an acyclic data flow model and is not suitable for some machine learning algorithms or interactive data analysis tools that reuse a working set across multiple parallel operations. This paper introduces a new framework named Spark that can deal with such problems.
Spark introduces an abstraction called the resilient distributed dataset, a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. The user can explicitly cache an RDD in memory across the machines and reuse it. This abstraction reduces the overhead of writing data out to disk and reloading it, while retaining automatic fault tolerance, scalability, and locality-aware scheduling. Spark also has two kinds of shared variables: broadcast variables, which are distributed to the workers only once instead of being copied with every closure, and accumulators, which support only an "add" operation.
The paper then shows three experimental results demonstrating that Spark can outperform Hadoop by 10x on iterative computations.
The paper introduces a new programming framework that outperforms MapReduce for interactive computations, explains how this model addresses MapReduce's problems (e.g. through caching), and provides some program samples.
I think that, since Spark relies on caching, the memory limitation will be a problem: Spark runs on computer clusters, and increasing memory across a cluster would be very expensive. The paper does not clearly explain how Spark deals with a lack of memory, which seems critical.
The paper presents a new data abstraction called resilient distributed datasets (RDDs) along with an implementation of a system to run computation using this abstraction called Spark. The motivation for this work is to address shortcomings in prevailing systems, namely MapReduce. MapReduce is ill-suited for a number of workloads, most notably pipelined and iterative computation. Spark is built to address these needs by using a more general programming model and using lazy evaluation to optimize computation. Finally, the authors demonstrate that Spark provides massive runtime improvements over Hadoop MapReduce.
The key insight made by the authors of this paper is that by combining the use of functional programming operators with lazy evaluation, Spark can optimize computation placement and provide fault-tolerance. Data transformations do not need to be applied when requested, so instead Spark delays their execution until a parallel operation is run (or the RDD is materialized). This allows Spark to eliminate excessive IO and provide a more general interface than MapReduce.
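The delay-until-action idea can be illustrated with ordinary Python generators (a loose analogy, not Spark's implementation): building the pipeline does no work, and the terminal operation pulls elements through without materializing any intermediate dataset.

```python
# Lazy pipeline: transformations are wrapped but nothing executes
# until a terminal operation pulls the data through.
log = []

def source():
    for i in range(5):
        log.append(("read", i))     # would be IO in a real system
        yield i

def mapped(it):
    for x in it:
        yield x * x                 # applied element-by-element; no full
                                    # intermediate dataset is materialized

pipeline = mapped(source())         # building the pipeline does no work yet
assert log == []                    # still lazy
result = sum(pipeline)              # the "action" drives the whole pipeline
```

This is the mechanism that lets a system fuse transformations and skip writing intermediate results to disk, as the review describes.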
Spark demonstrates that more general programming models can be used for distributed data computation besides MapReduce, and that by running computation in-memory, latency can be drastically reduced. However, this paper has some shortcomings:
1. Spark runs all computations in-memory. Some datasets are simply too large for this and must spill over to disk. Assuming that Spark eventually adds support for this, how will Spark avoid performance thrashing due to repeated data movement between disk and memory?
2. The properties of RDDs are not fully characterized in this paper.
3. This paper does not explore the limits of Spark's applicability to real workloads. In other words: RDDs are not suitable for all workloads, but what important workloads are RDDs unsuitable for?
This paper introduces Spark, a new framework that supports MapReduce-style computation while also supporting a working set of data that can be reused across operations. The traditional MapReduce model is insufficient in the following ways: iterative jobs cannot take advantage of using the same data set for each iteration, and interactive analytics likewise has to load the same data set for each query. Spark addresses these deficiencies with the resilient distributed dataset, or RDD.
Spark allows the user to create these RDDs so that they can be reused for each query. Some of the parallel operations that can be performed on these RDDs are reduce, collect, and foreach. For example, when doing a text search, we can cache the dataset so that we can run different reductions on the same text search with a RDD.
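A plain-Python sketch of this pattern (illustrative only, not Spark's API; the log lines are invented) filters the data once into a cached "errors" dataset and then runs several different reductions over it without rescanning the raw input:

```python
# Cached text-search pattern: filter once, query many times.
raw_log = ["ERROR disk full", "INFO ok", "ERROR net down", "INFO ok"]

# One filtering pass, whose result would be cache()d in Spark.
errors = [line for line in raw_log if line.startswith("ERROR")]

count = len(errors)                               # a reduce-style query
subsystems = [l.split()[1] for l in errors]       # a collect-style query
```

Each additional query touches only the small cached result, which is what makes repeated interactive searches cheap.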
Using this intermediate caching, we get better results than Hadoop does on iterative jobs. For logistic regression, the first iteration of Spark runs a little slower than the first iteration of Hadoop because the data set must also be cached. However, every subsequent iteration reuses the data set, so jobs run up to 10x faster than Hadoop jobs.
Overall, this paper introduces a new technique into the industry that allows us to run MapReduce jobs a lot faster by caching data sets. However, I have some concerns about the paper:
1. In the comparison portion, the authors mention that the job slows down by about 20% on average when a node crashes while a job is running and that smaller block sizes can help minimize this slowdown. Is there any way to somehow maintain multiple caches so that when a node does crash, we are still able to run the job without that much of a slowdown?
2. If we have a data set that is very similar but not exactly the same as the previous data set used, is there any way to still use a cache to speed up that MapReduce job?
This paper discusses Spark, a framework that achieves the scalability and fault tolerance of MapReduce, but provides additional features to improve performance for applications that reuse a working set of data across parallel operations. MapReduce dramatically simplifies the programming of parallel and distributed applications, but its performance suffers on applications such as iterative machine learning algorithms and interactive analytic queries, since the working set of data must be loaded from disk by a separate MapReduce job for each query. By offering the ability to retain persistent datasets, Spark can achieve up to a 10x performance gain over MapReduce systems such as Hadoop for these kinds of applications.
Spark’s main contribution is the Resilient Distributed Dataset (RDD), a read-only collection of objects distributed across the cluster. RDDs are not necessarily maintained in memory at all times, but contain enough metadata to be able to compute their dataset from data in reliable storage. This allows for easy reconstruction of RDDs when nodes fail and also allows the RDDs to be loaded lazily when they are needed. Users can cache an RDD, which keeps the dataset in memory after the first time it is computed, unless it is too large to fit in main memory. They can also save the dataset, which calculates the RDD and saves it in a file for use in later queries.
Spark also solves the problem of repeatedly passing large variables to MapReduce jobs within closures. When a large piece of read-only data is going to be used in many subsequent queries, the user can declare it to be a broadcast variable. This variable is then cached by the worker so that it does not need to be sent across the network on every query. Accumulators work in a somewhat similar fashion. These are “add-only” variables that can be used to implement counters, sums, etc.
My chief concern with this paper is that it does not compare Spark to an existing MapReduce framework on any jobs other than those it was designed to specialize in. The experimental results clearly support Spark for the types of applications mentioned above, but a company with a mixed workload would likely want to compare the performance of two systems on a typical MapReduce job as well. If Spark performs well on specialized applications, but has some significant overhead for typical jobs, the benefits may be outweighed by the costs for certain workloads.
This paper introduces Spark, which supports various applications beyond MapReduce while retaining its scalability and fault tolerance. MapReduce has been successful at processing acyclic data flows, but many useful applications do not fit this model, such as various iterative machine learning algorithms. This paper proposes a new framework called Spark that supports these iterative algorithms and uses it to run applications that reuse a working set of data across multiple parallel operations.
First, the paper describes Spark's programming model and the resilient distributed dataset (RDD). Spark has a driver program that implements the high-level control flow of the application and launches operations in parallel. The two main abstractions Spark provides are resilient distributed datasets and parallel operations. An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. An RDD can be constructed in four ways: from a file, by parallelizing a collection, by transforming an existing RDD, or by changing the persistence of an existing RDD. The parallel operations that can be performed on RDDs are reduce, collect, and foreach. Spark also provides shared variables for users, namely broadcast variables and accumulators.
Second, the paper describes the implementation of Spark. Spark is built on top of Mesos, a cluster operating system that lets multiple parallel applications share a cluster and provides an API for applications to launch tasks on it. Each RDD object implements the same simple interface, consisting of getPartitions, getIterator, and getPreferredLocations. When a parallel operation is invoked, Spark creates a task to process each partition of the dataset and sends these tasks to worker nodes. The paper runs three experiments using the Spark framework: logistic regression, alternating least squares, and an interactive Spark session.
The strength of this paper is its clear organization in introducing Spark. It first presents the motivation and the programming model of Spark, then describes the implementation and some experimental results. This flow makes clear to readers why we need it and how we can use it.
The weakness of the paper is that it does not provide many examples when illustrating ideas related to Spark. Since Spark is new to many readers, I think it would be better to provide examples when explaining the model and the implementation of Spark.
To sum up, this paper introduces Spark, which supports various applications that are not suitable for MapReduce's acyclic data flow model, such as many machine learning algorithms.
This paper is an introduction to Spark, an extension to MapReduce that handles the subset of problems where the working set of data needs to be reused in the next iteration. MapReduce did not handle this, as it was a straight-through process where the data would not change and no next iteration would be needed.
A big focus of this paper is Resilient Distributed Datasets (RDDs), read-only collections of objects that are partitioned across a set of machines and can be rebuilt if a partition is lost. This means that in the event of a failure, all data can be rebuilt from data that is reliably in storage, even though the RDD data itself does not exist in physical storage.
The implementation honestly appears to be very similar to MapReduce but with different function names (like EECS 280 students who copied each other's projects but don't want to get caught so they change variable names haha).
The strength that Spark offers is that it is much faster at iterations than MapReduce due to its caching setup. The results section shows that it is slower for a one-iteration problem (than Hadoop) but around an order of magnitude faster once around 30 iterations are needed for convergence.
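The crossover described here can be captured with a toy cost model (the numbers are invented purely for illustration): Hadoop pays the load cost on every iteration, while Spark pays it once and only the compute cost afterwards.

```python
# Back-of-envelope model of iterative job cost (made-up numbers).
load, compute = 100.0, 10.0          # per-iteration load vs. compute cost

def hadoop_time(iters):
    # every iteration is a fresh job: reload the data, then compute
    return iters * (load + compute)

def spark_time(iters):
    # iteration 1 loads and caches; later iterations only compute
    return (load + compute) + (iters - 1) * compute

speedup_30 = hadoop_time(30) / spark_time(30)
```

For one iteration the two models cost the same (Spark even slightly more in practice, due to caching overhead), but as the iteration count grows the speedup approaches the load-to-compute ratio, consistent with the order-of-magnitude gap at ~30 iterations.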
There were two main weaknesses I saw in this paper:
1.) I do not think the paper did a good job of explaining what types of problems need this and why MapReduce can't be used. This was crucial, since the whole point of Spark is to use it on things that don't work well in MapReduce, yet the text-search example they gave seemed very similar to what was used in MapReduce. I think this was a major weakness of the paper, though maybe it was just me not understanding it.
2.) The visuals were weak for this paper. I don’t think they really added to the explanation (especially the pseudocode) and I think they could have done a much better job with that.
Overall, I think this was a decent paper, but it could have done a better job of explaining the need for Spark. The technical differences from MapReduce seemed relatively small to me, and I wasn't clear on what the breakthrough idea of Spark was that didn't already exist. The results they showed at the end were very impressive, Spark appears to be very useful, and the paper was a success, but I felt it could have been better.
In the same vein as MapReduce (and a system named Dryad is also mentioned), Spark aims to provide a layer of abstraction for large data processing in a streaming-network model. These previous technologies offered a way to perform "one-shot" computing, i.e. data flowing in one direction. However, many applications (such as industrial machine learning algorithms like k-means, regression, and expectation maximization, or data mining) require data to be reused, even across parallel iterations of some data processing task. Spark provides a system that takes this into account, while retaining failure tolerance and scalability (much like the MapReduce architecture).
Spark relies mainly on the idea of Resilient Distributed Datasets (RDDs), which are in effect "parallel" arrays of data, loaded from distributed file systems like Hadoop's. These are lazily constructed pieces of read-only data distributed across multiple partitions, which can be reconstructed from files, from the effects of computation on other RDDs/operations, or from cached versions of previous data (hence the resilience). On that last point: by caching as much of the data as possible, Spark greedily optimizes computation by going through cached memory first on repeated or subsequent computations. RDDs also lend themselves handily to map/reduce/counting-style operations, and can even be persisted to disk should a higher level of stringency be needed.
Essentially, Spark looks like a handy tool, but one of the troubles I had was wrapping my head around why it was presented as so amazing and novel. The two main points seem to be that it provides a good abstraction for distributed computation a level above MapReduce, and that simply caching what you are working on provides a significant improvement. That being said, the way I interpreted everything, it just looks like MapReduce with caching sprinkled on top.
This paper explains Spark. The motivation behind it is that MapReduce is still considered deficient when running iterative jobs (applying a function repeatedly to the same dataset) and interactive analysis (running separate exploratory queries on a dataset).
The paper describes Spark's programming model. Spark provides three simple data abstractions for programming clusters: resilient distributed datasets (RDDs) and two restricted types of shared variables (broadcast variables and accumulators). An RDD represents a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost, through the notion of lineage. It can be constructed in four ways: from a file in a shared file system, by parallelizing a collection in a driver program, by transforming an existing RDD, or by changing the persistence of an existing RDD. Parallel operations (reduce, collect, foreach) can be performed on RDDs. A broadcast variable is used when a large read-only piece of data is needed in multiple parallel operations. An accumulator is a variable that can only be added to, used for associative operations, and only the driver can read it (it is also fault tolerant, since the only operation is "add"). Next, the paper gives examples of Spark programs by showing code for logistic regression and alternating least squares. The following section explains Spark's implementation (in Scala), at whose core is the implementation of RDDs. Datasets are stored as a chain of objects capturing the lineage of each RDD, which can be used to rebuild the datasets when failures happen. If a node fails, its partitions are re-read from their parent datasets and eventually cached on other nodes. Just as in MapReduce, tasks are shipped to workers. There is a brief explanation of how to integrate Spark into the Scala interpreter. The next section shows the results of the implementation and compares them with Hadoop's for logistic regression, alternating least squares, and interactive queries. Last, the paper discusses related work and how it inspired or compares to Spark.
One major contribution of the paper is presenting a framework (Spark) that supports applications with a working set of data across multiple parallel operations while retaining MapReduce's scalability and fault tolerance. Relying heavily on memory, Spark helps a lot when you need a fast analytical tool that processes streams of data (unlike MapReduce, which copies to local disk, making the process slower compared to Spark).
However, Spark relies a lot on in-memory processing. There is always the possibility that data is so big it does not fit in memory; how would that affect Spark's performance? The memory block size could be enlarged, but a bigger block size would mean a longer recovery time in case of failed nodes. How does Spark take care of this? The paper does not seem to mention it.
The purpose of this paper is to present a system for queries on large datasets that lets the user more effectively execute cyclic queries, or queries that require the same working set at different nodes in the distributed system (whereas MapReduce focuses entirely on acyclic queries).
The paper has several main technical contributions. One is the definition of the Resilient Distributed Dataset (RDD), a read-only collection of objects that is partitioned across several machines and is resilient to partition failures (i.e. if a partition fails or is lost, the data that was on it can easily be regenerated). The framework also provides a set of parallel operations that users can use in the programs they write. Spark does seem to provide less abstraction than MapReduce does, but I think the performance increase they demonstrate makes its use appealing enough. They offer three parallel operations similar to map and reduce: reduce, collect, and foreach; however, their reduce operation does not yet support grouped reductions. They also present two kinds of shared variables that can be used with the shared working sets, called broadcast variables and accumulators. Finally, the paper presents an implementation overview as well as empirical results demonstrating how Spark outperforms Hadoop (with MapReduce) on specific machine learning queries.
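The non-grouped reduce noted here can be sketched in plain Python (illustrative names only, not Spark's API): each worker folds its own partition with an associative function, and the driver folds the partial results into a single value; there is no per-key grouped reduce as in MapReduce.

```python
# Sketch of a driver-aggregated reduce over partitioned data.
from functools import reduce as fold

partitions = [[1, 2, 3], [4, 5], [6]]            # dataset split across workers

def add(a, b):
    return a + b                                 # associative combine function

partials = [fold(add, p) for p in partitions]    # done in parallel on workers
total = fold(add, partials)                      # single aggregation at driver
```

Because there is only one output value collected at the driver, a compute-heavy reduce stage cannot be spread over many reducers the way MapReduce's grouped reduce allows.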
This paper has several strong points. I think the examples presented are good and help clarify how the newly introduced RDDs can be used to solve some common machine learning queries. Additionally, the related work section does a good job of pointing out how Spark is similar to and different from existing systems, so it is clearer to the reader where exactly the innovations lie.
As far as weaknesses go, I wish there had been a more thorough performance evaluation of the system. I appreciate the results on the specific machine learning algorithms, but I would have liked more results on general performance, as well as some experiments exploring the tuning of Spark's parameters.
Review: Spark: Cluster Computing with Working Sets|
This paper presents Spark, an advance over MapReduce that can handle cases where data is used iteratively. The motivation for this work comes from the limitation that, though successful for large-scale data-intensive applications, most MapReduce systems cannot handle cyclic data flow, which is common in many machine learning models. Spark addresses this issue by introducing an abstraction: a read-only collection of objects, distributed over a set of machines, that can be rebuilt if a partition is lost.
A strength of the model is that the elements of an RDD need not exist in physical storage; they can be computed from the information in a handle, and therefore recomputed if a node fails. The paper is clear and detailed in explaining the programming model, providing examples, pseudocode, and diagrams. The experimental results also show promising signs of the model's advantages.
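The recompute-from-a-handle idea can be illustrated with a toy Python sketch (my own, not the paper's implementation): a partition stores only its derivation, so the data can be rebuilt on demand after a failure instead of being replicated.

```python
# A partition that remembers only how it was derived (its lineage).
class MappedPartition:
    def __init__(self, parent_data, fn):
        self.parent_data = parent_data  # handle to the parent's records
        self.fn = fn                    # the transformation applied
        self.cached = None              # materialized data, if any

    def get(self):
        if self.cached is None:         # lost, or never materialized
            self.cached = [self.fn(x) for x in self.parent_data]
        return self.cached

part = MappedPartition([1, 2, 3], lambda x: x * 10)
first = part.get()       # materializes [10, 20, 30]
part.cached = None       # simulate losing the node holding this partition
recovered = part.get()   # rebuilt from lineage; no replica was needed
```

The design choice is that fault tolerance costs recomputation time rather than storage and network bandwidth, which pays off when failures are rare relative to reuse.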
However, the model was still a prototype at the time the paper was published. In the experiment section, there are not many comparisons of the proposed model against other systems. It would be a better demonstration if comparisons were made between Spark and MapReduce on tasks that MapReduce is also capable of performing. In addition, more detailed descriptions of the data and tasks used in the experiments would help readers gain a more convincing understanding of the model's capability, and running multiple tests under varying configurations would have made for a more convincing demonstration of the model's performance.
This paper introduces Spark, a cluster computing engine that supports an acyclic data flow model (DAG computing) and enables fault tolerance by capturing data lineage. Spark has been very successful in the area of Big Data because of its impressive performance (outperforming Hadoop on iterative algorithms by 10x) and because it was the first big data computing engine to support low-latency, large-scale interactive computing.
The main data abstraction in Spark is the Resilient Distributed Dataset (RDD), which represents a read-only collection of objects distributed across multiple machines, with each piece forming a partition of the whole RDD. Each partition can be rebuilt from its parent data and the associated computation. RDDs, as a constrained shared-memory abstraction, strike a balance between expressiveness on one hand and scalability and reliability on the other. An RDD can be constructed in four ways: 1. from a file in a shared file system like HDFS; 2. by partitioning a Scala collection; 3. by transforming an existing RDD; 4. by changing the persistence level of an existing RDD. Moreover, each RDD exposes three interfaces: 1. getPartitions, which returns a list of partition IDs; 2. getIterator(partition), which iterates over a partition; 3. getPreferredLocations, which is used by the task scheduler to achieve data locality. Spark achieves data locality using a technique called delay scheduling. Spark also relies heavily on features of the Scala programming language: it uses Scala's closure serialization and functional programming features to achieve distributed computing, and its interpreter to achieve interactive distributed computation.
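The three-method interface listed above can be sketched as a plain Python class. This is a hedged illustration of the shape of the interface only; the real interface is in Scala, and the host names here are invented:

```python
# Sketch of the getPartitions / getIterator / getPreferredLocations
# interface the review describes. Hostnames are hypothetical.
class SketchRDD:
    def __init__(self, partition_data, locations):
        self._data = partition_data   # dict: partition id -> records
        self._locations = locations   # dict: partition id -> hosts

    def getPartitions(self):
        return list(self._data.keys())

    def getIterator(self, partition):
        return iter(self._data[partition])

    def getPreferredLocations(self, partition):
        # Hint for the scheduler to place tasks near their data.
        return self._locations.get(partition, [])

rdd = SketchRDD({0: [1, 2], 1: [3]}, {0: ["node-a"], 1: ["node-b"]})
parts = rdd.getPartitions()           # [0, 1]
rows = list(rdd.getIterator(0))       # [1, 2]
where = rdd.getPreferredLocations(1)  # ["node-b"]
```

The point of the third method is that moving a task to "node-b" is cheaper than moving partition 1's data to the task, which is what delay scheduling tries to exploit.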
On the other hand, the main abstraction for parallel computation is a set of parallel operations that operate on RDDs: 1. reduce, which combines dataset elements using an associative function; 2. collect, which sends all elements of the dataset to the driver program; 3. foreach, which passes each element through a user-provided function. These mirror operations supported by Scala. Spark also introduces two kinds of shared variables suited to the distributed computing scenario: 1. broadcast variables, which are delivered to every worker and play a role like a lookup table; 2. accumulators, which workers can only add to using an associative operation, and which only the driver can read.
The paper also includes a good comparison experiment against Hadoop MapReduce and, similar to Google's MapReduce paper, it introduces the use of Spark by showing several algorithms that can be implemented in it. Note that at the time the paper was published, Spark still had several shortcomings due to its short development period. However, those problems are acknowledged by the paper and discussed in the future work section.
1. This paper introduces Spark, a large-scale distributed DAG computing engine that supports fault tolerance, lazily evaluated caching across clusters, and iterative operation by design. Spark has been considered the next-generation big data framework since its first release.
2. Similar to the MapReduce paper released by Google, this paper introduces the usage of Spark and the Spark programming model with several expressive examples, which is very helpful for the reader's understanding.
3. Spark effectively utilizes features of Scala, a functional programming language that runs on the JVM and is designed for building efficient, scalable applications. By using multiple Scala features, Spark achieves many impressive capabilities with an amazingly small codebase.
1. At the time the paper was published, Spark had several shortcomings due to its short development period. For example, it did not natively support a grouped reduce function, and the shuffle operation was not supported. Moreover, it did not have its own multicast system, which either adds a lot of overhead through Scala's built-in serialization or costs a lot of labor to write a custom multicast system for each driver application, as described for the ALS example.
2. Although it is great that Spark achieves many of its features by exploiting the nature of Scala, this also makes it constrained by the Scala programming language community. As mentioned in the paper, since they use a modified version of the Scala interpreter, careful coordination with the Scala community is needed to make Spark's future development smoother.
3. RDDs, as a shared-memory abstraction, actually trade some system throughput for lower latency. In contrast with MapReduce, which is more suitable for batch processing, Spark is better suited to iterative algorithms and interactive computing.
This paper introduces Spark, a cluster computing framework inspired by MapReduce. Spark optimizes for workloads that reuse a working set of data across multiple parallel operations. To achieve this goal, the paper proposes an abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. A Spark workload is a directed acyclic graph of RDDs. Unlike MapReduce, which materializes all intermediate results on disk, Spark keeps them in memory and recomputes them in case they are lost. In addition, Spark provides two restricted types of shared variables: broadcast variables and accumulators.|
The main contribution of Spark is the proposal and implementation of the RDD. RDDs largely relax the fault-tolerance model of MapReduce, since a lost RDD can simply be re-computed. Taking advantage of this property, Spark stores intermediate results in memory, significantly reducing I/O overhead. Another advantage of Spark is that RDDs are lazily evaluated: in a sequence of operations, the actual file reads and computation are delayed until a result is requested, e.g., when the reduce function is called. This optimization avoids unnecessary data transfer and reduces memory usage.
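The lazy-evaluation point can be made concrete with a toy Python model of my own (not Spark's internals): transformations only record what to do, and nothing executes until a result-producing call such as reduce.

```python
# Transformations are recorded; work happens only at reduce().
class LazySeq:
    def __init__(self, source):
        self._source = source
        self._ops = []          # recorded, not yet executed

    def map(self, fn):
        self._ops.append(fn)
        return self             # still no work done

    def reduce(self, combine, zero):
        # Only here does data flow through the recorded pipeline.
        acc = zero
        for x in self._source:
            for fn in self._ops:
                x = fn(x)
            acc = combine(acc, x)
        return acc

s = LazySeq([1, 2, 3]).map(lambda x: x + 1).map(lambda x: x * 2)
result = s.reduce(lambda a, b: a + b, 0)   # (4 + 6 + 8) = 18
```

Because the whole pipeline is known before anything runs, per-element transformations can be fused into one pass, which is the source of the savings the review describes.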
There is no denying that this paper does great work in most respects. However, I doubt the decision to defer the caching mechanism to the programmer. Though in a simple DAG of RDDs the programmer may have a good view of what should be cached, in a real-world analysis workload spanning a large DAG, deciding which RDDs to cache becomes tedious. It seems feasible for Spark to provide an automatic mechanism that caches intermediate results across iterations; the policy could be speculative, based on the actual workload.
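The automatic caching suggested above is not something the paper provides; as a purely hypothetical sketch of what such a speculative policy might look like, one simple heuristic is to cache an intermediate dataset once it has been recomputed repeatedly, on the bet that it will be requested again:

```python
# Hypothetical heuristic (not in the paper): cache a dataset after it
# has been requested `threshold` times, assuming past reuse predicts
# future reuse. Dataset ids and the compute callback are illustrative.
class AutoCache:
    def __init__(self, threshold=2):
        self.threshold = threshold
        self.counts = {}    # dataset id -> times requested
        self.cache = {}     # dataset id -> materialized result

    def get(self, dataset_id, compute):
        if dataset_id in self.cache:
            return self.cache[dataset_id]
        self.counts[dataset_id] = self.counts.get(dataset_id, 0) + 1
        result = compute()
        if self.counts[dataset_id] >= self.threshold:
            self.cache[dataset_id] = result   # hot: keep it in memory
        return result

cache = AutoCache()
calls = []
def expensive():
    calls.append(1)          # track how often we actually recompute
    return [1, 2, 3]

cache.get("rdd-7", expensive)   # 1st request: computed
cache.get("rdd-7", expensive)   # 2nd request: computed, then cached
cache.get("rdd-7", expensive)   # 3rd request: served from cache
n_computes = len(calls)         # 2
```

A production version would of course also need an eviction policy for when memory fills, which is exactly the concern raised in the first review above.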