Review for Paper: 25- Spark: Cluster Computing with Working Sets

Review 1

MapReduce has been widely used in modern cluster-based data applications. However, its acyclic data flow model is not suitable for applications that reuse a working set of data across multiple parallel operations. The paper proposes Spark, a new framework that supports these applications while retaining the scalability and fault tolerance of MapReduce.

Spark introduces the resilient distributed dataset (RDD), a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. The paper first describes Spark’s programming model and RDDs. It then walks through some examples and the implementation, including the integration into Scala and its interpreter. Finally, the paper presents experimental results and future work.

Some of the contributions and strengths of this paper are:
1. Users can cache an RDD and reuse it across multiple parallel operations, while retaining the scalability of MapReduce (a sketch follows this list).
2. RDDs achieve fault tolerance through lineage: an RDD carries enough information about how a lost partition was derived from other RDDs to rebuild just that partition.
3. Spark reduces disk I/O when an algorithm applies a function repeatedly to the same dataset.
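
As referenced in point 1, here is a minimal, hedged sketch of the caching pattern in the spirit of the paper's Scala examples; the driver-side context object named spark and the HDFS path are assumptions, not code from the paper:

    // Hypothetical driver fragment; assumes an already-constructed context object `spark`.
    val lines  = spark.textFile("hdfs://namenode/app.log")    // RDD backed by an HDFS file
    val errors = lines.filter(_.contains("ERROR")).cache()    // hint: keep this RDD in memory
    // Both parallel operations below reuse the cached partitions instead of re-reading the file.
    val total    = errors.map(_ => 1).reduce(_ + _)
    val timeouts = errors.filter(_.contains("timeout")).map(_ => 1).reduce(_ + _)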

Some of the drawbacks of this paper are:
1. Spark does not have its own file management system; it therefore relies on another platform such as Hadoop (HDFS) or a cloud-based storage platform.
2. Spark does not provide a higher-level interactive interface; building an interface on top of the Spark interpreter, such as SQL and R shells, could be helpful.
3. Spark has no support for real-time processing of live data.



Review 2

Most MapReduce systems are built around an acyclic data flow model that is not suitable for other popular applications, such as those which reuse a working set of data across multiple parallel operations. This includes many iterative machine learning algorithms, as well as interactive data analysis tools. This paper therefore proposes a new framework called Spark that supports these applications while retaining the scalability and fault tolerance of MapReduce, by introducing an abstraction called resilient distributed datasets (RDDs).

The Spark framework consists of a driver program, which implements the high-level control flow of the application and launches various operations in parallel, and two main abstractions for parallel programming: resilient distributed datasets and parallel operations on these datasets (invoked by passing a function to apply on a dataset). In addition, Spark supports two restricted types of shared variables that can be used in functions running on the cluster.
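
A hedged sketch of this model, assuming a driver-side context object named spark; the collection and the squaring function are made up for illustration:

    // Driver program: build an RDD, then invoke a parallel operation
    // by passing a function (closure) that Spark applies on the cluster.
    val nums = spark.parallelize(1 to 1000)         // RDD from a local Scala collection
    val sumOfSquares = nums.map(n => n * n)         // transformation recorded on the RDD
                           .reduce(_ + _)           // parallel operation; result returns to the driver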

The main contribution of this paper is the Spark framework, which is powerful enough to express several applications that pose challenges for existing cluster computing frameworks, including iterative and interactive computations. Besides, the core idea behind RDDs, of a dataset handle that has enough information to (re)construct the dataset from data available in reliable storage, is useful in developing other abstractions for programming clusters.

The main advantages of this model are:
1. It starts with the same concept of being able to run MapReduce jobs, except that it first places the data into RDDs so that the data is stored in memory and is more quickly accessible; i.e., the same MapReduce-style jobs can run much faster because the data is accessed in memory.
2. It supports real-time stream processing and graph processing.
3. Code can be reused for batch processing, joining a stream against historical data, or running ad-hoc queries on stream state.
4. Through RDDs, it provides fault tolerance. Spark RDDs are designed to handle the failure of any worker node in the cluster, aiming to ensure that data loss is reduced to zero.

The main disadvantages of this model are as follows:
1. Spark does not support a grouped reduce operation as in MapReduce.
2. It is expensive: the in-memory approach can become a bottleneck when we want cost-efficient processing of big data, since keeping data in memory is quite expensive, memory consumption is very high, and it is not handled in a user-friendly manner. The model requires a lot of RAM to run in memory, so the cost is quite high.
3. Jobs need to be manually optimized and tuned to specific datasets.
4. The model does not have its own file management system, so it relies on another platform like Hadoop or a cloud-based platform.


Review 3

Problem & motivations:
MapReduce is a powerful template, or protocol, for distributed computing. However, one major disadvantage is that it does not cache or store the intermediate data or the raw input data in memory; MapReduce partitions the data and writes it back to disk.
Therefore, for applications that need to reuse the same data many times, it is very inefficient. The examples provided in the paper are iterative jobs, such as gradient descent (GD) in machine learning, and interactive analytics.

Main Contribution:
The key insight behind the paper is really simple. Rather than pushing all intermediate results or raw data back to disk, Spark provides the user with an interface that allows the reused data to be cached.
The way Spark caches data is through the RDD. From my perspective, RDDs ensure integrity not through parity bits but through lineage: each RDD records how it was derived, so a lost partition can be recomputed. Spark also provides two restricted types of shared variables: broadcast variables, which are read-only, and accumulators, which workers can only add to and which only the driver can read. MapReduce does not have such a feature.

Main Drawback:
The user has to figure out which parts of the data will be reused. It would be better if the system (for example, at compile time) could figure this out automatically.



Review 4

Performing computation efficiently on large datasets is an important task that has important implications in today’s era of big data. To this end, many methods have been introduced, including the widely used MapReduce method. While MapReduce is able to automatically handle parallelizing and distributing computational tasks while achieving scalability and fault tolerance (factors that made it so successful), it has its limitations. For example, some workloads, such as machine learning, involve repeatedly applying some function to the same dataset in order to optimize parameters. Each iteration requires a separate MapReduce job, which is quite inefficient when the data has to be reloaded from disk each time. This same issue of redundant reads from disk happens in Hadoop when doing exploratory data analysis, since each query requires a separate MapReduce job (as opposed to loading the data into memory and doing repeated queries). To address this deficiency, this paper introduces a new computing framework called Spark that aims to support such cyclic data flows while maintaining similar scalability and fault tolerance.

Spark introduces the concept of a resilient distributed dataset (RDD), a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. This allows data to be reused in the case of repeated operations on the same data. RDDs are expressed as Scala objects and support the parallel operations reduce, collect, and foreach. In Spark, programmers can use map, filter, and reduce on their data, similar to MapReduce, as well as broadcast variables, which ensure values are copied only once to each worker (rather than every time a closure is run), and accumulators. Spark is built on top of Mesos, allowing it to run alongside other cluster computing frameworks such as Mesos ports of Hadoop and MPI. When a parallel operation is invoked on a dataset, Spark starts tasks to compute each partition of the dataset, which are then assigned to workers using delay scheduling. Workers then read the data using an iterator. Broadcast variables are saved to a file in a shared file system and are read by workers if the value is not already in a local cache. Accumulators are given unique IDs on creation, and the serialized form contains the ID and a “zero” value for whatever datatype is being used. Workers keep separate copies of the accumulators and send information about their updates to the master, taking care to avoid applying duplicates caused by re-executions after failure. Spark is also integrated into the Scala interpreter, though details are short in this paper due to space limitations.
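
To make the broadcast-variable behavior concrete, here is a hedged sketch (the context object named spark and the lookup table are assumptions for illustration):

    // A large read-only table is broadcast so each worker receives it at most once,
    // instead of it being serialized into every closure that runs.
    val table  = Map("jan" -> 1, "feb" -> 2, "mar" -> 3)
    val bTable = spark.broadcast(table)
    val months = spark.parallelize(Seq("jan", "mar", "jan"))
    val codes  = months.map(m => bTable.value.getOrElse(m, 0)).collect()  // Array(1, 3, 1)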

The main strength of this paper is that it identifies and addresses an important deficiency in MapReduce for a very important class of tasks, repetitive computations on the same data such as in machine learning. In doing so, Spark was able to achieve promising performance results. As experimental results show, Spark ran significantly faster for a logistic regression job, with speedups in the area of 20x and higher, compared to Hadoop, and the performance advantage grew as the number of iterations increased. Again, this is due to being able to greatly reduce or even eliminate unnecessary disk accesses for data that was intended to be reused.

While the results for Spark are certainly impressive, it does suffer from very high memory usage due to its design, where RDDs are continually stored in memory to speed up computation. Additionally, the use of Scala, while it does have some advantages, also trades off some potential performance, as seen in the case where the number of iterations was 1; in that case, Hadoop actually outperformed Spark. Finally, a weakness specific to this paper is that the authors did not comprehensively address the issue of fault tolerance, despite their stated goal of creating a system that is as robust to failure as MapReduce. Again, this is probably a result of paper length requirements rather than any inherent faults with the system.


Review 5

The main contribution of this paper is Spark, a distributed computing scheme that addresses some of the shortcomings of predecessors like MapReduce, such as the inability to handle workloads that cycle data across multiple parallel operations. The paper introduces the contribution of RDDs, which are read-only, resilient distributed datasets. These allow the user to recover from failure by following a lineage, a log that can be used to rebuild the data. RDDs overcome the limitation of their predecessors of expensive reads in the face of workloads that involve iterative jobs or interactive analytics, which look at the same data multiple times. Spark still allows locality-aware scheduling (to limit network bandwidth usage), fault tolerance, and load balancing. The paper also claims that Spark is the first system to allow an efficient, general-purpose programming language to be used interactively to process large datasets on a cluster.

As far as the coding model of Spark, the user must write a driver program that launches tasks in parallel, whose results are combined into the final result. We then look into the contribution of RDDs, which are read-only distributed datasets. RDDs do not need to exist in storage; they can instead be a lineage of instructions for rebuilding them from whatever data is in reliable storage. RDDs can be constructed in several ways that essentially partition a large dataset across multiple nodes. We look into the supported parallel operations like reduce, collect, and foreach, each of which processes data in parallel and combines the results in some way. We next look at shared variables, which come in the form of broadcast variables (like lookup tables read by multiple distributed nodes) and accumulators (like counters).

Next the paper looks into some example computations that can be completed using Spark and breaks down the driver program code that would be written. The first example, error text search, closely resembles MapReduce since it sums over a logical array, while the other examples, logistic regression and alternating least squares, take advantage of Spark's iterative features. Spark is built on top of the “cluster operating system” Mesos and integrated into Scala. Shared variables are implemented by saving the data in a shared file system, with a serializable class that includes a pointer to the data in the shared space. The authors also modified the Scala interpreter, which typically compiles each line of code into a class. They wrote the outputs of the interpreter to a shared filesystem and directly referenced previous lines’ singleton objects in each subsequent line’s singleton object.
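
In the spirit of the paper's logistic regression example, here is a simplified, hedged sketch of the iterative pattern (one-dimensional for brevity; the context object spark, the file path, and the parsing are made up, and the exact accumulator API is assumed):

    case class Point(x: Double, y: Double)
    val points = spark.textFile("hdfs://namenode/points")
                      .map { line => val t = line.split(" "); Point(t(0).toDouble, t(1).toDouble) }
                      .cache()                        // reused by every iteration below
    var w = 0.0                                       // parameter being optimized
    for (i <- 1 to 10) {
      val grad = spark.accumulator(0.0)               // workers add partial gradients
      points.foreach { p =>
        grad += (1.0 / (1.0 + math.exp(-p.y * w * p.x)) - 1.0) * p.y * p.x
      }
      w -= grad.value                                 // only the driver reads the accumulator
    }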

I did not like how the paper covered related work at the end instead of the beginning. I appreciated the brief mention of MapReduce for context, but I wish they had surveyed the state of the art to provide more thorough context for their contribution. I did like how the paper was very clearly styled, so that my eyes could group new terms with their definitions and see which parts of each section were parallel.





Review 6

Spark

This paper introduces Spark, another well-known distributed system that came after MapReduce with the aim of resolving MapReduce's drawbacks. MapReduce is designed specifically for non-iterative data models, and its performance is low when dealing with cyclic data flows. Also, with the booming advent of machine learning and deep learning frameworks and their corresponding requirements, the need to efficiently process large iterative (cyclic) data flows arises. Spark therefore introduces resilient distributed datasets, which are read-only distributed datasets. Spark supports iterative data flow while maintaining the scalability, fault tolerance, and load balancing of MapReduce. The paper reports that Spark can be up to 10x faster than MapReduce on iterative data processing (machine learning).

Resilient distributed datasets are the core of Spark. Each RDD is represented by a Scala object and can be constructed in four ways: (1) from a file in a shared file system; (2) by parallelizing a Scala collection in the driver program; (3) by transforming an existing RDD; and (4) by changing the persistence of an existing RDD. Furthermore, several parallel operations can be performed on RDDs, including reduce, collect, and foreach. Programmers invoke operations like map, filter, and reduce by passing closures to Spark, and two restricted types of shared variables (broadcast variables and accumulators) are also provided. The paper also shows the promise of Spark through three concrete examples: logistic regression, alternating least squares, and the interactive Spark interpreter. The corresponding experiments show that Spark achieves much better performance than MapReduce (Hadoop).
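
A hedged sketch of the four construction routes, assuming a driver-side context object named spark and a made-up HDFS path:

    val fromFile   = spark.textFile("hdfs://namenode/data.txt")   // 1) from a file in a shared file system
    val fromScala  = spark.parallelize(Seq(1, 2, 3, 4))           // 2) by parallelizing a Scala collection in the driver
    val derived    = fromFile.map(_.toUpperCase)                  // 3) by transforming an existing RDD
    val persistent = derived.cache()                              // 4) by changing persistence (cache or save)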

The main contribution of this paper is three simple data abstractions for programming clusters: resilient distributed datasets and two restricted types of shared variables, namely broadcast variables and accumulators. These fundamentally address the low performance caused by frequent data-flow interaction with disk in MapReduce: Spark minimizes expensive I/O by keeping most of the data flow in main memory.

There are several clear advantages of Spark: 1. Speed: compared with MapReduce, memory-based Spark processing is much faster than Hadoop, and even its disk-based computing is faster than Hadoop; Spark can efficiently process data flows. 2. Ease of use: Spark supports APIs in Java, Python, and Scala, which is friendly for prototype development. 3. Spark offers better solutions for machine learning, online streaming, and querying, which makes it more general than Hadoop.

There are also some drawbacks to Spark: unlike Hadoop, Spark executes operations using multi-threading, whereas Hadoop uses multi-processing, which can mean that the distribution of tasks is not ideally even. Spark is not ideal as a stable long-running service; it is mainly good at computing. Overall, however, compared with Hadoop, Spark is a successful model designed to resolve the problems that occur in MapReduce.


Review 7

This paper proposes a new computation model named Spark. Spark differs from MapReduce in that it uses RDDs, which can be viewed as a restricted form of distributed shared memory (DSM), so that intermediate results can be stored in memory to reduce disk read/write overhead.

Spark is a big data parallel computing framework based on in-memory computing. By computing in memory, Spark improves the real-time performance of data processing in a big data environment, while ensuring high fault tolerance and high scalability. It allows users to deploy Spark on a large amount of inexpensive hardware to form a cluster.

The advantages of Spark over Hadoop MapReduce are as follows.

MapReduce-based computing engines typically output intermediate results to disk for storage and fault tolerance. Spark abstracts the execution model into a general directed acyclic graph (DAG) execution plan, which allows multiple stages of tasks to be executed in series or in parallel without having to write intermediate stage results to HDFS.

MapReduce spends a lot of time sorting data before the shuffle, whereas Spark can alleviate this overhead: Spark tasks do not need sorted output in every shuffle scenario, so hash-based distributed aggregation is supported, a more general task execution plan is used in scheduling, and the output of each round is cached in memory.

Spark's contribution mainly lies in offering a solution to big data problems that can replace MapReduce and address its pain points. It achieves significant speedups on common big data tasks, and it provides a faster, scalable, fault-tolerant approach for iterative algorithms. Spark supports fast application programming in Java, Scala, and Python, and provides over 80 high-level operators, making it easy to write parallel applications. Its shared-resource model across multiple frameworks helps mitigate the imbalance in which tasks are crowded at peak times while resources sit relatively idle at other times.

The weak point of this paper is that it only compares Spark with Hadoop, and not with other ongoing distributed systems. The results would be more interesting and convincing if compared with other projects.





Review 8

The paper starts by describing a limitation of MapReduce: it only works for data that can be modeled as an acyclic data flow graph. Spark is an extension of MapReduce that allows intermediate datasets to be reused by multiple operations, making it ideal for iterative machine learning and data analysis. The main concept introduced is the RDD, the way data is stored so it can be reused. It is kept partitioned across machines, and if a machine fails, it can be recreated. RDDs can be created in a number of ways, including from a file or from another RDD. For certain jobs, Spark can improve performance (in terms of speed) by up to 10x.

This is a strong paper that very clearly describes its main contribution: RDDs. Presumably this was written at a time when extensions of MapReduce were not a novelty at all. I had heard of Spark many times, but besides associating it with the word “streaming,” I really had no clue how it was different from MapReduce. The paper makes it immediately clear and gives concrete reasons that made me understand when it would be useful to use Spark instead of vanilla MapReduce.

Compared to MapReduce, I noted that Spark requires more thought from the user. One example is that the user can alter the persistence of an RDD through the cache and save actions. This seems good conceptually, but I did feel as though it gave up a bit of the simplicity that makes MapReduce a good abstraction: MapReduce is great because users don't need to think about parallelism in the traditional sense. Additionally, I thought that the way examples were presented in this paper was not as strong as their presentation in the MapReduce paper. In particular, some examples required a significant amount of setup; it seems as though there should have been some examples that would be common knowledge, or that these examples (such as simple logistic regression) could have been assumed as common knowledge. Overall, despite my critiques, I thought this was a great paper, but it had clear space limitations and seemed to have been written fairly early in the evolution of Spark as a framework.



Review 9

This paper introduces Spark, a programming framework designed specifically for applications that reuse a working set of data across multiple parallel operations while retaining the scalability and fault tolerance of MapReduce. Although MapReduce is quite successful, the problem with this model is that when the application wants to reuse a working set of data, each MapReduce job must reload the data from disk, incurring a significant performance penalty. Spark solves this problem by introducing an abstraction called resilient distributed dataset (RDD).

RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. The rebuilding is possible since internally, an RDD saves a chain of objects that describe how it can be derived from its parents. If a node fails, the system can follow the lineage of those lost partitions and re-create them. By default, RDDs are lazy and ephemeral. This means partitions of a dataset are materialized on demand and discarded from memory after use. However, it’s possible to cache a dataset so that later operations do not need to load the data from disk again.

There is a set of operations that can be applied to an RDD. Typically, an RDD is first created from files and then transformed into another RDD through flatMap, map, or filter operations. Changing the persistence of an existing RDD (cache or save) also creates a new RDD. Internally, these operations just create a lineage chain specifying the parent and what the transformation is. Other operations include: reduce (combine dataset elements using a function), collect (retrieve all dataset elements), and foreach (execute a function for all elements).
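
A hedged sketch of such a lineage chain (the context object spark and the file path are assumptions; the exact API may differ):

    // Each transformation only records lineage (parent RDD plus the function applied);
    // nothing is computed until a parallel operation such as reduce or collect runs.
    val words   = spark.textFile("hdfs://namenode/docs.txt")
                       .flatMap(_.split(" "))
                       .map(_.toLowerCase)
                       .filter(_.nonEmpty)
    val longest = words.reduce((a, b) => if (a.length >= b.length) a else b)  // triggers the computation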

The final piece of Spark is shared variables. There are two types of shared variables: broadcast variables and accumulators. The first is useful when a large read-only piece of data is used in multiple operations; it is wasteful to send the same data over the network to the workers multiple times, and a broadcast variable ensures the data is copied to each worker only once. Accumulators can be used to implement counters and are handled in a fault-tolerant way.

One downside of this paper is that the evaluation is very limited. For each experiment described, only one trial is mentioned. There is no figure showing how the performance of Spark varies as the size of the dataset or the hardware configuration changes. The performance of Spark on large datasets or large clusters is also not clear.





Review 10

In the paper "Spark: Cluster Computing with Working Sets", Matei Zaharia and colleagues develop Spark, a system similar to MapReduce that focuses on reusing working sets of data across multiple parallel operations. MapReduce has become increasingly popular, but is deficient in two use cases that Spark deems important: iterative jobs and interactive analysis. For iterative jobs, many machine learning algorithms optimize a parameter through multiple passes over a dataset; reloading the data from disk each time carries a huge penalty and decreases performance. Likewise, for interactive analysis, HCI experts assert that humans have a short attention span; users expect response times of under two seconds, otherwise interactivity is lost. Current methods such as Hadoop have latency of up to 10 seconds because each query runs a MapReduce job and reads from disk. Zaharia wants to optimize these two use cases while retaining scalability and fault tolerance. Since there are programmers who deal with this type of work, it is clear that this is a problem worth investigating.

The main contribution of this paper is the resilient distributed dataset (RDD). RDDs are read-only collections of objects that are partitioned across sets of machines and can be rebuilt if a partition is lost. Since RDDs need not exist in physical storage and a handle contains information about their lineage, they can always be reconstructed from other RDDs if a particular node fails. RDDs are constructed in four ways:
1) From a file in a shared file system
2) By parallelizing a Scala collection in a driver program
3) By transforming an existing RDD
4) By changing the persistence of an existing RDD
This is done so that Spark programs can keep working, at reduced performance, if nodes fail or if the dataset is too large to fit in memory. The goal is to balance the trade-offs between the cost of storing an RDD, the speed of accessing it, the probability of losing part of it, and the cost of recomputing it. Furthermore, several parallel operations are supported (a short sketch of these appears after the workflow note below):
1) Reduce: Combines dataset elements with an associative function to get a result in the driver program.
2) Collect: Sends all elements of a dataset to a driver program.
3) Foreach: Passes each element through a user function.
Currently, reduce results are only collected at one process, the driver - unlike MapReduce, which offers support for a grouped reduce. Lastly, Spark allows users to create restricted shared-variable types to support simple and common usage patterns:
1) Broadcast Variables: Enables large reads to be distributed over workers only once.
2) Accumulators: Used to implement counters like MapReduce and provide syntax for parallel sums.
Workflow: a parallel operation is invoked on a dataset -> Spark creates a task for each partition and sends the tasks to workers using delay scheduling -> each worker reads its partition through an iterator -> the closures (and any variables they capture) are serialized and shipped to the workers.
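
As referenced above, a hedged sketch of the three parallel operations (assuming a driver-side context object named spark); note that results come back to the single driver process:

    val nums  = spark.parallelize(1 to 100)
    val total = nums.reduce(_ + _)              // Reduce: associative combine; the result lands at the driver
    val all   = nums.collect()                  // Collect: send every element of the dataset to the driver
    nums.foreach(n => println(n))               // Foreach: pass each element through a user function (runs on the workers)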

Much like other papers, this one has some drawbacks. The first drawback I see is that they were unable to reach their goal for interactivity. They pay a huge initial cost on the first iteration compared to Hadoop, but further iterations average around 6 seconds. This is still shy of 2 seconds, the golden standard for interactivity. Another drawback is that Spark seems to be at a very early stage. They do mention some improvements as future work, but I felt that some of the discussion about the implementation was cut short because of the current state of Spark.


Review 11

This paper describes Spark, a system that builds on MapReduce in order to improve on specific types of jobs that MapReduce is generally slow at. Generally, MapReduce struggles when it has to use the same data several times, as in iterative jobs or interactive analytics; in these situations, it reloads the data from disk every time. Spark introduces resilient distributed datasets (RDDs), which are easy to read from several times, to solve this problem. Spark itself is implemented in Scala, a statically typed language for the Java VM.

An RDD can have one of several underlying storage mechanisms, such as a file, a Scala array, or another RDD. However, the method of storage doesn’t change how the system interacts with it. Any RDD must implement the following functions, which let the system use any RDD in the same way (a hedged sketch of this interface follows the list):
getPartitions, which returns all partition IDs
getIterator(partition), which allows the user to iterate over a partition
getPreferredLocations(partition), which get the locations that are optimal for operations on that partition.
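
A minimal Scala sketch of that interface, with simplified types chosen for illustration rather than taken from the actual implementation:

    // Hypothetical, simplified rendering of the common RDD interface described above.
    trait RDD[T] {
      def getPartitions: Seq[Int]                             // IDs of all partitions of this dataset
      def getIterator(partition: Int): Iterator[T]            // stream the elements of one partition
      def getPreferredLocations(partition: Int): Seq[String]  // nodes where this partition is cheapest to process
    }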

Whenever a parallel operation is performed on an RDD, Spark generates a task to process each of the RDD's partitions and sends each task to the partition's preferred location. Whichever worker receives a task then iterates through the entire partition in order to process all of its information.

Since RDDs are partitioned across multiple locations, they generalize to large distributed systems. In addition, RDDs have fault tolerance to account for failed nodes.

The major importance of Spark lies in the ability to reuse data. Since disk IOs consume a significant amount of time, reducing these is a large time-saver, and so keeping reusable data in memory gives Spark a large advantage over MapReduce. In the experiments, Spark was compared to MapReduce on Hadoop, and Spark ran from 2.8x to 35x faster on various jobs after it had loaded the relevant data once.

Some disadvantages of Spark are the need for user programming and its similarity to MapReduce in early iterations. In the former case, the user still needs to manually construct the logic of their jobs, to an extent, unlike a SQL-based system, where the entire job can be parsed and built by the system itself. In addition, the first iteration of every Spark job in the experiments seems equivalent to or worse than MapReduce's performance, which means that Spark may not be a useful substitute for the things MapReduce already does well.



Review 12

This paper presents an in-memory cluster computing system called Spark, which is claimed to be faster than current distributed computing systems like Hadoop by an order of magnitude. It uses RDDs, which can be viewed as a restricted form of distributed shared memory (DSM), so that intermediate results can be stored in memory to reduce disk read/write overhead. With the lineage property of RDDs, Spark doesn’t have to make disk checkpoints for recovery, since lost data can be reconstructed from existing RDDs.
Strengths:
(1) RDDs greatly reduce the read/write operations on disk compared to Hadoop. There are also other optimizations to avoid disk reads/writes; for example, when a node receives a broadcast, it first checks whether the data is already in its memory. It is safe to say that disk read/write operations are the bottleneck for Hadoop.
(2) Spark is based on the assumption that recomputing an intermediate partial result is faster than backing up the whole result to stable storage. It traces the data transformation log (lineage) so that it can recover results when data is lost.
(3) Good performance on iterative jobs. Machine learning algorithms apply a function to the same set of data multiple times; MapReduce reloads the data from disk repeatedly and hence is not suitable for iterative jobs.
Weak points:
(1) It may perform worse than MapReduce on workloads without iterations, or on memory-intensive workloads that need to store intermediate results on disk.




Review 13

MapReduce has been very successful for many data-intensive applications, but it is not designed for cyclic data models. This paper therefore focuses on applications that reuse a working set of data across multiple parallel operations; one example of such an application is an iterative machine learning algorithm. The proposed framework is called Spark, and it introduces an abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. The most important aspect of RDDs is that they achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition. Therefore, users can explicitly cache an RDD in memory across machines and reuse it in multiple MapReduce-like parallel operations.

The paper then introduces the programming model of Spark. Spark provides two main abstractions for parallel programming: resilient distributed datasets and parallel operations on these datasets. There are four ways to construct an RDD: from a file in a shared file system, by parallelizing a Scala collection in the driver program, by transforming an existing RDD, and by changing the persistence of an existing RDD. Note that if there is not enough memory in the cluster to cache all partitions of a dataset, Spark will recompute them when they are used. There are three main parallel operations that can be performed on an RDD: reduce, collect, and foreach. Spark uses shared variables such as broadcast variables and accumulators to make operations more efficient. Broadcast variables are used when a large read-only piece of data is used in multiple parallel operations. Accumulators are variables that workers can only add to using an associative operation; they can be used to implement counters as in MapReduce.

The paper then shows examples of Spark programs: text search, logistic regression, and alternating least squares. It continues by discussing how Spark is implemented. Internally, each RDD object implements the same simple interface, which consists of three operations: getPartitions, getIterator, and getPreferredLocations. The first returns a list of partition IDs, the second iterates over a partition, and the third is used in task scheduling to achieve data locality. To show how well Spark works, the paper evaluates it on logistic regression, alternating least squares, and interactive Spark. The results show Spark's promise as a cluster computing framework.

An advantage of this paper is that it is concise, with a reasonable amount of background given on MapReduce. It first introduces the main features of Spark and why they are important, and then gives examples of such applications. This sequence helps readers understand Spark.

The disadvantage is that there should be more background about MapReduce. I myself know MapReduce, so it is not hard for me to understand this paper, but for those who do not know MapReduce, this paper may not be so clear.


Review 14

“Spark: Cluster Computing with Working Sets” by Zaharia et al. describes Spark, a new cluster framework for large-scale, distributed computing. Prior systems like MapReduce and its extensions similarly support large-scale, distributed computing but were tailored to acyclic data flow applications, where data is processed once and not reused throughout the lifecycle of a longer application. Spark was designed to support applications that do reuse data over time, for example iterative jobs and interactive analytics. Spark supports these use cases through a new abstraction, resilient distributed datasets (RDDs), where read-only data is partitioned across multiple machines for reuse and parallel processing, and can be rebuilt if a machine fails. The paper describes the different ways RDDs can be constructed, the parallel operations that can be performed on them, and the shared variable types they support (broadcast variables and accumulators). The paper also discusses how RDDs are implemented, in particular the lineage data structures that keep track of operations performed on data, so that the processed data can be rebuilt if a partition is lost. The authors evaluate Spark, both against the baseline of Hadoop and against different usages of itself. Spark performed faster than Hadoop on a logistic regression job when multiple iterations were run, because cached data could be reused. For alternating least squares, Spark without broadcast variables spent much time resending a large data matrix across iterative jobs; using broadcast variables reduced this time.

Having read the MapReduce paper already, I found that the Spark introduction makes clear the disk-read challenges MapReduce faces for iterative jobs and interactive analytics. Also, the code examples in Section 3 make how Spark works more concrete.

I think it could have been clearer how the data partitions are made; perhaps a lower-level comparison of data partitioning in MapReduce vs. Spark would have helped. Also, although it seems intuitive that applications that do not need to use particular data multiple times would perform better with MapReduce, it would have been nice to see experimental results showing exactly which applications do better with MapReduce than with Spark.


Review 15

This paper proposes Spark, a new framework focused on applications that are not suitable for an acyclic data flow model, while retaining the scalability and fault tolerance of MapReduce.

The main contribution of Spark is an abstraction called the resilient distributed dataset (RDD), a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. The other abstraction for parallel programming is parallel operations on these datasets. Unlike MapReduce, Spark's "reduce" operation does not support a grouped reduce.

A strong part of the paper is that it is very clear and easy to read, and the structure of the paper is good. For such a technical paper, I really liked reading the examples given in the paper; they make the technical points clear and help readers understand what is going on and why it is done that way.

The weak part I want to point out is about the technique itself. Communication between different cluster nodes is difficult, so it is hard to implement sophisticated algorithms that need to share state or information between nodes; I think this is the same problem as with MapReduce. Also, it is reported online that "Spark also experiences significant object churn when engaged in computationally demanding activities like machine learning. I have seen garbage collection bring calculations to a halt on several occasions." I have not used Spark myself, so I did some searching online to learn about its limitations.


Review 16

In this paper, the authors propose a novel programming framework called Spark. Spark was designed to solve some problems seen in the traditional MapReduce model proposed by Google. The authors point out that there are many applications that involve making multiple passes over data, such as machine learning algorithms. However, with traditional MapReduce, we have to do many reads and writes from disk, which imposes a heavy I/O cost. Besides, MapReduce is also not very helpful for interactive queries: to use MapReduce, you have to write a job and then run that job to get a result. Even if you use a system like Hive to convert SQL into MapReduce jobs, the overhead is still large due to the I/O cost. Spark comes to the rescue, focusing on the performance of applications that reuse a working set of data across multiple parallel operations. In the next part, I will summarize the crux of Spark's design as I understand it.

The most important data model introduced in Spark is the Resilient Distributed Dataset (RDD), a variable representing a collection of data objects that are distributed across many different machines. RDDs can be initialized from several data sources. Just like regular variables, the Spark RDD type provides operators. RDD operators are parallel operations like foreach or reduce, which call an associative function to combine the elements of the RDD into a result. MapReduce operations can be simulated using these operators. However, unlike in MapReduce, these RDDs act like variables and persist, so RDDs can be used in an iterative way. The user is able to provide hints to the system that some RDDs will be reused, causing Spark to cache them in main memory so that they can be accessed more efficiently the next time; this can increase speed greatly. RDDs are often created by operations on other RDDs, so each RDD can keep track of the parent RDDs that created it. This increases the reliability of RDDs because they know how to rebuild lost data. Besides, Spark is implemented in Scala, a statically typed language that runs on the JVM. Scala has good support for functional programming, which makes Spark programming much easier. In the experimental part, we find that Spark can outperform Hadoop by 10x on iterative machine learning jobs, which is very impressive.

Spark is definitely a very successful open-source product and is still a very popular platform. There are several advantages that have made Spark successful. First of all, Spark is easy to deploy and use; software engineers can save time by using Spark and can avoid most of the unnecessary programming for big data processing at scale. Next, Spark makes heavy use of main memory and caching. In this way, it reduces the I/O problem of the traditional MapReduce model and improves efficiency: intermediate results are kept in memory, making a sequence of operations faster. Besides, Spark supports interactive queries on data, which makes it more user-friendly.

There are some disadvantages to this paper. Although Spark makes great use of RAM and caching, the paper doesn’t say much about how to deal with memory limitations: if the memory is not enough for some workload, it is unclear how Spark performs in such a scenario. I think Spark uses a kind of shared-memory abstraction, which creates a trade-off between throughput and latency. As a result, I think Spark may not be as good for pure batch-processing algorithms.




Review 17

This paper introduces Spark, a framework that accomplishes the same task as MapReduce: parallelizing operations for the developer. However, its main contribution is to improve upon MapReduce by supporting operations that don’t fit into the rigid acyclic flow that MapReduce requires. In particular, it provides support for keeping a “working set” of data that can be efficiently reused by the operations involved in a Spark run. It does this by introducing the notion of resilient distributed datasets (RDDs). An RDD is a read-only set of data that is partitioned similarly to how data is partitioned in MapReduce, but RDDs are more efficiently distributed because any particular partition of an RDD only needs to store enough information to be able to *compute* the partition when necessary. This is good for both efficiency and recovery.
By doing this, Spark allows a wider variety of operations to be performed, including some very common tasks in machine learning such as gradient descent. This is its greatest contribution, as many of the other high-level Spark concepts are very similar to MapReduce. One weakness of the paper is that Spark was still at a very early stage at the time of writing, so the experimental results aren’t as robust as they could have been at a more mature stage of implementation.