This paper introduces the main concepts behind MapReduce, a widely used programming model. The paper first gives an overview of the implementation, including what the workers and the master do, the data structures and storage needed, fault tolerance, and optimizations such as scheduling workers close to the input files for locality and using backup tasks to cover tasks that take longer than others. Then, the paper discusses further refinements that extend MapReduce. The one I like most is the partitioning function, which lets users partition data manually; the combiner function is also a great idea since it greatly reduces network pressure. Another interesting refinement is skipping bad records, which allows the system to avoid deterministic failures at run time. To show the performance of the system, the paper presents tests on a Grep and a Sort workload. One thing I'd like to see is an analysis of how the runtime behavior changes with cluster size and with the ratio of map to reduce workers; unfortunately, the paper does not analyze these attributes. To conclude, the MapReduce model is an easy-to-use abstraction over a parallel and distributed system. I enjoyed reading this paper since it has a reasonable amount of technical detail, clear examples of simple use cases of the model, good explanations of the motivation behind the system architecture and refinements, and a good analysis of the performance tests. As mentioned above, I would like this paper better if it also presented tests and analysis of scalability. |
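The partitioning and combiner refinements highlighted in this review are easy to picture in code. Below is a minimal sketch, not the paper's actual API: `hash_partition` mirrors the default hash(key) mod R rule, `url_host_partition` is in the spirit of the hostname-based partitioner the paper gives as an example of a user-supplied function, and `combine_counts` shows how a combiner can pre-sum word counts on the map worker before anything crosses the network.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Default rule from the paper: hash(key) mod R spreads keys roughly
# evenly across the R reduce tasks / output files.
def hash_partition(key, num_reduce_tasks):
    return hash(key) % num_reduce_tasks

# User-supplied partitioner (illustrative): route URLs by hostname so
# all pages from one site end up in the same output file.
def url_host_partition(url, num_reduce_tasks):
    return hash(urlparse(url).netloc) % num_reduce_tasks

# Combiner (illustrative): partially sum word counts on the map worker
# before anything is sent over the network to the reducers.
def combine_counts(pairs):
    partial = defaultdict(int)
    for word, count in pairs:
        partial[word] += count
    return sorted(partial.items())

if __name__ == "__main__":
    print(combine_counts([("the", 1), ("cat", 1), ("the", 1), ("the", 1)]))
    # [('cat', 1), ('the', 3)]
    print(url_host_partition("http://example.com/a.html", num_reduce_tasks=4))
```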
In recent years, Google has implemented many tools to process large amounts of raw data such as crawled documents and web request logs. However, these growing data sets and computations have to be distributed across hundreds or thousands of machines, which gives rise to complex code to coordinate the distributed data and machines. MapReduce was designed in response to this complexity, built around a map and a reduce function. The map function processes a key/value pair to generate a set of intermediate key/value pairs, while the reduce function merges all intermediate pairs created by the map function. The paper first presents the programming model with some examples, then describes the MapReduce interface for a cluster-based computing environment. The next section shows experimental results on multiple tasks. Finally, the paper ends with MapReduce's application to the production indexing system and a discussion of future work. Some of the strengths and contributions of this paper are: 1. MapReduce encapsulates the details of parallelization, fault tolerance, locality optimization, and load balancing, which makes it easier to use. 2. MapReduce can be applied to many fields, including web search services, data mining, machine learning, and so on. 3. MapReduce automatically replicates data and creates multiple copies, which ensures that data is not lost during failures. Some of the drawbacks of this paper are: 1. Data security is a concern because no security measures are built into the map function. 2. MapReduce is not efficient for small data sets. |
In order to process the large amounts of data at Google, computations have to be distributed across hundreds or thousands of machines so as to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code. Therefore, this paper proposes a new abstraction that expresses these simple computations while hiding the messy details of parallelization, fault tolerance, data distribution, and load balancing in a library. The programming model consists of two steps: 1. The Map function takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key and passes them to the Reduce function. 2. The Reduce function accepts an intermediate key and a set of values for that key. It merges together these values to form a possibly smaller set of values; typically just zero or one output value is produced per Reduce invocation. The advantages of the model are as follows: 1. The model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and load balancing. 2. A large variety of problems are easily expressible as MapReduce computations. 3. The paper develops an implementation of MapReduce that scales to large clusters comprising thousands of machines. The implementation makes efficient use of these machine resources and is therefore suitable for many of the large computational problems encountered at Google. The main contribution of the paper is a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs. The paper also summarizes some important lessons. First, restricting the programming model makes it easy to parallelize and distribute computations and to make such computations fault-tolerant. Second, network bandwidth is a scarce resource, so a number of optimizations in the system are targeted at reducing the amount of data sent across the network: the locality optimization reads data from local disks, and writing a single copy of the intermediate data to local disk saves network bandwidth. Third, redundant execution can be used to reduce the impact of slow machines and to handle machine failures and data loss. However, the MapReduce model also has its limitations. 1. It is not suitable for real-time processing and streaming data, for jobs whose intermediate processes need to talk to each other (jobs run in isolation), or for processes that require a lot of data to be shuffled over the network; MapReduce is best suited to batch processing of huge amounts of data offline. 2. It is not always easy to implement everything as a MapReduce program. 3. It may not be worthwhile when the desired result can be obtained with a standalone system, which is obviously less painful to configure and manage than a distributed system. 4. It may not work well for OLTP needs, since MapReduce is not suitable for a large number of short online transactions. |
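The two-step model summarized above can be sketched as a tiny serial driver: map every input record, group the intermediate values by key, then reduce each group. This is only an illustrative single-machine sketch (the function names are mine, not the library's); the real system distributes each phase across thousands of workers.

```python
from collections import defaultdict

def run_mapreduce(map_fn, reduce_fn, inputs):
    # Map phase: every input record yields a list of intermediate pairs.
    intermediate = []
    for key, value in inputs:
        intermediate.extend(map_fn(key, value))
    # Shuffle: group all intermediate values that share a key, which is
    # what the MapReduce library does between the two phases.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: merge each key's values into a (usually smaller) output.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

if __name__ == "__main__":
    # Reverse web-link graph (one of the paper's examples): map emits
    # (target, source) for each outgoing link; reduce collects all sources.
    pages = [("a.html", ["b.html", "c.html"]), ("b.html", ["c.html"])]
    link_map = lambda source, targets: [(t, source) for t in targets]
    link_reduce = lambda target, sources: sorted(sources)
    print(run_mapreduce(link_map, link_reduce, pages))
    # {'b.html': ['a.html'], 'c.html': ['a.html', 'b.html']}
```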
Problems & Motivations: Back in 2004, Google had already embraced distributed computing. The problem is that for every specific task it is wasteful to rewrite the machinery needed for the same general properties: parallelizing the computation, distributing the data, and handling failures. Therefore, Google wanted a generalized template for distributed computing so that developers could write distributed computations with minimal changes. Main Achievement: The main achievement is the famous MapReduce model itself. It has two interfaces. One is the "map" function, which takes a key/value pair and maps it to a list of intermediate (key, value) pairs. The other is the "reduce" function, which takes the intermediate pairs and operates on all intermediate values associated with the same intermediate key. There are three main concerns: 1. How do we parallelize the computation? The answer is that we divide the problem into map and reduce phases, and all operations within the same phase can run simultaneously. 2. How do we distribute the data? This is done by introducing the partition function: we distribute data evenly to map workers and use the partition function to make sure each reduce worker receives all values associated with a given key. 3. How do we handle failures? A master takes control of the whole system; it is responsible for assigning tasks to each worker and handling fault tolerance. Drawbacks: Not all problems divide cleanly into a series of map and reduce phases; consider the Distributed Grep example, where the reduce phase is essentially useless. Also, the master is a single point of failure: if the master goes down, the MapReduce job aborts according to the paper (I think recovery would be better, but the paper considers aborting acceptable). |
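The Distributed Grep point is easy to see concretely. In the paper's formulation, the map function emits a line if it matches the pattern and the reduce function is just the identity, which is exactly why the reviewer calls the reduce phase useless here. A minimal sketch (names and the pattern are illustrative):

```python
import re

# Map emits a (file, line) pair only when the line matches the pattern.
def grep_map(filename, line, pattern=re.compile(r"error")):
    if pattern.search(line):
        yield (filename, line)

# Reduce is the identity: it just copies the matching lines to the output,
# which is the reviewer's point about the phase being unnecessary here.
def grep_reduce(key, lines):
    return list(lines)

if __name__ == "__main__":
    records = [("log0", "all ok"), ("log0", "disk error"), ("log1", "error rate 3%")]
    matches = [pair for fname, line in records for pair in grep_map(fname, line)]
    print(matches)  # [('log0', 'disk error'), ('log1', 'error rate 3%')]
```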
In recent years, companies and organizations have had to grapple with how to handle ever-growing amounts of data and, more importantly, how to derive insights from them in a timely manner. One commonly used method is to parallelize the task, so that subtasks are handled by other processors or computers in order to cut down on the total computation time. This can be an especially powerful technique, but the task of parallelizing is often daunting and time consuming for people who are not trained in programming parallel and/or distributed systems. Potentially, handling all of these implementation details could take more time than the amount saved by running the calculations on multiple machines or processors. In order to address this need, Jeffrey Dean and Sanjay Ghemawat at Google introduced MapReduce, a programming model that automatically handles many of the challenges of parallel and distributed computing, such as data partitioning, inter-machine communication, and failure tolerance, while achieving high performance. By abstracting out these low-level details, it allows programmers without specialized training to run their tasks while efficiently using the resources of a large distributed system. It was designed to run on a cluster of many relatively inexpensive machines, and is highly scalable. MapReduce is split into two functions, Map and Reduce, which are run in that order. Map takes an input key/value pair and returns a set of intermediate key/value pairs, and all values associated with the same intermediate key I are grouped together. These are passed into Reduce, which processes the intermediate key/values, performs some user-specified computation such as summation, and returns the result(s). Intermediate values are supplied to Reduce through an iterator in order to deal with lists of values too large to fit in memory. As previously mentioned, MapReduce handles parallelization and distributed computing automatically, though the exact implementation can depend on the computing environment it is meant to run in; in this paper, the details are specific to the resources available at Google. First, the library automatically partitions the input data into a set of M splits (16-64 MB per piece), which can be processed in parallel. Calls to Reduce are then distributed across machines by using a partitioning function to split the intermediate key space. Tasks are assigned to workers by the master, which also forwards the on-disk locations of the data the workers need. Results of each Reduce are appended to a final output file. This is done until all tasks are completed. In order to deal with possible failures, the master pings each worker periodically, and if no response is received, the worker is deemed to have failed. Any map tasks associated with this worker are reset to the initial state and reassigned to other workers. Should the master fail, the MapReduce operation is simply aborted, though the authors note that since there is only one master, the probability of a master failure is low. Along with fault tolerance, MapReduce also tries to take advantage of data locality in order to minimize network communication overhead, i.e., running tasks on the machines that already have the needed data. In addition, running redundant backup tasks in cases where one or more machines take unusually long to complete a task minimizes the effect of stragglers and reduces the worst-case total compute time. 
Besides this, the authors mention additional refinements such as skipping bad records, local execution for debugging purposes, and status information, which generally enhance the usability and performance of MapReduce. The main strength of this paper is that it introduced a new method (at the time) for automatically handling the details behind running computations on a parallel and/or distributed system. As the results show, the system works as intended on Google's compute cluster, and is quite robust to failure, despite their efforts to stress-test the system by intentionally killing large numbers of worker processes. The widespread adoption within Google for computational tasks at the time this paper was published speaks to its success in opening up parallel and distributed computing to everyone. Given how widely known and used it has become since its introduction, it is safe to say that MapReduce has been an important contribution to its field. One weakness of the implementation of MapReduce in this paper is that it does not handle cyclic computations, where the data stays in place and is worked on by various operators repeatedly, such as parameter optimization in machine learning; doing this requires MapReduce to be called repeatedly, potentially reducing performance. |
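The local-execution debugging facility mentioned in this review is easy to picture: run the whole job sequentially on one machine and optionally restrict it to a single map task to reproduce a problem. The paper describes flags for this; the harness below is only an illustrative stand-in with invented names and an assumed split rule.

```python
# Run the map tasks of a job sequentially on one machine, optionally
# restricting execution to a single map task to reproduce a crash.
def local_run(map_fn, records, num_map_tasks=4, only_task=None):
    splits = [records[i::num_map_tasks] for i in range(num_map_tasks)]
    intermediate = []
    for task_id, split in enumerate(splits):
        if only_task is not None and task_id != only_task:
            continue  # skip every task except the one under investigation
        for record in split:
            intermediate.extend(map_fn(record))
    return intermediate

if __name__ == "__main__":
    word_map = lambda line: [(w, 1) for w in line.split()]
    docs = ["the cat", "the dog", "a cat"]
    print(local_run(word_map, docs, num_map_tasks=2, only_task=1))
    # [('the', 1), ('dog', 1)]
```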
The contribution of this paper is describing Google's MapReduce and providing an implementation of it that works on large clusters of computers. MapReduce is a scheme for large-scale distributed computing. It automatically takes care of parallelization, execution scheduling, failure handling, and managing communication between machines; the user does, however, need to implement a map and a reduce function. The map function is a simple computation that runs in parallel across many different machines, taking a (key, value) pair as input and emitting a set of intermediate (key, value) pairs. The reduce function combines the intermediate (key, value) pairs into a set of results for each unique intermediate key. Next the paper describes a few examples of distributed computing problems that can be computed with MapReduce, like word-count, distributed-grep, and url-access-frequency, and explains what each one's map and reduce functions would look like. The paper then describes how to implement the MapReduce system, which can vary depending on the environment you are distributing computation across; it focuses on Google's environment of large clusters of commodity PCs connected with switched Ethernet. Then we look at the sequence of events that take place when you run MapReduce. Multiple copies of the program are run on different computers in the network, one of which is a special master that schedules and assigns map and reduce tasks to idle worker PCs. The workers compute their map functions and write results partitioned into a user-specified number of partitions. Finally the partitions are sorted on a reduce worker PC and the results are computed for each of the unique intermediate keys. When all of the results are computed, they are appended to an output file. Next we see how the master keeps track of a state (idle, in-progress, completed) and a worker identifier for each of the map and reduce tasks. The master also guides computed data from map machines to available reduce machines. The master sends a sort of heartbeat to each of its worker machines, and if a ping is not answered by a worker, it is marked as failed and its tasks are set back to the idle state. Each such task is then rescheduled to an available machine, and any reduce workers are notified to take the computed intermediate data from the rescheduled machine and not the original. If the master fails, since there is only one, the whole computation fails, and the user can choose to rerun it in that case. The system also attempts to preserve network bandwidth by running as many computations as possible on local input data. The paper also mentions that the number of map tasks and reduce tasks should be much larger than the number of worker computers for dynamic load balancing and eased failure recovery. When MapReduce is near completion, it also executes backup tasks for all in-progress tasks, which takes care of straggler workers since whichever result completes first, the original or its backup, is used. The paper then describes smaller utilities that were helpful in the implementation, like a partitioning function and a combiner function that helps in the reduce step. Finally the paper analyzes the results by looking at a distributed grep and a sort. We see the rate of computation increase as more computers are assigned to a computation, followed by a decrease as the tasks near completion. 
They also tested failure handling by intentionally killing nodes in the system, and analyzed how backup tasks affected the system (namely, without them there is a long tail of near inactivity before completion due to stragglers). I thought the ideas presented in this paper were very interesting. I liked how they integrated this system with GFS, which is seemingly the storage analog, and how they used locality to optimize network bandwidth usage. I did not like how the paper had seemingly inconsistent headers in the section on refinements, but this is minor: it just seemed to me that some headers were pieces of code, like the "functions", while others were features, like "ordering guarantees". |
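The backup-task behavior analyzed in this review boils down to: near the end of the job, launch a duplicate of each still-in-progress task and accept whichever copy finishes first. A small illustrative simulation (the timings, names, and thread-pool setup are assumptions, not the paper's implementation):

```python
import concurrent.futures as cf
import random
import time

# A "straggler" task occasionally takes much longer than its peers.
def flaky_task(task_id):
    time.sleep(random.choice([0.01, 0.01, 0.5]))
    return task_id

# Near job completion, submit a speculative duplicate of each remaining
# task and keep whichever copy finishes first; the slower copy is ignored.
def run_with_backup(task_id, pool):
    primary = pool.submit(flaky_task, task_id)
    backup = pool.submit(flaky_task, task_id)
    done, _ = cf.wait([primary, backup], return_when=cf.FIRST_COMPLETED)
    return done.pop().result()

if __name__ == "__main__":
    with cf.ThreadPoolExecutor(max_workers=8) as pool:
        print([run_with_backup(i, pool) for i in range(4)])  # [0, 1, 2, 3]
```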
MapReduce: This paper introduces MapReduce, one of the great products created by Google. It is an abstract model specifically designed for dealing with huge amounts of computation, data, programs, logs, etc. The underlying idea of MapReduce is to use this abstraction to implement very hard tasks by only describing what we want to achieve, without worrying about parallel computing, fault tolerance, load balancing, and so on, because these are encapsulated in a library; this makes large-scale data computation really easy to implement. The basic principle of MapReduce is that the computation takes a set of input key/value pairs and produces a set of output key/value pairs. The main contributions of this paper, according to the authors, are: 1. MapReduce encapsulates parallel processing, fault tolerance, locality optimization, and load balancing, which makes it easy to use for people without experience in parallel and distributed systems. 2. MapReduce can easily express many tough problems such as large-scale data merging, sorting, graph mining, and machine learning. 3. MapReduce was implemented and deployed on a big cluster, so people can leverage Google's computing resources, and it makes it easier to access and use these resources. Furthermore, another meaningful contribution of MapReduce is that it showed that restricting the programming model makes it easy to realize parallel and distributed computing, and that network bandwidth is a limited resource that needs to be used in an optimized way; MapReduce is a real implementation of these ideas. Advantages of MapReduce: 1. It is easy to understand: you can realize parallel and distributed computing by merely implementing an interface, and the program you write will run on a bunch of cheap machines; MapReduce hides most of the difficult parts. 2. Good extensibility: when computing resources are scarce, you can add more cheap machines, which is key to MapReduce's ability to process huge amounts of data. 3. High fault tolerance: even if one of the machines crashes, the system will transfer the job to other machines, ensuring the job does not fail when some machines fail. As for drawbacks, MapReduce is not good at real-time computing, stream processing, or dependency-graph computation, because it was not designed that way; it cannot efficiently process data that requires many iterations. This, I think, may be a direction for improving MapReduce. |
This paper proposed a distributed data processing approach, MapReduce. The purpose of this project is to process large amounts of data in a fast and efficient way. MapReduce is similar to Condor in that users submit jobs, supplying a Map and a Reduce function to be applied to a large data set, and the system handles all fault tolerance, scheduling, and distribution. The machines in a MapReduce system are divided into categories. The first is the Master, which is responsible for scheduling, much like the foreman of a construction site. The second is the Worker, the one who does the work; workers are further divided into Mappers and Reducers. Suppose we have a huge data set with a large number of elements, M of them, and each element needs to be processed by the same function. The Master divides the input into a number of small pieces and assigns each one to a Mapper. The Mapper does the work (executes the function) and passes the results of its work to the Reducers. The Reducers then summarize the results of each Mapper to get the answer to the final task. Of course, this is the simplest description; in reality the Master's task assignment is very complicated, as it has to consider how long tasks take, whether a task has failed, the burden of network communication, and many other issues not described here. The contribution of this paper mainly lies in proposing the generic idea of map and reduce for solving a parallelizable problem in two stages. Second, the MapReduce system imposes no requirements on the input data format, which lets it handle very different applications. Third, the implementation is fault tolerant: the master detects worker failures and schedules re-execution of those tasks, and the system is resilient to large-scale worker failures. One weak point I find is that when a very slow node is in the network and still nominally working, it can slow down the whole MapReduce process. |
MapReduce is a programming model that makes it easy to process large datasets using a distributed system. There are two phases to MapReduce, appropriately named "map" and "reduce." In the map phase, the input data is read and some computation is performed, producing an intermediate output of key-value pairs. This computation is spread across many nodes. In the reduce phase, the data is brought together to form the final output; all of the values for a key are brought together in the reduce phase. There is also an optional "combiner" phase. The classic example used for MapReduce is wordcount. Map functions read words from files and emit (word, 1) whenever a word is seen. Then, the reduce function gets some number of (word, 1) pairs for each word, which it can combine to get the final count. I have enough experience with Hadoop that I already had some idea about optimizations for the number of mappers and reducers, but I found the information on it interesting. The idea is that there should be significantly more of each than the number of available machines, which helps with load balancing. There is also some information on backup tasks, which essentially run backups of remaining in-progress tasks when the operation is close to being completed. This helps in cases where there may be problems with the machine that the task has been running on. It's a simple and easy way to improve performance. The main technical contribution of this paper on a high level is that it provides a framework to spread data processing across multiple threads/machines without needing to handle the difficulties that go along with that (parallelization, fault tolerance, locality, load balancing). I believe that MapReduce truly was the start of an era in data processing, and it is still used heavily today in its many forms. The paper does a very good job focusing on what is important to understanding MapReduce. Instead of focusing on systems like GFS, which is quite important to MapReduce, the paper quickly mentions it and moves on. They do the same when referring to the cluster management system. I think that this keeps the focus of the reader on what is really important and novel in the paper. I thought that this was a really good paper, but there were some weaknesses. One was that the authors only gave positive examples of problems that work well with MapReduce. The paper doesn't give any examples of the limitations of the MapReduce framework, or what it can't do well. I've noted in industry that there is some severe overuse of Hadoop for simple problems - for instance, I've seen someone use a single mapper, which really doesn't make much sense. Therefore, I think this kind of negative example is important. One other thing that I thought was odd was that the results section mentioned that 1-1.5GB of memory on each machine in the cluster was reserved for other tasks. I understand wanting to test against realistic environments, but it seemed like this would make it harder to reproduce results. |
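For reference, here is the word-count example from this review written in the shape of the paper's pseudocode, with Python generators standing in for the library's EmitIntermediate/Emit calls; the grouping step at the bottom is what the MapReduce library would normally do between the phases.

```python
from collections import defaultdict

# Map: emit (word, 1) for every word in the document's contents.
def word_count_map(document_name, contents):
    for word in contents.split():
        yield (word, 1)

# Reduce: sum all the counts emitted for one word.
def word_count_reduce(word, counts):
    yield (word, sum(counts))

if __name__ == "__main__":
    # Grouping step the MapReduce library would normally perform.
    groups = defaultdict(list)
    for word, count in word_count_map("doc", "to be or not to be"):
        groups[word].append(count)
    totals = [out for w, counts in groups.items() for out in word_count_reduce(w, counts)]
    print(totals)  # [('to', 2), ('be', 2), ('or', 1), ('not', 1)]
```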
This paper introduces MapReduce, a programming model and an associated implementation for processing and generating large data sets on large clusters. This programming model is successful for three reasons. First, it hides the messy details of parallelization, fault tolerance, data distribution, and load balancing; engineers who have no experience with parallel and distributed systems can use it to process a large amount of data. Second, a large variety of problems can be expressed in the programming model, such as sorting, grep, word count, etc. Finally, the implementation scales to large clusters with thousands of machines, which is a typical setting in an industry environment. At the core of this programming model are two functions: Map and Reduce. The Map function takes an input key/value pair and produces a set of intermediate key/value pairs. These intermediate pairs are grouped together based on the key and sent to the Reduce function. The Reduce function accepts a key and all the values corresponding to it, and merges these values to form a possibly smaller set of output values. The whole system works as follows. Once the MapReduce program starts, it splits the input files into M pieces, each corresponding to a map task. Then it starts up a master server and many worker servers. The master assigns the M map tasks and R reduce tasks to idle workers. A worker that picks up a map task parses the assigned input split, calls the user-defined map function for each key/value pair extracted, and stores R partitioned results (through a partition function) locally. To execute a reduce task, the worker communicates with the master to figure out the location of the intermediate results and fetches them into local memory (or storage). It then sorts the keys (since multiple keys may be in the same partition) and executes the reduce function for each key. Finally, each reduce task produces a result file. Fault tolerance for worker failure is provided by the master maintaining the status of each map and reduce task and by the heartbeat messages between the master and workers. Once the master finds that a worker has failed, it restarts the tasks assigned to it on another worker. However, there is no protection against master failure. Other details include locality, where the master always tries to start a map task on the worker where the input files are stored, and backup tasks for preventing "straggler" machines from slowing down the whole MapReduce job, which is done by scheduling backup executions of the remaining in-progress tasks when a MapReduce job is close to completion. One major drawback I found is that there is no protection against master failure. Although the authors argue that there is only one master server and it is unlikely to fail during the job, I don't think that holds for long-running jobs on commodity hardware. |
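The worker-failure handling described in this review amounts to simple bookkeeping on the master: track a state and an assigned worker per task, and when a worker stops answering pings, mark its in-progress tasks, plus its completed map tasks (whose output sits on that worker's local disk), as idle again. A hedged sketch with invented data structures and an arbitrary timeout:

```python
import time

HEARTBEAT_TIMEOUT = 10.0  # seconds without a ping reply (arbitrary choice)

# Reset tasks owned by dead workers: in-progress tasks of any kind, plus
# completed map tasks, because their output lives on the lost local disk.
def check_workers(tasks, last_ping, now=None):
    now = time.time() if now is None else now
    dead = {w for w, seen in last_ping.items() if now - seen > HEARTBEAT_TIMEOUT}
    for task in tasks:
        lost_output = task["state"] == "completed" and task["kind"] == "map"
        if task["worker"] in dead and (task["state"] == "in-progress" or lost_output):
            task["state"], task["worker"] = "idle", None  # reschedule later
    return dead

if __name__ == "__main__":
    tasks = [{"kind": "map", "state": "completed", "worker": "w1"},
             {"kind": "reduce", "state": "in-progress", "worker": "w2"}]
    last_ping = {"w1": time.time() - 60, "w2": time.time()}
    print(check_workers(tasks, last_ping))  # {'w1'}
    print(tasks)  # the w1 map task is back to idle
```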
In the paper "MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat discuss their implementation of MapReduce for Google workloads. At Google, there have been many special-purpose implementations of computations that process large amounts of raw data (terabytes or petabytes). These often require thousands of machines and complex parallelized code to finish within a reasonable time. In reaction to this, Google created an abstraction that hides the messiness of parallelization, fault tolerance, data distribution, and load balancing, using map and reduce as inspiration. Most of the computations used to derive processed data involve applying a map operation to records and applying a reduce operation to all values that share the same key. The functional style is thus "auto-parallelized" and can be executed on a large cluster of commodity hardware. When we realize that this means programmers do not need deep knowledge of parallel and distributed systems to utilize those resources, it becomes clear that this is a problem worth addressing. Map and Reduce, which are both written by the user, work in tandem with each other. Map takes an input pair and produces a set of intermediate key/value pairs. All values associated with a particular intermediate key are passed to Reduce. Reduce takes the intermediate key and the set of values for that key and merges them to form a smaller set of values. These values are supplied to the user's reduce function via an iterator, which lets it handle lists that are too large to fit in memory. One thing to note is that even though the input keys and values come from a different domain than the output, the intermediate keys and values are from the same domain as the output, which is what lets Map and Reduce compose cleanly. The rest of the paper talks about the different mechanisms of the system: 1) Execution overview: the input data is split into M pieces, and the master tells workers what to do. Map workers create intermediate key/value pairs, partitioned into R pieces. Reduce workers read the buffered data, sort it, and produce the output of the reduce function. Lastly, the master wakes up the user program and the MapReduce call returns. 2) Fault tolerance: workers are pinged periodically, and if they don't respond they are presumed dead and their tasks are rescheduled; if the master fails, the current implementation simply aborts the computation (the paper notes it would be easy to recover from periodic checkpoints instead). 3) Task granularity: M and R are tuning parameters used to control performance. 4) Backup tasks: some machines are slow and become bottlenecks; when the job nears completion, the master schedules backup executions of the remaining in-progress tasks on other machines. 5) Partitioning & ordering guarantees: partitioning is done with a hash (mod R), and keys are processed in sorted order within each partition, which makes it easy to produce a sorted output file per partition. 6) Skipping bad records: bugs in user code can cause deterministic crashes on particular records; the master keeps track of such occurrences and tells workers to skip a record that has failed before. 7) Status info: metadata about all processes, failed workers, and all tasks is available for debugging purposes. With all the benefits gained from this system, there are still some drawbacks. The first comes up when they discuss the performance of MapReduce by processing 1 TB of data. The problem is that there is no benchmark or standard to test against; they simply show the run times of specific workloads with different parameters. Thus, it is quite hard to reason about why anyone would adopt this for performance reasons. 
Second, I believe the data that shows them using MapReduce at Google is a bit contrived. It hints at an exponential growth rate for using MapReduce, but I believe that is simply because it was new at the time and started picking up popularity soon after. Lastly, I believe they could have mentioned previous works earlier and used them as a guideline for creating MapReduce. |
This paper describes the MapReduce programming model, used for effectively computing simple operations (such as aggregation) on extremely large datasets. MapReduce programming is relatively simple, and is generalizable to the point where it can easily be executed in highly parallel systems. MapReduce operates on key-value pairs, and is split into two main operations. The Map function takes as input a key-value pair, and outputs an intermediate key-value representation that the Reduce function can use. The Reduce function takes a key, and a set of associated values, and combines them. It then outputs the single combined value. Every element of the input set is passed into the mapper. The outputs of the mapper and reducer work as valid inputs for the reducer, so the reducer can be called to combine sets of values with related keys until each key has a single value, created from aggregating all input values with that key. Importantly, each tuple can be mapped independent of all others, and each group of values can be reduced independent of all others. As such, every map and reduce operation can be done in parallel with any others. For any MapReduce job, the user selects the number of mapping jobs to be done, and the number of reducing jobs. Usually, these numbers should be greater than the number of possible workers, so that workloads can be balanced more easily. A record is kept of the progress of each job, so if any worker fails a job, the job can be redone by any other worker. If a certain record constantly causes failures, then MapReduce can skip that record if perfect accuracy isn’t required. One of the most important parts of MapReduce is the separation of the logic of the MapReduce job that has to be executed from the parallelization of the job. This can make programming far easier, as the programmer doesn’t need to worry about the parallelization aspects. In addition, debugging can be much easier, as it’s simple to run all of the MapReduce tasks serially instead of in parallel. One possible downside that the paper doesn’t mention, however, is how optimization may conflict with MapReduce’s generality. The parallelization of MapReduce is based entirely on the Map and Reduce functions themselves, so the parallel execution is the same regardless of these functions. However, it may be that optimal execution of the MapReduce plan would be significantly changed based on what the functions themselves actually do. |
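The bad-record skipping mentioned in this review can be sketched as a counter on the master side: a record is skipped once it has crashed user code more than once. In the paper this is done with a signal handler and a "last gasp" UDP message to the master; the exception-based version below is only an approximation of that idea, with invented names.

```python
from collections import Counter

failures = Counter()  # master-side tally: record offset -> observed crashes

def map_with_skipping(map_fn, records):
    output = []
    for offset, record in enumerate(records):
        if failures[offset] > 1:
            continue  # the master has seen this record crash before: skip it
        try:
            output.extend(map_fn(record))
        except Exception:
            failures[offset] += 1  # report the crash; the task will be retried
            raise
    return output

def parse(record):
    if record == "oops":
        raise ValueError("deterministic bug triggered by this record")
    return [(record, 1)]

if __name__ == "__main__":
    records = ["a", "oops", "b"]
    for attempt in range(3):  # stand-in for the master re-running the task
        try:
            print(map_with_skipping(parse, records))  # [('a', 1), ('b', 1)]
            break
        except ValueError:
            pass
```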
The paper presents MapReduce, a programming model for processing huge amounts of data in parallel on cheap, commodity machines. Tons of data are periodically processed to generate different kinds of statistics or to perform other computations. The inherent possibility of partitioning this data enables systems to perform tasks in parallel on large clusters. The authors indicate that there are many such data sets and associated operations within Google, which motivated the design and implementation of MapReduce. The input and output are both sets of key/value pairs. The user provides a map and a reduce function. The map function produces a set of intermediate results, which are then combined by the reduce function to produce the final result expected by the user. One instance of the user program acts as a master, which creates map and reduce workers as needed. The master also acts as the intermediary for communication between the map and reduce workers. The intermediate results are stored locally on disk, while the final results are stored in GFS. The MapReduce framework depends on two systems for proper operation: a distributed file system for storing input and output data from MapReduce programs, and a scheduling system for managing a cluster of machines. The contributions of the paper are: (1) It recognizes that many problems share a common shape: work on an input that can be represented as a set of (key, value) pairs, then reorganize the input by manipulating the keys and values. In a functional language such as Lisp, these can be naturally modeled as map and reduce functions, so a parallel programming model can be provided to parallelize such calculations. (2) It keeps the complexity of fault tolerance away from the programmer: the runtime monitors the status of worker nodes and re-executes a task if a node fails to respond within a period of time. (3) It takes advantage of locality: the scheduler tries to keep the input files local to the mapper and the intermediate results local to the reducer. Some weak points may be the safety issue of the single master and the limited flexibility for workloads that MapReduce cannot handle. |
This paper introduces MapReduce, a programming model and an associated implementation for processing and generating large data sets. The basic idea of this model is that users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. This programming model is especially well suited to parallel, distributed computing. The major contribution of this work is a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations. The paper starts off by giving examples to illustrate how MapReduce works. Basically, MapReduce has two key components: Map and Reduce. Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function. The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key, and merges these values to form a possibly smaller set of values. With these two components, the workload can be distributed over many parallel machines, which saves a lot of computation time. The paper then goes on to explain how MapReduce is implemented. The basic execution workflow is as follows: the MapReduce library in the user program first splits the input files into M pieces; the master picks idle workers and assigns each one a map task or a reduce task; a worker assigned a map task reads the contents of the corresponding input split; when a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together; the reduce worker then iterates over the sorted intermediate data and applies the Reduce function to each unique intermediate key encountered; finally, the output of the Reduce function is appended to a final output file for that reduce partition. For fault tolerance, different components take different approaches. To handle worker failure, the master node pings worker nodes periodically. For master failure, the paper suggests periodic checkpoints, though the current implementation simply aborts the computation. For task granularity, the map phase is subdivided into M pieces and the reduce phase into R pieces, as described above. Ideally, M and R should be much larger than the number of worker machines: having each worker perform many different tasks improves dynamic load balancing and also speeds up recovery when a worker fails. The paper then introduces a few extensions as refinements. For evaluation, the paper assesses two key workloads: grep and sort. One advantage of this paper is that it is easy to understand: before going into technical details, it presents several MapReduce examples to illustrate how Map and Reduce work, and these examples are clearer than a purely verbal explanation. The disadvantage is that the paper does not contain experimental data comparing MapReduce against other distributed programming models; such experiments could further show how useful MapReduce is. |
"MapReduce: Simplified Data Processing on Large Clusters", OSDI 2004, by Jeffrey Dean and Sanjay Ghemawat describes a new technique for completing large-scale data processing tasks which is easy for developers to use and addresses fault tolerance concerns. For a particular task, a developer writes a "Map" function for mapping an input (key, value) to an intermediate (key, value), performing the computation for each particular input item. The developer also writes a "Reduce" function, which takes all the intermediate (key, value) items with the same key and merges the corresponding values together; the merge may count the values, sum them, simply collect them into a single list, or do something else entirely. These Map and Reduce functions are applied to the data in a master/worker architecture. As worker machines become available, the master assigns Map tasks to workers, providing a location reference to the data for each Map task; in many cases the data will already exist on that worker, but sometimes a worker will need to read the appropriate data from another machine. After a worker has completed a Map task, it writes the intermediate results to its local disk, partitioned into the appropriate R splits for Reduce. As the necessary intermediate data become available, the master assigns workers Reduce tasks, telling each worker where the relevant data lives; workers write output data to files. The master keeps track of whether worker machines have failed, and if they have, reschedules Map and Reduce tasks appropriately. The paper also discusses other ways to improve performance or tailor MapReduce to particular needs: "backup" executions to work around straggler machines, custom partition functions, guarantees on key ordering, combiner functions for pre-processing intermediate data (to reduce the amount of data sent over the network), and more. The paper also presents performance evaluations of MapReduce for Grep and Sort, where the authors show the data transfer rate over time and how backup tasks and machine failures affect performance. Finally the paper describes the use of MapReduce at Google, and related and future work in large-scale, parallel, distributed systems for data processing. The paper does a good job explaining the MapReduce paradigm by first explaining the data processing as if it were on a single machine (sections 2.1 and 2.2), and then later in section 3 explaining the master/worker architecture for assigning tasks (including data references), completing tasks, and communicating status, as well as for addressing challenges like machine failures and conserving network bandwidth. I also appreciate that the authors test MapReduce at Google on real-world, large-scale, distributed data processing problems; this demonstrates its usefulness and scalability. The related work section compares and distinguishes MapReduce from prior work, but I wish the abstract and introduction concisely described these similarities and differences as well. That would make it easier to understand the research contributions of the paper. |
This paper proposed MapReduce, a programming model for processing large data sets over several (or many) machines. The project was inspired by the workflow inside Google: they have hundreds of special-purpose computations that are conceptually straightforward but run over so much input data that they cannot finish in an acceptable time unless distributed across hundreds or thousands of machines. The main contribution of the paper is to provide a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations; such an abstraction relieves engineers from having to consider all the issues of parallelization, fault tolerance, data distribution, etc. The strong part of the paper is that it gives the whole workflow and model design of MapReduce. Examples are given to help illustrate the programming model, which centers on the "map" and "reduce" functions. I really like the examples in Sec 2, since they give me an overview of how MapReduce works and what we can do with it. Also, the implementation details in Sec 3, which take Google's own environment into account, seem reasonable. In particular, the fault tolerance sub-section is a detailed description of how to deal with faults, which are almost certain to appear in a large data computation over hundreds of machines. The weak part of the paper is that I doubt whether the idea of MapReduce is novel enough. As the paper itself says, map and reduce functions are widely used in functional programming languages like Lisp, and another database vendor, Teradata, seems to have used similar techniques years earlier. |
In this paper, the authors introduce a novel programming model called MapReduce which is used to process and generate large data sets. The problem they are solving is to design and implement a programming model that can be used to process big data. This problem is important because many modern databases are distributed and parallel on clusters, so it is convenient to provide a system that takes care of the details of the work, such as partitioning the input data, scheduling the program's execution, handling machine failures, and managing inter-machine communication. MapReduce was designed to simplify the task of writing code for parallel execution in a distributed system, which may otherwise be tedious for programmers with no experience in such a framework. Next, I will summarize the critical points of MapReduce as I understand them. MapReduce can be regarded as a simple model that takes a set of input key/value pairs and produces output key/value pairs. The key components of MapReduce are the Map and Reduce functions. All intermediate pairs that share a key are sent to the same Reduce invocation, which converts these intermediate pairs into a set of output values that are often much smaller than the input. MapReduce uses a master data structure in which the master keeps several pieces of information about the cluster, including the state of each task and the identity of the worker machine. Because the data is distributed across many different machines, the library has to tolerate machine failures gracefully, both worker failure and master failure. The fault tolerance model for MapReduce is not very complex. The master stores the machine ID for each Map and Reduce task as well as its state, which might be idle, in-progress, or completed, and it pings each worker frequently. If the master fails, it would be possible to log its progress and restart from where it died, but they did not do that in their initial implementation; they just abort the job and let the user restart it, because there is only one master and its chance of failure is low. For model refinement, the paper talks about the partitioning function, ordering guarantees, local execution, and so on. From the experiments, we find that the performance of MapReduce is impressive. Generally speaking, this is one of the most famous ideas proposed by Google and it is definitely a great paper with great impact. The MapReduce programming model has been adopted by many distributed computing frameworks and contributes to analytical tasks, graph processing, machine learning, and so on. MapReduce is simple but powerful: it enables automatic parallelization and distribution of large-scale computations, and the implementation of the MapReduce interface achieves very high performance on large clusters of commodity PCs. Besides, this paper shows several different ways to use the MapReduce model to implement various functionalities. MapReduce contains a robust fault tolerance mechanism based on monitoring. As a programming model, MapReduce also features high flexibility: one can design any kind of map and reduce functions to perform various analysis tasks, and a pipeline of map and reduce functions can be constructed to achieve complicated tasks. I think the downsides of this paper are minor. 
I think they could have spent more time explaining bad records; I wonder whether there is any detection for when the number of bad records becomes large enough to influence the overall computation result. Also, the performance evaluation is a little short: it only compares MapReduce results under different configurations, and it does not compare against other models. I think a comparison with some other distributed processing systems would make the performance of their product more convincing. I have used MapReduce before, for graph processing, and a potential problem I found is that it performs a lot of I/O: the intermediate results are always stored on disk and then read by the next round. I think introducing some compression mechanism could reduce the heavy I/O from frequent reads and writes. |
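The compression idea suggested in this review (it is not something the paper implements) could look roughly like this: have map workers write their intermediate partition files through gzip so that iterative jobs pay less disk and network I/O. The file layout and names here are made up for illustration.

```python
import gzip
import json

# Map side: spill an intermediate partition to local disk, compressed.
def write_partition(path, pairs):
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for key, value in pairs:
            f.write(json.dumps([key, value]) + "\n")

# Reduce side: read the compressed partition back as (key, value) tuples.
def read_partition(path):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [tuple(json.loads(line)) for line in f]

if __name__ == "__main__":
    write_partition("part-00000.gz", [("the", 1), ("cat", 1)])
    print(read_partition("part-00000.gz"))  # [('the', 1), ('cat', 1)]
```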
The paper introduces MapReduce, a framework developed by Google to parallelize data processing tasks in a way that is easy for developers to use for their own tasks. Its major features are: 1. A distributed architecture that consists of a master node and worker nodes. The master coordinates work across the worker nodes and checks on them via heartbeat messages. When a worker fails, the tasks completed by that worker are rolled back and completed by another machine. When the master dies, the MapReduce operation is aborted, as this situation is rare. The master is also responsible for replicating the work of final "straggler" machines to make sure they do not drag the performance of the system down. 2. A "Map" operation, which is defined by the user. The map operation takes key-value pairs and outputs intermediate key-value pairs. The input for the map is partitioned by MapReduce across lots of commodity worker machines. The output (intermediate key-value pairs) is then written to intermediate files. 3. A "Reduce" operation, which reads its input from the intermediate files and is parallelized similarly to the Map operation. The Reduce operation merges values with the same intermediate key so that it can perform its operation on them (especially useful for tasks like counting the number of occurrences of X, etc.). The main contribution is the MapReduce framework, and while one of its main advantages is obviously its parallelization ability and performance, I think the biggest strength is the fact that it allows developers to decouple parallelization from the rest of their code. The fact that developers only have to focus on writing Map / Reduce operations (plus setting optional tuning parameters) is a major advantage. I also think the paper was very well written and easy to understand at a high level while still providing several insights on implementation details. At a high level, I don't think there are many disadvantages to MapReduce as a concept. One thing that is important to note is that this "version" of MapReduce is definitely suited to Google's environment (i.e., lots of commodity machines rather than a small number of shared-memory machines); a separate version of MapReduce could possibly handle other environments better. Also, MapReduce can only be used for certain types of problems: it can't parallelize something if it can't be expressed as a "Map" and "Reduce" operation. |