Programmers and data analysts working with big datasets often need to perform simple operations such as filtering or counting occurrences. These operations can be formulated as a map step, in which a function is applied to each tuple or line in a large distributed array or file, followed by a reduce step, which aggregates the map's result tuples. For example, grep-ing a large file for lines containing a target string can be run as a filtering map step followed by an identity reduce. Previously, it was difficult for programmers to build such programs quickly, because they had to write their own handlers for load balancing, partitioning, and fault tolerance.
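As a toy illustration of that framing, grep can be expressed as a filtering map plus an identity reduce. The sketch below is a single-process Python analogy, not the distributed implementation; the `target` string and log lines are invented for the example:

```python
# Grep as MapReduce: the map step filters lines, and the reduce step is
# the identity, simply passing matched lines through to the output.
def map_fn(line, target):
    # Emit the line (as an intermediate key) only if it contains the target.
    return [(line, None)] if target in line else []

def reduce_fn(key, values):
    return key  # identity: matched lines are copied to the output

def grep_mapreduce(lines, target):
    # In the real system the map calls run on many machines in parallel.
    intermediate = [pair for line in lines for pair in map_fn(line, target)]
    return [reduce_fn(key, [value]) for key, value in intermediate]

lines = ["error: disk full", "all good", "fatal error"]
matches = grep_mapreduce(lines, "error")
# matches == ["error: disk full", "fatal error"]
```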
The MapReduce framework allows programmers to worry only about formulating their programs as map and reduce operations, letting the framework do the heavy lifting of splitting the load over a distributed file system, handling failed nodes, and collecting results. A cluster of several hundred nodes can perform an operation like sort or grep on a terabyte of data or more in just a few minutes, most of which is spent reading and shipping the data over GFS or HDFS. Programmers can develop and debug such distributed programs more easily, as the framework handles parallelism and distributed execution details for them.
The MapReduce system begins by breaking input data into pieces of roughly 64 MB, of the same order as chunks in Google File System, which MapReduce runs on top of. One master node assigns workers to perform the map operation on pieces of the data. As each worker produces map outputs, it partitions them into one region per reduce node and writes them to disk. When the worker finishes, it reports the locations of each partition to the master. The master keeps track of how long each map job takes, and reassigns jobs that do not complete quickly. The master informs each reduce node of the locations of its partition's data on the input worker servers' disks. The reduce nodes collect this data, reduce it, and report the solution's location on their own disks.
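The partitioning step described above is typically a deterministic hash of the intermediate key, so every map worker sends pairs with the same key to the same reduce region. A minimal sketch, where R = 4 and the modulo-hash partitioner follow the paper's default of hash(key) mod R:

```python
# Sketch of how a map worker assigns intermediate pairs to reduce regions.
R = 4  # number of reduce partitions (illustrative choice)

def partition(key, num_reducers=R):
    """Deterministically map an intermediate key to a reduce region."""
    return hash(key) % num_reducers

# Because the partitioner is deterministic, each reduce node ends up
# seeing all values for the keys it owns, regardless of which map
# worker produced them.
regions = {r: [] for r in range(R)}
for key, value in [("apple", 1), ("banana", 1), ("apple", 1)]:
    regions[partition(key)].append((key, value))
```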
One limitation of MapReduce is that, as noted in the Spark paper, many useful algorithms cannot be expressed as acyclic data flows. For example, iterative algorithms require multiple passes over the same data, which means running multiple MapReduce jobs in sequence, each with its own read and write I/O steps.
This paper focuses on MapReduce, a technique to simplify data processing on large clusters of machines. Its motivation is large-scale data processing that takes raw data such as crawled documents and web request logs as input, and produces outputs such as inverted indices, web-page graph structure, and the top queries in a day. Such input is usually very large and difficult to deal with. To deal with it, MapReduce was devised to provide parallelization, data distribution, fault tolerance, and load balancing.
MapReduce makes use of a set of key/value pairs for input and output. The map function processes the input key/value pairs to generate intermediate pairs. The reduce function takes all the intermediate values for a particular key and produces a set of merged output values. This can be used for various programs, such as distributed sorting or building an inverted index. Parallelization is carried out by partitioning the input into equal-sized splits and partitioning the intermediate key space into pieces as well. Fault tolerance is achieved by re-executing both completed and in-progress map tasks on failed workers, while only in-progress reduce tasks need re-execution. Refinements such as backup executions of slow in-progress tasks, partial merging of intermediate data at the map worker to reduce data sent over the network, and skipping of bad records are also provided.
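The key/value pattern described here can be sketched with the canonical word-count example. This is a single-process analogy in Python, with the distributed shuffle collapsed into an in-memory dictionary:

```python
# Word count as MapReduce: map emits ("word", 1) pairs, the shuffle
# groups values by key, and reduce sums the counts per key.
from collections import defaultdict

def map_fn(document):
    # Emit an intermediate (word, 1) pair for each word in the document.
    return [(word, 1) for word in document.split()]

def reduce_fn(key, values):
    # Merge all intermediate values for one key into a single count.
    return key, sum(values)

def map_reduce(documents):
    intermediate = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            intermediate[key].append(value)  # stand-in for the shuffle step
    return dict(reduce_fn(k, vs) for k, vs in intermediate.items())

counts = map_reduce(["the quick fox", "the lazy dog"])
# counts == {"the": 2, "quick": 1, "fox": 1, "lazy": 1, "dog": 1}
```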
The paper is successful in highlighting the useful abstraction of MapReduce. It greatly simplifies large-scale computations and is in use by large corporations such as Google. It lets programmers focus on the problem while the library deals with the messy details. The argument is sufficiently supported by evaluations and benchmarks on various programs.
MapReduce is closed-source within Google and is written in C++. The need for an open-source, Java-based rewrite is real but is not addressed in the paper.
What is the problem addressed?
With the rising interest in parallel computing, parallelizing and executing programs on a large cluster has become a real problem. Most such computations are conceptually straightforward. However, the input data is usually large, and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code. The authors designed a new abstraction, inspired by functional languages, that allows them to express these simple computations without the messy details of parallelization.
The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.
1-2 main technical contributions? Describe.
MapReduce first splits the input file into M pieces and starts up many copies of the program on a cluster of machines. One copy of the program is the master, and the rest are workers that are assigned work by the master. The master picks idle workers and assigns each one a map task or a reduce task. A worker assigned a map task reads the contents of the corresponding input split, and the intermediate pairs are buffered in memory. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.
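The reduce-side sorting and grouping described above can be sketched as follows; this is a simplified in-memory stand-in for what a reduce worker does with intermediate data fetched from the map workers' disks:

```python
# A reduce worker sorts the fetched intermediate pairs, then walks each
# run of equal keys, calling the user's Reduce function once per key.
from itertools import groupby
from operator import itemgetter

def reduce_worker(fetched_pairs, reduce_fn):
    """Sort intermediate pairs by key and apply reduce_fn per unique key."""
    output = []
    for key, group in groupby(sorted(fetched_pairs), key=itemgetter(0)):
        values = [v for _, v in group]
        # In the real system this is appended to the partition's output file.
        output.append(reduce_fn(key, values))
    return output

pairs = [("b", 1), ("a", 1), ("b", 1)]
result = reduce_worker(pairs, lambda k, vs: (k, sum(vs)))
# result == [("a", 1), ("b", 2)]
```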
1-2 weaknesses or open questions? Describe and discuss.
I strongly doubt that MapReduce is Turing complete; it can be applied only to a restricted breadth of problems. Moreover, those problems generally aren't hard to parallelize in the first place, which somewhat undercuts the case for proposing a whole new programming model.
This paper presents a new programming model called MapReduce, together with an associated implementation, for processing and generating large data sets. In short, MapReduce takes a set of input key/value pairs and produces a set of output key/value pairs through the Map and Reduce functions respectively. Many real-life tasks can be expressed in this model, which makes it very useful for programs that are automatically parallelized and executed on a large cluster of commodity machines. This paper first gives an overview of the MapReduce model, as well as some examples of how to make use of it. Second, it gives the implementation and several useful refinements. Then it presents performance measurements for several tasks, as well as how MapReduce is used within Google. Finally, it covers related and future work.
The problem here is that many tasks need to be parallelized or executed on thousands of machines (a distributed system), but they are not easy to implement that way and require a lot of experience with parallelization, fault tolerance, locality optimization, and load balancing. Therefore, a good programming model for writing parallel programs is needed. MapReduce solves this problem and makes things easy for programmers: they can just fill in the Map and Reduce functions, and the system takes care of the rest.
The major contribution of the paper is a model that lets programmers write parallel programs without prior experience with parallel and distributed systems. It is also encouraging that many real-life tasks can be expressed in the MapReduce model. In fact, many tasks within Google use MapReduce, and it really improves performance and efficiency, including generation of data for Google’s production web search service, sorting, data mining, machine learning, and so on. One weakness of this paper is that it does not provide enough performance evaluation. It would be better if the authors provided more performance tests on several different workloads.
One interesting observation: this paper is very intuitive, and the idea behind this model is very easy: one function to produce the intermediate values associated with the same intermediate key, another function to take intermediate key/value pairs and output final key/value pair. They divide the tasks and make it easier to run in parallel and distributed systems. This idea can be applied to other problems too.
In this paper, the authors introduce MapReduce, a programming model and associated library for processing and generating large data sets, especially on parallel/distributed systems. Generally, the MapReduce model is a two-step function: map and reduce. The Map function processes a set of key-value pairs (the raw data) to generate an intermediate set of key-value pairs. The Reduce function merges all the intermediate values for each key according to the user's program.
While the procedure of MapReduce looks simple, several important issues remain to be pointed out. Since this model targets a large cluster of machines, errors and breakdowns of some units are unavoidable. Therefore, MapReduce defines several rules for handling errors at different levels.
Since being introduced by Google in 2004, MapReduce gained lots of popularity and nowadays we see it in almost every tool and software related to big data. I've used MapReduce model many times in Hadoop and Spark and based on my experience the most important feature of this programming model is that MapReduce is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault-tolerance, locality optimization, and load balancing.
This paper discusses a framework that has since become synonymous with "big data" and "parallel computing" - MapReduce. This paper by Google led to the creation of Hadoop, HDFS, and the current big data explosion. MapReduce is a programming model built around the Lisp primitives *map* and *reduce*. Developers implement these two functions (perhaps several times, depending on the computation they are trying to perform), and the framework takes care of everything else. The map function takes an input set of key/value pairs and returns a new set of key/value pairs. The reduce function accepts all the key/value pairs with the same key and performs some computation on that set.
The paper spends a good amount of time discussing what "everything else" is, exactly. The first goal of the framework is to provide automatic parallelization and distribution of large-scale computations. It does this by distributing the map and reduce function execution to machines based on key value. The system also seeks to take care of hardware failures, as well as slow computations, "stragglers", both of which occur frequently since the framework is designed to run on large numbers of commodity machines. As the computation looks like it is coming to a close, the MapReduce system will start additional copies of jobs that are still running. The idea is that if a machine is still running a job when most other machines have finished, then it is a slow machine, and the faster machines are now idle - so we schedule the job again and take whichever copy finishes first.
The paper tries to present a framework that can be used to easily and quickly program jobs to run on a massively parallel distributed system. In this, it was very successful - these ideas are used in most large companies today. It almost seems to have become too successful - companies seem to be trying to put everything on Hadoop/MapReduce, even when there are better solutions to the problem. Because it allows so much parallelization, blatantly inefficient brute-force computations are possible, even if they are unwise.
This approach is very brute-force. It doesn't take advantage of indices, locality, memory - every computation requires a full scan of the data. I also don't know if the premise of this paper holds - I'm not sure it's a good idea to have people with no knowledge of distributed systems and parallel databases writing distributed applications - you should understand the tools that you're using.
However, one cannot deny that this paper has dramatically changed the course of database evolution.
Problem and Solution:
MapReduce was proposed by Google. It is used to process large amounts of data on thousands of CPUs with high parallelization and distribution, as well as fault tolerance, and it helps the user monitor status and I/O scheduling easily. The Google File System is used to store the corresponding data in the distributed system. In MapReduce, two functions called Map(key, val) and Reduce(key, vals) are always used. Map runs on all data items and creates the data set for each key; in this way, all the data is assigned to groups with different keys. Reduce runs on every key to calculate the result for its group and outputs the result. Faults are handled in a way similar to the Google File System: failures are detected via heartbeats. If a worker fails during the map phase, both its in-progress and completed map tasks are executed again, while a failure during the reduce phase requires re-executing only the in-progress reduce tasks.
The main contribution is a widely applicable parallel computing framework: MapReduce does a great job of simplifying complex distributed computations and improving efficiency. There are also other reasons for its popularity in many fields. One is fault tolerance, handling failures of the workers (and, via checkpoints, the master). The other is status monitoring. With the MapReduce model, progress in related areas like machine learning and indexing should accelerate.
Though MapReduce is a great way to process large amounts of data, I still find some weaknesses in the approach. One is that it cannot do real-time processing, which is a trend in current computer science. Another is that shuffling the data is very expensive, even though the processing itself is fast.
This paper introduces the MapReduce programming paradigm. Companies that run large data centers, like Google, need to process extremely large data sets on thousands of machines for maximum efficiency. However, it can be difficult for programmers to learn how to parallelize their workload to be both scalable and fault-tolerant. MapReduce provides a simple programming model that allows users to easily parallelize their workloads for deployment on such systems.
MapReduce abstracts the workload into two steps: map and reduce. The map operation is when each logical “record” of the input is processed to generate intermediate key/value pairs. Then the reduce operation takes those intermediate values that share the same key and combines the data into the final output. The MapReduce model partitions the large input data into M splits that are each processed locally by a map worker. The intermediate data from the Map operation is stored on local disks and is read by R nodes that each perform the Reduce operation. The final output is then written to a shared data space.
The big advantages of the MapReduce model proposed by the paper are its resistance to failure and its performance consistency. The MapReduce system is fault-tolerant because it can re-execute any map or reduce task that failed, since the computation is purely local except for the results. Combined with its design of forking backup tasks in case current processes straggle or fail, MapReduce showed only a 5% performance overhead even when facing a large number of machine failures.
The paper does an excellent job introducing a novel programming model, and MapReduce shows both great performance and simplicity. However, the paper is a bit light when it comes to comparing the model to other existing parallelization frameworks, or examining how well MapReduce really scales in overall performance.
This paper introduces a programming model, MapReduce, which is designed for parallel computation on large clusters. It divides a parallel process into two phases, map and reduce. Users can specify a map function that splits the input into many intermediate key/value pairs, and a reduce function that merges all the intermediate results into several result files. The underlying parallel processing is handled by the model. This paper also introduces an implementation of MapReduce on GFS clusters. It uses a manager-worker process model: there is a single manager (master) process, which stores the states of the tasks and worker processes and assigns map or reduce tasks to worker processes. This implementation optimizes performance using backup tasks and failure tolerance on the working cluster. It builds an error-handling mechanism in which the master process reassigns a task to a different worker process when the original worker fails to execute it. In practice, this implementation is resilient to large-scale worker failures.
MapReduce is an elegant abstraction over many parallel computing processes that deal with tons of data, and is itself straightforward and simple. It makes it a lot easier for programmers to handle large-scale computing tasks elegantly without worrying about many non-trivial problems like load balancing. The only thing they need to do is fit their problems into the MapReduce programming model. The programming model also comes with a debugging tool, which makes life even easier.
MapReduce shows a great advantage in failure handling in its implementation. Failure tolerance is one of the most desirable features in large-scale computing. Since there may be thousands of machines participating in a computing task, failures on some machines are quite common and have to be handled during the computation. MapReduce meets this requirement well. In the implementation introduced in this paper, the master process detects any possible failure of worker processes and redistributes the corresponding tasks when failures occur. This keeps the computation working even when many machines in the cluster fail at the same time.
However, MapReduce has some drawbacks in nature:
MapReduce is a restricted programming model, and it might not be the most effective approach to a computing task in certain scenarios. It is hard to fit everything into the MapReduce programming model. Some tasks, though inherently parallel, may not suit this model, like OLTP workloads. Many transactions can be executed in parallel, but each of them is really lightweight. As a result, on an OLTP workload, MapReduce may incur huge communication overhead from scheduling the tasks among workers.
It is the famous MapReduce paper. The paper explains the MapReduce programming model, which has enabled the processing of very large-scale data on multiple commodity machines and contributed significantly to the era of Big Data.
MapReduce is basically a two-phase algorithm, consisting of map and reduce functions. A map function takes input key-value pairs and generates intermediate key-value pairs that will be processed by a reduce function. A reduce function sequentially takes an intermediate key and the list of every value that shares that intermediate key, and processes them. Inspired by the map and reduce primitives, the authors found that an infrastructure supporting these two functions and distributing data among multiple nodes works very well for many applications in large-scale data processing.
The paper had a huge impact on the IT industry, because it not only identified a crucial problem of processing very large-scale data at the time, but also suggested a practical programming model that could be used to solve it. If this had been only theoretical, it could not have become as popular as it has. Another big contribution of this paper is that it also addresses a number of essential issues that need to be considered when applying this programming model to multiple commodity machines, such as fault tolerance.
The paper could have been better if it had addressed the issue of data skew as well. The paper does not say much about it, even though it is a pretty common problem when using the MapReduce programming model nowadays. The authors might not have been aware of the problem, or might have thought at the time that it was not as significant as it is considered today. We never know for sure, but I suspect that the authors did not have a clear solution, as it is still a difficult problem to address today.
In conclusion, MapReduce is a programming model that has contributed significantly to the era of Big Data and is still relevant to this date. MapReduce did have its own problems (e.g., it is difficult to program), but it led to many other applications that built upon it or extended it in an attempt to solve those problems and improve the ways of processing very large-scale data even further. It is important to read such a seminal paper if you want to understand the essence of the problems and applications in the Big Data era.
This paper describes Google's MapReduce algorithm. The algorithm was developed to simplify the processing of large amounts of raw data. MapReduce provides a simple programming interface consisting of two user-defined functions. The Map function takes a key/value pair as input and produces an intermediate pair that the library uses to group together values with the same intermediate key. The Reduce function takes a list of values corresponding to an intermediate key and merges the values together. The key contribution of the work is to provide a simple programming interface that allows for automatic parallelization of large-scale computations. The library also has built-in fault tolerance, which is key for the environments it is expected to work in, such as large clusters of machines that may fail at any time.
One of the problems I had with the paper is that I felt there was no real handling of the case where the single master fails. In any environment with a single coordinator, there is also a single point of failure. However, the paper simply notes that because there is only one master, its failure is unlikely, and I do not understand this jump in reasoning. To me, it seems like the MapReduce library assumes that the machine the master is running on will never fail, and this doesn't match the assumptions of the environment it works in. Another problem I have is along the same lines. Suppose the master machine does not fail, but instead gets cut off from the workers by a network partition. In this case, it would assume that all of the workers have failed and attempt to re-execute all of the map tasks. If the partition is later restored, this would invalidate the work of all the workers. It does not look like there is a built-in mechanism that handles this case. I think the library should not assume the master will always be present, and should either have workers check in with the master periodically to make sure it is still present, or have a backup master set up. The backup master would poll the state of the current master and take over in case of failure. This would solve both of the problems I mentioned.
This paper describes the MapReduce programming model and its implementation, especially in Google's environment. MapReduce simplifies very large data processing on large clusters by providing an abstraction that hides the computation parallelization, data distribution, and other details.
MapReduce consists of two user-defined functions, which are inspired by functional languages. The two functions are described below.
1. "Map" takes a set of key/value pairs as input and generates a set of intermediate key/value pairs.
2. "Reduce" takes an intermediate key and a set of values for that key, and merges the set of values together.
The input data and the intermediate data are distributed over several splits, so the map workers and the reduce workers are able to work in parallel. There is one special node called master that assigns work to other worker nodes.
There are several other issues that need to be taken care of in a MapReduce task, such as fault tolerance. Worker failures are handled simply by re-executing the tasks. Master failures could be handled using periodic checkpoints; however, the implementation simply aborts the MapReduce computation once the master fails, since this is very rare. In addition, the choice of M and R (the number of map tasks and reduce partitions) is also an issue. This paper suggests choosing M so that each individual map task processes 16 MB to 64 MB of input, and choosing R as a small multiple (for example, 2.5 times) of the number of workers.
This paper adds several additional features to refine the MapReduce library, such as user-defined partitioning functions, a user-defined partial-merging function called a "Combiner", event counters, and more.
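The "Combiner" refinement can be illustrated with word count, where the combiner performs the same summation as the reduce, just locally at the map worker so fewer pairs cross the network. A rough single-process sketch:

```python
# A combiner partially merges a map worker's output before it is
# shipped to the reduce workers, shrinking the shuffled data.
from collections import Counter

def map_fn(document):
    return [(word, 1) for word in document.split()]

def combine(pairs):
    """Locally merge pairs with the same key (same logic as reduce here)."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())

raw = map_fn("to be or not to be")   # 6 intermediate pairs
combined = combine(raw)              # shrinks to 4 pairs before shuffle
```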
This paper gives an overview of the MapReduce technique, some example tasks suitable for MapReduce, implementation details, and some refinements that can be added on top of the MapReduce model. The model is very simple and straightforward but extremely powerful. We can see it as an example showing the importance of good system abstraction. This paper was published in 2008 and has been cited over 15,000 times. MapReduce serves as a very important basis for many parallel and distributed systems.
Some resources mention that Google has already abandoned its MapReduce system and replaced it with "Cloud Dataflow". It is said that the new system does not have the scaling restrictions of MapReduce.
Motivation for MapReduce
MapReduce is necessary as a technique for processing large datasets, for applications such as processing raw data, like crawled documents and web request logs, and computing derived data, like inverted indices, graph structure representations of web documents, summaries of the number of pages crawled per host, and the set of most frequent queries in a given day. The computations are conceptually simple, but the large amount of input data forces the computations to be distributed across many machines, adding complexity to the problem. MapReduce is thus needed as an abstraction that allows users to express simple computations while, behind the scenes, partitioning the input data, scheduling the program’s execution across the set of machines, managing communication between machines, and handling machine failures. Users can thus easily use the resources of a large distributed system. MapReduce generates intermediate key/value pairs with a map function, and then merges the intermediate results with a reduce function. It is implemented on large clusters of commodity machines in parallel and is highly scalable. Terabytes of data can be processed on thousands of machines with MapReduce. MapReduce makes working with parallel distributed systems easy, and so hundreds of MapReduce programs have been implemented and thousands of jobs are executed on Google’s clusters every day.
Details on MapReduce
A MapReduce computation takes a set of input key/value pairs and produces a set of output key/value pairs. Map takes an input pair and produces a set of intermediate key/value pairs, and intermediate values associated with the same intermediate key are grouped together and sent to Reduce functions. Each Reduce function takes an intermediate key and the set of values associated with that key, merges the set together to form a smaller set of values, and typically outputs one value per invocation. Examples of MR can be seen in distributed grep, where the map function outputs a line if the pattern is matched and the reduce function copies the supplied data to the output, and URL access frequency counting, where the map function processes logs of web page requests and outputs (URL, 1), and the reduce function adds together all values for the same URL to get the total count.

For the implementation of MapReduce, the library in the user program first splits the input files into M pieces and starts up many copies of the program on a cluster of machines. One copy is designated to be the master, which assigns M map tasks and R reduce tasks to the remaining worker copies. The workers assigned map tasks read the contents of their input splits, parse key/value pairs out of the input data, and pass each pair to the user-defined Map function. Intermediate key/value pairs are buffered in memory and periodically written to local disk, partitioned into R regions, whose locations are passed to the master, which is responsible for forwarding them to the reduce workers. The reduce workers use remote procedure calls to read the buffered data from the local disks of the map workers and sort the intermediate keys to group occurrences of the same key together. The reduce workers then iterate through the sorted intermediate data, pass each key and set of values through the user’s Reduce function, and append the output to the final output file of their reduce partition.
The master wakes up the user program after the map and reduce tasks are all completed.
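The URL access frequency example mentioned above might look like the following single-process sketch; the log-line format ("GET /path") is an assumption for illustration:

```python
# URL access frequency as MapReduce: map emits (URL, 1) per log line,
# reduce sums the counts for each URL.
from collections import defaultdict

def map_fn(log_line):
    url = log_line.split()[1]   # assumes "GET /path" style records
    return [(url, 1)]

def reduce_fn(url, counts):
    return url, sum(counts)

logs = ["GET /home", "GET /about", "GET /home"]
intermediate = defaultdict(list)   # in-memory stand-in for the shuffle
for line in logs:
    for url, one in map_fn(line):
        intermediate[url].append(one)
freq = dict(reduce_fn(u, c) for u, c in intermediate.items())
# freq == {"/home": 2, "/about": 1}
```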
When a worker fails, the map or reduce task in progress on the failed worker is reset to idle and becomes ready for rescheduling. Completed map tasks must be re-executed on failure because their output is stored on local disk, but completed reduce tasks do not need to be re-executed because their output is stored in a global file system. MR is resilient to large-scale worker failures because the MR master can simply re-execute the work done by unreachable worker machines while continuing to make forward progress. If the master fails, which is unlikely because there is only one master, the MR computation aborts and clients can retry the operation if desired. In the experimental results, the grep execution can be accomplished in 150 seconds, and there is startup overhead from propagating the program to all worker machines.
Strengths of the paper
I read this paper a couple weeks back because my project for this class is to implement MapReduce for bulk loading in MySQL. I enjoyed reading this paper because it provides many simple examples of applications for MapReduce, and it clearly lays out the steps for implementing MapReduce, which is very applicable for my project. I found it interesting to learn that MapReduce is very fault tolerant, since it can rerun failed procedures while continuing to make progress.
Limitations of the paper
I would’ve liked to see how the performance of the grep example with MapReduce compares to the performance of grep without MapReduce. The paper uses grep as the real-world example, but the grep inputs are loaded directly into the distributed file system, so I would’ve also liked to see how MapReduce performs on task sets that need to manually split each value by a delimiter character into an array of strings.
MapReduce is a concept that was introduced to allow cluster computing on commodity machines and allow high scalability along with fault tolerance through a simple interface. The system was originally developed for Google's cluster system in 2004. The authors motivate their research by discussing the issues of parallelization, distribution of data, handling failures and complex code. The paper discusses the overall architecture and how jobs typically execute in the MapReduce framework. The authors then talk about how failures are handled and how data is distributed by MapReduce algorithms.
This paper makes a strong contribution by allowing jobs to be scheduled and completed reliably and efficiently on large clusters of commodity hardware. On page 2 the paper uses examples to motivate this type of system and shows some clear applications. Fault handling is well explained, and although master failure could cause problems, the authors convincingly argue that it is unlikely. Task locality allows more efficient data distribution. The performance section makes an observation about the types of workloads MapReduce was used for at the time and provides several graphs describing the kinds of run times that can be achieved. The architecture had been used extensively in Google's large-scale systems before publication, which demonstrates its value in real-world systems.
This paper has a few drawbacks in the types of jobs it can handle. The paper appeared in 2004; with the increasing interest in data analytics and machine learning, the example applications in Section 5 may have made up a large subset of real MapReduce programs at the time, but they no longer do. Jobs that reuse the same dataset, a working set, could run much more efficiently if it were cached in memory on individual nodes. Additionally, map tasks need not write their output to files; it could be streamed to reduce workers to lower latency.
Part 1: Overview
This paper presents a great programming model that is naturally parallelizable and well suited to big data processing. In the map function, data is partitioned into key-value pairs and stored locally, while in the reduce function, key-value pairs are merged and final results are computed over the entire global dataset. A very straightforward example is counting words in a huge number of documents: key-value records of word-document pairs can be created in the map function, collected in the reduce function, and finally the count of each word calculated. Notice that both phases are internally parallel, and reduce workers can even begin fetching intermediate data before every map task has finished.
The map function takes raw input data, in the form of key-value pairs, and creates another set of key-value pairs. The MapReduce library automatically collects all intermediate records and groups them by key. Typically the reduce function generates just zero or one output value per key. Examples include distributed grep, counting URL access frequency, building a reverse web-link graph, computing term vectors per host, inverted indexing, and distributed sort.
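The word-count pipeline described above can be sketched in a few lines of Python; `map_fn`, `reduce_fn`, and `run` are hypothetical names standing in for the user code and the library's grouping step, not the paper's actual API:

```python
from collections import defaultdict

def map_fn(doc_name, text):
    # User map function: emit an intermediate (word, 1) pair per word.
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    # User reduce function: sum all partial counts for one word.
    yield word, sum(counts)

def run(documents):
    # The "library" part: group intermediate pairs by key, then reduce.
    groups = defaultdict(list)
    for name, text in documents.items():
        for key, value in map_fn(name, text):
            groups[key].append(value)
    return {k: v for key, vals in groups.items() for k, v in reduce_fn(key, vals)}

result = run({"d1": "a b a", "d2": "b c"})
# result == {"a": 2, "b": 2, "c": 1}
```

The real system runs the same three phases, but with the map and reduce calls spread over many machines.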
Part 2: Contributions
As big data analysis thrives in industry, we badly need a programming model suitable for processing huge datasets. This highly parallel mechanism arrived just in time.
Locality optimization is highly valued: network bandwidth is a scarce resource, and the MapReduce mechanism fully exploits local computation in the map phase.
They implemented the MapReduce algorithm on clusters of thousands of machines. The whole system is highly scalable and helps solve the big data problems encountered at Google.
Part 3: Drawbacks
There is no support for atomic two-phase commit of multiple output files, so some form of concurrency control must be taken into consideration by the user.
When recovering from a worker's failure, all reducers must be notified and may need to redo some work. For example, if a mapper dies in the middle of generating its result key-value pairs, all reducers must ignore the partial data produced by that mapper, which increases communication and scanning overhead. This is a tradeoff made for parallelism.
The paper discusses a parallel programming model called MapReduce. MapReduce provides a simple parallel programming interface: programmers need only write map and reduce functions. The execution of programs written this way is automatically parallelized and executed on a large cluster of commodity machines. A MapReduce job is divided into a number of map and reduce tasks managed by one special process called the master. Each map task processes part of the input data and produces intermediate data; the reduce tasks remotely read the intermediate data from the mappers and aggregate the results. MapReduce provides several mechanisms for efficient and reliable execution of applications on a cluster of commodity machines.
One of the features MapReduce provides is fault tolerance. In a low-cost cluster of commodity machines, machines running map or reduce tasks fail frequently. In such cases, the MapReduce framework re-executes those tasks on another healthy machine in the cluster. In addition, the framework leverages the GFS file system to schedule map tasks local to their data, which avoids extra communication and conserves network bandwidth. Furthermore, some machines may be slow and take a long time; in this case the framework starts backup copies of the remaining in-progress tasks toward the end of the job and uses whichever finishes first. Beyond this fundamental functionality, more refinements are discussed in the paper. One refinement is that the user can define a combiner function, which aggregates intermediate data on the mapper side; this conserves network bandwidth because only aggregated data is sent to the reducer.
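The combiner refinement can be illustrated with a small sketch; the `combine` function and the sample pairs are illustrative, not the paper's code:

```python
from collections import defaultdict

def combine(pairs):
    # Combiner: pre-aggregate (word, count) pairs on the map side so only
    # one pair per distinct word is shipped across the network to the reducer.
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())

mapped = [("the", 1), ("cat", 1), ("the", 1), ("the", 1)]
combined = combine(mapped)
# combined == [("the", 3), ("cat", 1)]: 2 pairs shipped instead of 4
```

This works precisely because the reduce function (summation here) is associative and commutative, so applying it early cannot change the final answer.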
The main strength of MapReduce is that it allows users to employ their low-cost commodity machines for large-scale data processing. This lets companies that cannot invest in specialized servers use their networks of commodity machines for data-intensive applications. In addition, MapReduce hides all the messy requirements of cluster computing, including fault tolerance, data scheduling, and communication management. By hiding these requirements, MapReduce makes programming easier: programmers are only required to write the map and reduce functions.
The main limitation of MapReduce is that it applies only to applications that can be easily expressed with map and reduce functions; not all applications can be written this way. In addition, it is suitable only for batch processing. It is not effective for interactive and real-time operations, where the latency of a particular operation matters a great deal.
In this paper, a famous and powerful tool, MapReduce, is introduced. The tool was initially developed at Google in 2003. It is a programming model and an associated implementation for processing and generating large data sets. To use the tool, users specify a map function and a reduce function to transform key/value pairs into new ones and to merge all the intermediate values. The tool is designed to run on a cluster of machines without users worrying about partitioning, parallelism, or fault tolerance; the library takes care of these and uses the machines efficiently regardless of what map and reduce functions a user provides.
Many implementations are possible, but the final decision depends on the environment. At Google, the machines in the clusters were cheap and could fail, connected in a local network where aggregate bandwidth was large but still a scarce resource. It was a single-master, multiple-worker setup.
Some optimizations in Google's implementation were mentioned in this paper. The one I like most is the concept of "backup tasks," which compensates for slow machines and helps the job finish faster near the end.
I really enjoyed reading this paper. The arguments and designs are straightforward, and the design choices are supported by real-world examples. They even provided complete sample code at the end of the paper, which demonstrates how easy MapReduce is to use.
As far as I can tell, the paper doesn't report a parallelism index or speedup for its examples, which is very important for evaluating the efficiency of this parallelism library. Also, no future work is mentioned in this paper. Some questions should be considered, such as what the bottleneck will be when available machines become better and faster, and how newer versions of MapReduce will take advantage of those improvements.
This famous paper presented a restricted programming model called MapReduce, which handles parallelism almost automatically. The authors were inspired by functional programming, and each MapReduce program consists of map and reduce stages: the map stage maps each data entry to a key-value pair, and the reduce stage combines all values with the same key. Using this programming model, they were able to build a system that parallelizes automatically. A MapReduce program runs on a cluster of commodity machines. One machine acts as the master and assigns jobs to the other machines, each of which can receive map or reduce jobs according to the master's schedule.
Since the system runs on a cluster of commodity-level machines, there are other design issues that need to be considered; I feel two of them are most important. One is fault tolerance: they deal with this problem by re-computation in case of worker machine failure, and simply abort on master failure. The other is locality: the master uses information retrieved from GFS to minimize data movement.
A few possible refinements are also discussed. I think the most important is the combiner, which can be used to merge local key-value pairs and thus further decrease network usage.
I think the biggest contribution is that it borrowed the idea from functional programming and implemented a real system that shows the power of functional-programming-style programs for data manipulation. In doing so, they solved part of the parallelism problem, which will help a great deal in the big data era.
What I like the most of this paper is its simplicity. The idea is quite simple once understood. Everything seems to be straightforward, but also powerful.
Although the idea of MapReduce is simple and powerful, there are a few weaknesses. The most significant is the limitation to two stages: a MapReduce program can have only two stages, and those stages can only be map and reduce.
MapReduce is a programming model developed by Google, built specifically for large datasets, that runs on a distributed system spanning thousands of nodes.
The programmer is expected to write map and reduce functions that the master will apply across multiple machines. The map function processes a key/value pair to generate a set of intermediate key/value pairs, and the reduce function pulls these values from the nodes holding the intermediate data and merges all the values associated with the same intermediate key. The input data, managed entirely by GFS, is stored on local disks, which helps reduce the latency that would otherwise be introduced by the network.
One of the biggest contributions of this model is that it leverages the processing power of multiple nodes without the programmer having to write code specifically to parallelize the process, since that is handled by the model. One other thing I found interesting, keeping in mind the importance of sequential execution from our previous lectures, is that if the map and reduce operators are deterministic functions of their input values, the authors claim the output of the distributed system is exactly the same as that of a non-faulting sequential execution of the same program. I also liked the option whereby the MapReduce library detects which records cause deterministic crashes and skips those records in order to continue with the execution.
One disadvantage, as pointed out in the paper "A Comparison of Approaches to Large-Scale Data Analysis," is that the programmer is expected to infer the schema of the data they are handling from the map and reduce functions written by a previous programmer. Though these functions are overall simple to write, it could be difficult to understand the schema structure from the code alone. I am also not sure how this model would handle complicated requirements where the lack of integrity constraints may be an issue. Perhaps this is why this kind of model is limited to analytical or aggregate queries where strict accuracy is not the key requirement.
MapReduce: Simplified Data Processing on Large Clusters
In this paper, a new programming model for simplified data processing called MapReduce is introduced. This new model provides a simple and powerful interface for automatically parallelizing and distributing large-scale computations. The problem that inspired Google to develop this novel data processing idea is the large amount of raw data from the web, such as web document crawling, URL request analysis, and inverted index construction for search engines. To tackle those large-scale problems efficiently, the MR process works as follows:
1. Assign one job master, M mappers, and R reducers.
2. The master splits the input file into M pieces for the mappers; intermediate output is partitioned into R pieces for the reducers.
3. Each mapper takes in raw (key, value) pairs and emits new intermediate (key, value) pairs.
4. Once the intermediate files assigned to a reducer have completed the mapping stage, the reducer sorts them by key, runs the reduce function on each key, and writes its output into a single shard.
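The four steps above can be condensed into a single-process sketch. The split counts, `map_fn`, and `reduce_fn` are illustrative stand-ins for the user code, and the partitioning mirrors the default hash(key) mod R scheme:

```python
from collections import defaultdict

M, R = 3, 2  # number of map pieces and reduce partitions (illustrative)

def map_fn(line):
    # Hypothetical user map function: emit (word, 1) per word.
    for word in line.split():
        yield word, 1

def reduce_fn(key, values):
    # Hypothetical user reduce function: sum the counts for one key.
    return key, sum(values)

def mapreduce(lines):
    # Step 2: the master splits the input into M pieces (round-robin here).
    splits = [lines[i::M] for i in range(M)]
    # Step 3: mappers emit intermediate pairs, partitioned into R regions
    # by the default hash(key) mod R rule.
    regions = [defaultdict(list) for _ in range(R)]
    for split in splits:
        for line in split:
            for key, value in map_fn(line):
                regions[hash(key) % R][key].append(value)
    # Step 4: each reducer sorts its region by key, reduces every key,
    # and writes one output shard.
    return [[reduce_fn(k, region[k]) for k in sorted(region)]
            for region in regions]

shards = mapreduce(["a b", "b c", "a"])
```

In the real system each split and each region lives on a different machine; here they are just Python lists, but the data flow is the same.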
Map-Reduce is becoming more and more popular because of the following strengths:
First of all, it executes large-scale computations efficiently with automatic parallelization. As shown in the performance tests, grep scans through 10^10 100-byte records in roughly 150 seconds.
Secondly, with a single master and its data partitioning scheme, MR strikes a good trade-off between fault tolerance and system efficiency. On a machine failure, only that machine's map tasks need to be redone, which is fast (no sorting needed). Even when a large number of machines fail, the whole program does not need to be aborted.
Thirdly, flexibility. There are essentially no requirements on input file types, unlike many DBMSs, which support only a few data types. In MR, as long as you can map the input file into (key, value) form, the system will take it.
However, there are still some weaknesses in this paper. First, the hardest part of the MR algorithm is managing the sharding of the intermediate files: if the master pre-assigns a key range to each reducer, load imbalance between reducer machines becomes possible, and this is not explained clearly in the paper. Second, because the interface is so simple, the degree of synchronization between machines is limited by the design itself. There is no way to run a reduce program that uses a shared data structure, for instance an algorithm that operates on global shared state.
This paper talks about MapReduce, a programming model that can be automatically parallelized and executed on a large cluster of computers. It can process and generate large data sets in a parallel, fault-tolerant, and load-balanced manner.
There are two steps in MapReduce:
(1) The map function processes an input key/value pair to generate a set of intermediate key/value pairs. The library then groups together the intermediate values with the same key and passes them to the reduce function.
(2) The reduce function receives an intermediate key and a list of values with that key, and merges these values together to form a smaller set of values.
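The library's grouping step between (1) and (2) can be sketched as a sort followed by a group-by; the `shuffle` helper is a hypothetical name for what the runtime does internally:

```python
from itertools import groupby

def shuffle(intermediate_pairs):
    # Sort by key, then hand each key plus the list of all its values
    # to the reduce function, as the library's grouping step does.
    pairs = sorted(intermediate_pairs, key=lambda kv: kv[0])
    return [(key, [v for _, v in group])
            for key, group in groupby(pairs, key=lambda kv: kv[0])]

grouped = shuffle([("b", 1), ("a", 1), ("b", 1)])
# grouped == [("a", [1]), ("b", [1, 1])]
```

Note that `groupby` only merges adjacent runs, which is why the sort must come first; the real system likewise sorts each reducer's partition by key before reducing.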
The task is executed on a cluster of computers, one of which is the master. The master stores metadata about the task and the workers and performs the scheduling. The input data is partitioned into M pieces; the map functions read these pieces and write intermediate results to a local buffer, periodically spilling to local disk. The reduce workers read the data from the map workers' disks and store the results in output files.
The master pings the workers periodically and thus knows which workers have become unavailable. It asks other workers to re-execute the work of the dead worker. The master also considers the location of the input files and attempts to schedule each map task on a machine that contains a replica of the corresponding input data, to reduce communication.
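The ping-and-reschedule logic can be sketched as follows; the `Master` class, its timeout value, and the method names are all invented for illustration, not taken from the paper:

```python
import time

class Master:
    # Toy sketch of the master's failure detection: tasks on a worker that
    # misses its ping deadline are reset to idle and rescheduled elsewhere.
    TIMEOUT = 10.0  # seconds without a ping before a worker is presumed dead

    def __init__(self, workers):
        self.last_ping = {w: time.monotonic() for w in workers}
        self.tasks = {w: [] for w in workers}   # in-progress tasks per worker
        self.idle = []                          # tasks awaiting (re)scheduling

    def ping(self, worker):
        # Called when a worker responds to the master's periodic ping.
        self.last_ping[worker] = time.monotonic()

    def check_workers(self, now=None):
        now = time.monotonic() if now is None else now
        for worker, last in list(self.last_ping.items()):
            if now - last > self.TIMEOUT:
                # Worker presumed failed: its tasks become idle again.
                self.idle.extend(self.tasks.pop(worker))
                del self.last_ping[worker]

m = Master(["w1", "w2"])
m.tasks["w1"] = ["map-3"]
m.check_workers(now=time.monotonic() + 60)  # both workers miss the deadline
# "map-3" is now back on the idle list, ready for another worker
```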
In the end, the paper provides a performance evaluation of MapReduce on grep and sort, along with the effect of backup tasks and failures.
This paper describes MapReduce, a very important technology in big data. It is a parallel and distributed computing model, and the paper shows how the model works, how to use the MapReduce library, and design considerations such as fault tolerance and locality optimization. However, the model also has drawbacks:
(1) When heavy interleaving of intermediate data is needed to compute the final result, the computation becomes much more complex and expensive (a hard-to-parallelize problem).
(2) The reduce phase involves a great deal of communication.
The authors present the MapReduce programming model for running computations on large datasets. Processing large datasets that are spread across hundreds or thousands of machines requires balancing load, parallelizing computation, distributing data, and tolerating faults. To unify many existing systems which solve this problem, the authors present MapReduce, a programming model that hides much of the work involved in running such computations and scales to large data sizes.
The authors draw inspiration from functional programming models. By having users run all computation within map and reduce functions, the framework is easily able to distribute computation and provide fault-tolerance on commodity hardware.
MapReduce has demonstrated its usefulness by providing a scalable, unified system on which to build distributed grep, process logs, create web-scale graphs, identify important terms, generate inverted indexes, and run distributed sort among other applications.
The MapReduce framework is quite useful because of its simplicity. A user needs to only provide map and reduce functions (and optionally a combiner and partitioner for increased efficiency). However, MapReduce is not well-suited for all workloads. Not all data processing fits the map then reduce model of computation, and MapReduce is especially ill-suited for iterative and pipelined computation due to the fact that it dumps all data to distributed filesystem at every stage.
This paper introduces MapReduce as a programming model that can run on a large cluster of machines to improve the run time of processing large data sets. With larger and larger data sets that companies have, we need to have a way to process all of that information as efficiently as possible. MapReduce brings a new useful algorithm for this purpose.
Essentially, MapReduce is broken into two parts: mapping and reducing. During the mapping phase, the algorithm takes a key/value pair and produces intermediate key/value pairs. All values for the same intermediate key are combined into a list, which is passed to the reduce part of the algorithm. Reduce then merges all the values for each intermediate key to form a final set of values.
In the implementation in the paper, a master program assigns work to each of the workers. All the other processes are workers that either map or reduce. Map workers store intermediate files on their local disks, and reduce workers read those files from the map workers.
Since this runs on a distributed system, we need to account for failures of both the workers and the master. If a worker fails, the master's ping to it will go unanswered, and the task that was running on that machine will be rerun by another machine. Since the chance of the master failing is minimal, the current MapReduce job is simply aborted in that case. Furthermore, because we don't know how long jobs will take, there should be many more tasks than workers, which minimizes the impact of a straggler.
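The paper's related mechanism for stragglers is backup tasks: near the end of a phase, the master schedules duplicate executions of the remaining tasks and takes whichever copy finishes first. A toy model of the effect, with all the numbers invented for illustration:

```python
def finish_time_with_backups(primary_times, backup_times):
    # Each task finishes as soon as either its original attempt or its
    # backup attempt completes; the phase ends when the slowest task does.
    return max(min(p, b) for p, b in zip(primary_times, backup_times))

primaries = [10, 11, 95]  # the third machine is a straggler
backups = [12, 12, 12]    # duplicates scheduled near the end of the phase
# Without backups the phase would take max(primaries) = 95 time units;
# with them it takes finish_time_with_backups(primaries, backups) = 12.
```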
Overall, this paper does a great job of introducing MapReduce and explaining why it is needed in the industry today. However, I have a few concerns about the paper:
1. The paper explains that there should be more tasks than workers, but it never tests this ratio to find an optimum. I would have liked to see more experiments with different setups to see how it affects the runtime.
2. The implementation for MapReduce is on a shared nothing system. Would the implementation be any different on a shared memory system? If so, which is better and why?
This paper discusses Google’s MapReduce library. MapReduce was designed in order to simplify the task of writing code for parallel execution in a distributed system, which may otherwise be a daunting task for programmers with little experience in such a framework. Many programs, such as sorts and inverted indexes, are quite simple to code for a single machine, but operate on such large datasets that parallelization is required in order for tasks to finish in a reasonable amount of time. The code required to manage data in a distributed system can become very complex very quickly. With MapReduce, Google created a library with a simple API that handles parallelization and fault tolerance for the programmer.
The key components of MapReduce are the Map and Reduce methods. Map takes an input key-value pair and produces a set of intermediate key-value pairs. All intermediate pairs that share a key are sent to the same Reduce function, which converts these intermediate pairs into a set of output values, which are often much smaller than the input. This framework can be used to easily write programs for a host of applications, ranging from distributed pattern search, to counting URL access frequency, to creating inverted indexes.
The fault tolerance model for MapReduce is fairly simple. The master server stores the machine ID for each map and reduce task as well as the task's state, which may be idle, in-progress, or complete. The master pings each server frequently, and if it does not receive a response, it switches the state of any in-progress tasks on that machine to idle, allowing them to be rescheduled for execution on another server. Experiments showed that this method of fault tolerance is fairly efficient, with only a 5% increase in runtime for a cluster in which 200 out of 1800 servers were manually killed in the middle of execution.
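The state transitions described here can be sketched roughly as follows; the task-table layout and the function name are assumptions for illustration, not the paper's data structures:

```python
IDLE, IN_PROGRESS, COMPLETE = "idle", "in-progress", "complete"

def handle_worker_failure(task_table, failed_machine):
    # Reset every map task on the failed machine to idle, even completed
    # ones, since their output lived on that machine's local disk. Reduce
    # tasks only go back to idle if still in progress: completed reduce
    # output is already safe in the global file system.
    for task, info in task_table.items():
        if info["machine"] != failed_machine:
            continue
        if task.startswith("map") or info["state"] == IN_PROGRESS:
            info["state"], info["machine"] = IDLE, None

tasks = {
    "map-1":    {"state": COMPLETE,    "machine": "A"},
    "reduce-1": {"state": IN_PROGRESS, "machine": "A"},
    "reduce-2": {"state": COMPLETE,    "machine": "A"},
}
handle_worker_failure(tasks, "A")
# map-1 and reduce-1 return to idle; reduce-2 stays complete
```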
This paper has many strengths, but I would have liked to see an example of code written with MapReduce compared to the equivalent code written without the MapReduce framework, perhaps in a second appendix. The primary rationale for this paper is that writing code for parallel and distributed systems is difficult for users with no experience doing so, but the same can be said of many things in the programming field: multi-threading is a difficult concept at first, yet many programmers need to learn it. I have no idea how complex the code for a hand-parallelized application would be, so it is hard to fully appreciate the contribution that MapReduce makes without seeing a side-by-side comparison of code written with and without the framework.
This paper introduces MapReduce, which is a programming model and an associated implementation for processing and generating large data sets. Since many modern databases are distributed and parallel on clusters, it is convenient to provide a run-time system that takes care of the details of various work, such as partitioning the input data, scheduling the program’s execution, handling machine failures, and managing the inter-machine communication. Therefore, the authors implemented MapReduce to achieve this goal.
First, the paper introduces the model of MapReduce. The model contains two functions written by users, Map and Reduce. The Map function takes an input pair and produces a set of intermediate key/value pairs. Then, the MapReduce library groups all intermediate values associated with the same intermediate key and passes them to the Reduce function, which merges the values together to form a possibly smaller set of values.
Second, the paper talks about the implementation of the MapReduce model and its refinements. The master keeps several data structures for the cluster, including the state and identity of each worker machine. In addition, since the data is distributed across many machines, the MapReduce library has to tolerate machine failures gracefully, both worker failure and master failure. For model refinement, the paper discusses the data partitioning function, ordering guarantees, local execution, and so on. These refinements make the model more efficient.
The strength of the paper is that it provides a clear description of the MapReduce programming model, including the motivation, introduction, implementation, and performance. It is a good read for anyone interested in parallel and distributed databases.
The weakness of the paper is that it has few examples to illustrate its ideas. I think it would be better to provide several related examples when discussing the implementation or the structure of the MapReduce model; this would make the ideas clearer to readers.
To sum up, this paper introduces MapReduce, which is a programming model and an associated implementation for processing and generating large data sets. It hides the details of parallelization, fault-tolerance, optimization and load balancing for the users, and has been successfully used at the company for many purposes.
This paper is an introduction to the MapReduce interface and programming style used for large-scale parallel computation. MapReduce was created by Google and faced many of the same problems GFS did: it is aimed at very large-scale problems and must be capable of handling system failures.
The high-level overview: a task is split into a map phase, which takes input files and produces key/value output pairs, and a reduce phase, which takes the map's key/value output pairs and combines them into the final output. When a MapReduce task starts, one machine is assigned as the master, and the other workers are assigned to the map portion or the reduce portion. When a map task finishes, its output is written to local disk so as to avoid network clutter, which I thought was an interesting and important point. They also mention that GFS is used to store the input and final output files, which I thought was cool. The reduce phase then pulls the intermediate output, and the reduce workers process it as it becomes available. The master machine is responsible for coordinating all of this.
A significant proportion of this paper is dedicated to machine failure. To summarize: the master pings each worker frequently, and if a machine fails, the master assigns its tasks to another machine and continues on without a problem; this is expected to happen relatively frequently. If the master fails, its progress could in principle be logged and restarted from a checkpoint; however, the initial implementation did not do this and instead simply aborted the task to be retried, since with only one master its chance of failure is relatively low. Another issue they discussed, which isn't quite failure, is a straggling machine that takes longer than it should. The solution: as either the map or reduce phase nears completion, start duplicating the remaining tasks across machines, so that if one machine straggles, another will have finished its work quickly.
The results section of this paper focused on how quickly MapReduce works given backup tasks and failures. The results were impressive: in one task where 200 of the 1,746 worker machines were deliberately killed, the task slowed down by only about 50 seconds. I found this very impressive.
1.) I would have liked to see a comparison of how long a task takes with MapReduce versus how long it would have taken on the same machines before MapReduce was created; basically, how much MapReduce improved on the pre-existing methods.
2.) I would have liked a discussion of what types of programs cannot be expressed as MapReduce programs. The authors did a great job of providing examples of programs that can be, but there is a problem now where people try to use MapReduce for things that really shouldn't be MapReduced, and I would have liked the paper to discuss what MapReduce is bad at.
Overall, this was a good paper and very influential in the industry. It was definitely a good read.
Map-reduce was inspired by the rising need for large-scale data processing, especially in distributed systems. The number of applications that require handling and summarizing or evaluating massive amounts of information is steadily increasing, as is the need for a parallel, fault-tolerant system that can do so efficiently while providing a clean layer of abstraction for programmers to interface with.
The key idea behind map-reduce is that many tasks are, using terms from a previous paper, "distributive," i.e., commutative in evaluation. An example is counting word occurrences: partial word counts from scans over any split of a text corpus can be summed to achieve the same final result. Thus, each key/value pair of data is "mapped" to an intermediate value, and these intermediate values are "reduced" (i.e., aggregated) to effect the desired result. The architecture of the system resembles the intuition of GFS, i.e., use many smaller-scale machines to accomplish a larger task, and try to have machines assigned a "map" task be physically close to the actual data. One implementation detail I found interesting was the detection of records that repeatedly cause failures, and the skipping of those records on subsequent executions (this may bring up some issues in dealing with vendor-specific data).
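The "distributive" property can be checked directly: splitting a corpus, counting each part, and merging gives the same counts as one pass over the whole. The helper names here are mine, not the paper's:

```python
def word_count(words):
    # One sequential pass over a list of words.
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts

def merge(a, b):
    # Merging partial counts is just addition, which is commutative and
    # associative, so any split of the corpus yields the same final result.
    out = dict(a)
    for w, c in b.items():
        out[w] = out.get(w, 0) + c
    return out

corpus = "the cat sat on the the mat".split()
whole = word_count(corpus)
split = merge(word_count(corpus[:3]), word_count(corpus[3:]))
# whole == split, regardless of where the corpus is cut
```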
While this is a very useful and elegant library, it is important to note that the concept of commutativity itself is not new or novel; I think the important part of the paper is making such an architecture available for distributed storage systems and very accessible to developers who would like to use it. One drawback I can see is that, in the way the paper describes it, reduces cannot execute until all corresponding maps have completed. That is, a single slow machine can bottleneck the entire process, but since things are being executed in parallel, this may not incur as drastic an expense as one would expect.
The paper talks about MapReduce, a programming model and an associated implementation for processing and generating large datasets. The idea came about because at Google, although most of the computations are straightforward, they involve large amounts of raw data and must be distributed across hundreds of machines.
The paper explains the programming model of MapReduce, which is simple: it takes a set of input key/value pairs and produces a set of output key/value pairs. Map, written by the user, takes an input pair, produces a set of intermediate key/value pairs, and passes them to the reduce function; the Reduce function then merges the values together into a smaller set of values. Examples of programs that use MapReduce include distributed grep, counting URL access frequency, reverse web-link graph, term vector per host, inverted index, and distributed sort. Next, the paper explains the implementation. The input data is split into 16-64 MB pieces, and the program is copied to a cluster of machines (workers), with one copy of the program acting as the task coordinator (master). The coordinator assigns map tasks to workers, which read the corresponding input splits. The result of each map task is buffered and stored on the worker's local disk, whose location is reported to the master; the master forwards these locations to the workers assigned reduce tasks, which perform the reduction and produce output files containing the results. The rest of the section discusses the master's data structures, fault tolerance (worker failure, master failure, semantics in the presence of failure), locality, task granularity, and backup tasks. Next, it discusses refinements to the MapReduce computation, including the partitioning function, ordering guarantees, combiner functions, input and output types, skipping bad records, side effects, local execution, status information, and counters. The following section shows the performance of MapReduce on grep and sort programs. The paper then discusses how MapReduce is used at Google and how it fits many of the computations Google needs, and finally covers related work.
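The partitioning-function refinement mentioned here can be sketched as follows; `default_partition` and `hostname_partition` are illustrative names for the default hash(key) mod R scheme and the paper's URL-by-host example:

```python
def default_partition(key, R):
    # Default partitioning: hash the key and take it mod R, so each
    # intermediate key lands deterministically in one of R regions.
    return hash(key) % R

def hostname_partition(url, R):
    # A user-supplied partitioner, like the paper's example: keys that
    # are URLs from the same host end up in the same output file.
    host = url.split("/")[2] if "//" in url else url
    return hash(host) % R

# Both URLs share a host, so they map to the same reduce partition:
p1 = hostname_partition("http://example.com/a", 4)
p2 = hostname_partition("http://example.com/b", 4)
assert p1 == p2
```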
The main contribution of this paper is that it proposes a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs. It also helps that the paper is written very clearly.
However, I also wish the paper explained more about bad records. Bad records are skipped, but is there any detection of when the number of skipped records is large enough to influence the overall computation result? Also, I think the paper does not consider the case where a user may want to reuse the data. Does that mean the data has to go through the whole Map/Reduce process again? Wouldn't this take a lot of space on local disk? And how does the system prioritize between re-computation and new computation?--
The purpose of this paper is to introduce a new library and accompanying framework to assist users in writing distributed and/or parallel queries over very large datasets. The authors present a highly scalable framework that is user-tunable and fault tolerant. |
The technical contributions of this paper are numerous. MapReduce is now widely used in large distributed systems and provides an ease of use that was missing before it was introduced. The paper presents the MapReduce framework, which relies on two user-specified functions to compute queries: a mapper and a reducer. These functions aggregate values over large datasets, which are processed on a system of many worker nodes, each operating on a partition of the data. After presenting examples and the basic structural breakdown and implementation, the authors discuss several very useful additions they have made to the system to increase functionality and allow further user tunability.
I think one of the main strengths of this paper is the number of examples they present. For readers not very familiar with the concepts in functional programming languages that MapReduce is modeled after, these examples are very instructive and enlightening. Additionally, as mentioned above, the paper presents many additional extensions to their system that are logical and at least seem useful. I also appreciate the real-world examples included that detail where within Google MapReduce is actually utilized.
As far as weaknesses go, I am disappointed in the amount of experimental results included in this paper. They do examine some specific example cases, but I was hoping for a more thorough evaluation of the system and of the performance improvements it offers over either existing systems or user-written code that is not as optimized for parallel and distributed execution. I also would have liked to see some tradeoffs between the parameters they mention (mostly the number of mappers vs. the number of reducers) rather than just the table listing the numbers most commonly used in practice.
Review: MapReduce: Simplified Data Processing on Large Clusters|
This paper describes MapReduce, a programming model and implementation for processing and generating large data sets. The model consists of a map function and a reduce function, both provided by the user, and the goal is to distribute computations, which are usually conceptually simple but heavy due to the amount of data, across large clusters. The advantages of the proposed model are a simple yet powerful interface that enables automatic parallelization and distribution of large-scale computations, together with a high-performance implementation of that interface.
The paper is very clear and detailed in describing the interface. More importantly, the design of the implementation is an even stronger point of the work. A nice aspect of the implementation is that it takes the particular environment, Google's clusters, into account and makes trade-offs accordingly to optimize performance.
There are also some issues with the paper. For example, it does not adequately justify the decision not to handle master failure. The paper argues that this can be omitted because there is only one master, but it doesn't say whether that is the common case in similar applications, or what happens when it is not. In the experimental section, where performance is measured and compared, the demonstration would be more convincing if the authors provided some information about the state of the machines. As mentioned earlier in the paper, the number of machines is very large, so machine failures are inevitable, but there is no guarantee that the failure rate was consistent across the compared runs. Besides, the paper also mentions that the machines were not monopolized by the experiments, which means other jobs could have slowed down the machines at random times. These factors damage the credibility of the experimental results.
This paper introduces MapReduce, a programming model which could be used to process or generate large data sets on large clusters with automatic parallelization and transparent fault-tolerance and load-balancing.|
This paper introduces MapReduce, a parallel programming model with an associated large-scale fault-tolerant implementation that can process large amounts of data while remaining very easy to program on the client side.
The programming model was inspired by the map and reduce operations of functional languages such as Lisp. It is a general, high-level abstraction that is both easy to use for almost any task and very easy to scale with Google's implementation of the MapReduce framework. Google built the system so that clients who use MapReduce only need to write a Map function, which produces the intermediate key/value data, and a Reduce function, which takes the intermediate results, groups them by key, and processes the values to produce the final result. Users need not worry about distributed-system details; those are taken care of by the MapReduce system. The paper also mentions many optimizations, for example failure handling, the use of GFS, and the private files and global data structures maintained by the single master. Moreover, the paper discusses advanced optimizations and extensions to the original MapReduce functionality for performance. At the end, the authors evaluate the effectiveness of MapReduce by running multiple programs under different configurations.
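The grouping of intermediate results by key mentioned above depends on a partitioning step: each map worker splits its intermediate pairs into R regions, one per reduce task, using the paper's default partitioner hash(key) mod R. A toy in-memory sketch of that step (Python's built-in `hash` stands in for the framework's hash function; this is an illustration, not the paper's code):

```python
R = 3  # number of reduce tasks (illustrative choice)

def partition(key, num_reducers=R):
    # The paper's default partitioning function: hash(key) mod R.
    return hash(key) % num_reducers

# Intermediate pairs produced by one map worker.
intermediate = [("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]

# Split the pairs into R regions, one per reduce task.
regions = {r: [] for r in range(R)}
for key, value in intermediate:
    regions[partition(key)].append((key, value))
```

Because the partitioner is deterministic, all pairs sharing a key land in the same region, so a single reduce task sees every value for that key.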
1. This paper introduces MapReduce, a programming model that is highly scalable, easy to program, and capable of implementing most real-world algorithms.
2. This paper also introduces the distributed system that the MapReduce programming model runs on, and gives a detailed description of the implementation along with its design considerations.
3. This paper shows several ways to use the MapReduce model to implement different functionalities, along with their performance, which is both very impressive and helps the reader understand how MapReduce is used.
1. Although it is true that MapReduce can implement many general functionalities quickly and efficiently, it has its limitations; for example, many machine learning algorithms that need several iterations may take a long time to converge, since each iteration must communicate with GFS to store intermediate results.
2. In the evaluation, the paper only measures performance under different configurations without comparing the results against other systems. It would be more convincing if the authors provided some comparisons with alternatives.
This paper introduces MapReduce, a programming model and an associated implementation for processing and generating large data sets. It exploits the inherent parallelism in the workload to split it into multiple independent sub-tasks that can be executed simultaneously. MapReduce consists of two stages: the Map stage, which reads data from the distributed file system and performs filtering or transformation, and the Reduce stage, which aggregates the shuffled output of the Map stage. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. |
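The failure handling mentioned above boils down to a simple policy: a task assigned to a failed worker is returned to the idle pool and rescheduled on a live worker. A toy sketch of that policy (the names and round-robin assignment are my own simplification, not the paper's actual master logic; it assumes at least one worker stays alive):

```python
import itertools

def run_tasks(tasks, workers, failed):
    # Toy master loop: hand each idle task to the next worker in a
    # round-robin cycle; a task landing on a failed worker goes back
    # into the idle pool and is re-executed elsewhere.
    assignments = {}            # task -> worker that completed it
    idle = list(tasks)
    worker_cycle = itertools.cycle(workers)
    while idle:
        task = idle.pop(0)
        worker = next(worker_cycle)
        if worker in failed:
            idle.append(task)   # re-execute on another worker
        else:
            assignments[task] = worker
    return assignments

done = run_tasks(["m1", "m2", "m3"], ["w1", "w2"], failed={"w2"})
# All three tasks eventually complete on the surviving worker w1.
```

In the real system the master also uses periodic pings to detect failures and tracks task state, but the re-execution idea is the same.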
MapReduce is a simple programming model that abstracts away the management of a parallel and distributed system. This paper implements the model for a cluster of commodity machines.
The main advantages of MapReduce include:
1. Fault tolerance. In this model the master monitors all the workers periodically. It restarts the jobs from failed workers and reassigns their tasks to free ones. In a typical cluster the failure of a single node is very common, so fault tolerance is a crucial feature of a distributed computing system. In the MapReduce model the failure of a single node adds very little overhead to the overall system.
2. Flexibility. Users can design any kind of map and reduce functions to perform a variety of analysis tasks, and a pipeline of map and reduce functions can be constructed to achieve sophisticated tasks.
3. Data locality. The scheduler tries to run each map task on a machine that holds a local replica of its input split, and intermediate results are written to the map workers' local disks.
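The pipelining of map and reduce functions mentioned in point 2 can be sketched by chaining two rounds, where the output of one round becomes the input of the next. The `mapreduce` helper below is an illustrative in-memory driver of my own, not the paper's implementation:

```python
from collections import defaultdict

def mapreduce(records, map_fn, reduce_fn):
    # Minimal driver: apply map_fn to each (key, value) record,
    # group the intermediate pairs by key, then reduce each group.
    groups = defaultdict(list)
    for k, v in records:
        for ik, iv in map_fn(k, v):
            groups[ik].append(iv)
    return [reduce_fn(k, vs) for k, vs in groups.items()]

# Round 1: word count over two input lines.
counts = mapreduce(
    enumerate(["a b a", "b c"]),
    lambda _k, line: [(w, 1) for w in line.split()],
    lambda w, ones: (w, sum(ones)),
)

# Round 2: feed round 1's output back in to build a histogram
# of how many words occur with each count value.
hist = mapreduce(
    counts,
    lambda word, n: [(n, 1)],
    lambda n, ones: (n, sum(ones)),
)
# counts: a->2, b->2, c->1; hist: two words appear twice, one word once.
```

As the weakness below notes, in the real system each round's output would be materialized to disk before the next round reads it.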
One weakness is that MapReduce is very I/O intensive. The results of the map functions are materialized, involving heavy disk I/O. Moreover, in a sequence of MapReduce jobs, the output of the previous round is always written to disk, even though it will soon be read back for the next round of computation.