Paper Summary II #
by Lei Yang (lyang423) @2023/04/01
Title: MapReduce: Simplified Data Processing on Large Clusters
Authors: Jeffrey Dean and Sanjay Ghemawat
Introduction #
Large-scale computational tasks often involve a two-step operation. The first step, the map operation, computes intermediate key/value pairs from the input records. The second step, the reduce operation, merges all values that share the same intermediate key to produce the desired result.
The paper's goal is to provide a simple yet powerful interface that enables automatic parallelization and distribution of such computations, so that they achieve high performance on large clusters of commodity PCs while the library hides the messy details of parallelization, fault tolerance, data distribution, and load balancing.
Programming Model #
The model expresses the computation as two functions: Map and Reduce.
- Map: processes an input key/value pair and produces a set of intermediate key/value pairs; the library groups all intermediate values associated with the same key and passes them to the Reduce function.
- Reduce: receives an intermediate key and its set of values and merges them into a possibly smaller set of values (typically zero or one output value per invocation). The paper gives several examples, including distributed grep, count of URL access frequency, reverse web-link graph, etc.; a word-count sketch is shown below.
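To make the model concrete, here is a minimal word-count sketch in Python. It mirrors the word-count pseudocode from the paper but is not the library's actual C++ interface; the function names and the tiny in-memory driver are illustrative only.

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # Map: emit an intermediate (word, 1) pair for every word in the document.
    for word in contents.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: sum all counts emitted for the same word.
    yield word, sum(counts)

def run_word_count(documents):
    # Tiny sequential driver: group intermediate pairs by key, then reduce each group.
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for key, value in map_fn(name, contents):
            intermediate[key].append(value)
    output = {}
    for key in sorted(intermediate):
        for k, total in reduce_fn(key, intermediate[key]):
            output[k] = total
    return output

print(run_word_count({"doc1": "the quick brown fox", "doc2": "the lazy dog"}))
# -> {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```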
Implementation #
- Overall workflow (a simplified sequential sketch of this flow appears after this list):
- Input files are split into M pieces (typically 16-64 MB) and multiple program copies are initiated on a cluster.
- One program copy acts as the master, assigning M map tasks and R reduce tasks to idle workers.
- Assigned map tasks read input splits, parse key/value pairs, and pass them to the user-defined Map function. Intermediate pairs are buffered in memory.
- Buffered pairs are periodically written to local disk, partitioned into R regions. The master forwards these locations to reduce workers.
- Reduce workers read and sort the intermediate data, grouping occurrences of the same key together.
- The reduce worker passes each unique key and corresponding values to the user’s Reduce function, appending output to the final output file.
- Once all tasks are complete, the master signals the user program, and the MapReduce call returns to the user code.
- The master keeps several data structures: for each map and reduce task it stores the state (idle, in-progress, or completed) and the identity of the worker machine executing it; the master also acts as the conduit through which the locations of intermediate file regions are propagated from map tasks to reduce tasks.
- MapReduce ensures fault tolerance through:
- Detecting worker failures via periodic pings and re-executing their tasks on other idle workers.
- Writing periodic checkpoints of the master's data structures so that a new master copy can resume from the last checkpointed state.
- Creating temporary output files and promoting them to final output files atomically to avoid inconsistencies.
- The master also takes the location of input files into account to save network bandwidth.
- In practice, the number of map tasks (M) and reduce tasks (R) should be much larger than the number of worker machines; this improves dynamic load balancing and speeds up recovery when a worker fails.
- To handle stragglers, the master schedules backup executions of the remaining in-progress tasks when the operation is close to completion. This increases the computational resources used by only a few percent but significantly reduces the time to complete large operations.
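The following is a minimal sequential sketch of the workflow above, reusing the style of `map_fn`/`reduce_fn` from the earlier word-count example; `default_partition`, `run_mapreduce`, and the choice of MD5 for hashing are illustrative, not what the library actually does internally.

```python
import hashlib
from itertools import groupby

def default_partition(key, R):
    # Default partitioning described in the paper: hash(key) mod R picks the reduce region.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % R

def run_mapreduce(records, map_fn, reduce_fn, M=4, R=2):
    # Split the input into M pieces (here simply by striding over an in-memory list).
    splits = [records[i::M] for i in range(M)]

    # Map phase: each "map task" partitions its intermediate pairs into R regions.
    regions = [[] for _ in range(R)]
    for split in splits:                      # one iteration == one map task
        for key, value in split:
            for k, v in map_fn(key, value):
                regions[default_partition(k, R)].append((k, v))

    # Reduce phase: each "reduce task" sorts its region, groups equal keys, and reduces.
    output = []
    for region in regions:                    # one iteration == one reduce task
        region.sort(key=lambda kv: kv[0])
        for k, group in groupby(region, key=lambda kv: kv[0]):
            output.extend(reduce_fn(k, [v for _, v in group]))
    return output
```

In the real system each region is a file on a map worker's local disk, the master tells reduce workers where to fetch those regions, and failed or straggling tasks are simply re-executed on other workers.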
Refinements #
- Partition Function: In addition to the default partitioning function (hash(key) mod R), the user can provide a special partitioning function to avoid imbalanced partitions.
- Ordering Guarantees: Within a given partition, the intermediate key/value pairs are processed in increasing key order, which makes it easy to produce a sorted output file per partition.
- Combiner Function: A user-specified function that partially aggregates intermediate key/value pairs on the map worker before the data is transferred to reduce workers, reducing network traffic and increasing efficiency (a sketch of this and of a custom partition function follows this list).
- Input and Output Types: Allowing users to specify custom input and output types, enabling the system to handle various data formats and sources.
- Side-effects: The library provides no atomic two-phase commit across multiple output files, so tasks that produce multiple output files with cross-file consistency requirements should be deterministic.
- Skipping Bad Records: Implementing a mechanism to skip over problematic input records, enhancing fault tolerance and ensuring successful job completion despite occasional corrupt or ill-formatted data.
- Local Execution: An alternative implementation of the MapReduce library sequentially executes all the work for a MapReduce operation on the local machine, for debugging, profiling, and small-scale testing.
- Status Information: Providing real-time status information about the ongoing computation, enabling users to monitor progress and diagnose issues.
- Counters: Offering user-defined counters to track global information during the MapReduce execution, facilitating debugging and optimization.
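As an illustration of two of these refinements (not the library's actual API), here is a hedged sketch of a combiner for the word-count example and of a user-supplied partition function that hashes only the hostname of a URL key, in the spirit of the paper's URL examples; the function names and the use of MD5 are made up for this summary.

```python
import hashlib
from collections import Counter
from urllib.parse import urlparse

def word_count_combiner(pairs):
    # Combiner: partially merge (word, 1) pairs on the map worker before they are
    # written to local disk, so far fewer pairs cross the network to reduce workers.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())

def hostname_partition(url_key, R):
    # Custom partition function: hash only the hostname, so all URLs from the same
    # host end up in the same output partition.
    host = urlparse(url_key).netloc
    return int(hashlib.md5(host.encode()).hexdigest(), 16) % R

print(word_count_combiner([("the", 1), ("the", 1), ("fox", 1)]))  # [('the', 2), ('fox', 1)]
print(hostname_partition("http://example.com/a", 4) == hostname_partition("http://example.com/b", 4))  # True
```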
Performance #
The MapReduce paper evaluates the system using large-scale computations on clusters containing thousands of machines. Key insights include:
- Scalability: MapReduce efficiently distributes tasks, handling increasing workloads without performance degradation.
- Fault Tolerance: The system effectively handles worker failures by re-executing tasks, ensuring job completion.
- Balancing and Stragglers: MapReduce addresses stragglers with backup tasks, minimizing their impact on overall completion time.
- Efficiency: The system demonstrates competitive or superior performance compared to alternative solutions, attributed to its automatic parallelization and built-in optimizations.
Experience #
MapReduce has been used across a wide range of domains at Google. One of its most significant uses is the rewrite of the production indexing system. The key benefits:
- The code that deals with fault tolerance, distribution, and parallelization is hidden by MapReduce.
- Conceptually unrelated computations are kept separate instead of being mixed together.
- The indexing process becomes easier to operate, because the MapReduce library handles machine failures, slow machines, networking hiccups, etc. without operator intervention.
Related Work #
The paper compares MapReduce with a variety of parallel programming models and systems, such as Bulk Synchronous Programming, MPI, River, BAD-FS, TACC, etc., to emphasize the key similarities and the differences MapReduce introduces, e.g., its locality optimization and backup-task mechanism.
Conclusions #
The MapReduce model is easy to use and can be applied to a large variety of problems. In practice, restricting the programming model makes it easy to parallelize and distribute computations. Network bandwidth is a scarce resource, so the system includes optimizations such as reading input data from local disks. Redundant execution can be used to reduce the impact of slow machines and to handle machine failures and data loss.