BigData / Hadoop MapReduce

1. What is Hadoop MapReduce?

MapReduce is a framework for processing large amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner.

MapReduce follows a master-slave architecture: a job is divided into subtasks that run in parallel on commodity hardware. Data moves through a job as key-value pairs, in the form shown below. Hadoop serializes keys and values with its own Writable interface rather than standard Java serialization.

<key, value>
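
As a rough illustration of the key-value model, the sketch below mimics how Hadoop's default text input presents a file to the mappers: the key is the byte offset of a line and the value is the line itself. The helper `text_records` is a hypothetical name used only for this example; it is not part of any Hadoop API, and a real job would carry these pairs in Writable types.

```python
def text_records(data: bytes):
    """Yield (byte_offset, line) pairs for line-oriented input."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        # The key is where the line starts; the value is the line without its newline.
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


sample = b"hello hadoop\nhello world\n"
records = list(text_records(sample))
print(records)  # [(0, 'hello hadoop'), (13, 'hello world')]
```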

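2. Explain the MapReduce process.

In classic MapReduce (Hadoop 1.x), two main daemons run a job. From Hadoop 2 onward, YARN takes over resource management from these daemons.

  • JobTracker (Master).
  • TaskTracker (Slave).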

The JobTracker manages the overall MapReduce process in the Hadoop ecosystem. Its main functions are to manage resources and to monitor the progress and status of the TaskTrackers.

A TaskTracker runs the tasks assigned to it by the JobTracker. It also sends regular updates to the JobTracker, reporting the status of its tasks and the number of free slots it has available.
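
The exchange between the two daemons can be pictured with a toy model. This is only a sketch under simplifying assumptions: the class and method names (`heartbeat`, `on_heartbeat`, `launch`) are invented for illustration and do not correspond to Hadoop's actual classes, and real scheduling also considers data locality, failures, and reduce slots.

```python
from collections import deque


class TaskTracker:
    def __init__(self, name, slots):
        self.name = name
        self.free_slots = slots
        self.running = []

    def heartbeat(self):
        """Report this tracker's name and its number of free slots."""
        return {"tracker": self.name, "free_slots": self.free_slots}

    def launch(self, task):
        """Occupy one slot with the given task."""
        self.free_slots -= 1
        self.running.append(task)


class JobTracker:
    def __init__(self, tasks):
        self.pending = deque(tasks)
        self.assignments = {}

    def on_heartbeat(self, tracker, report):
        """Hand out pending tasks, at most one per free slot reported."""
        for _ in range(report["free_slots"]):
            if not self.pending:
                break
            task = self.pending.popleft()
            tracker.launch(task)
            self.assignments[task] = tracker.name


job_tracker = JobTracker(["map-0", "map-1", "map-2"])
trackers = [TaskTracker("tt-1", slots=2), TaskTracker("tt-2", slots=2)]
for tt in trackers:
    job_tracker.on_heartbeat(tt, tt.heartbeat())

print(job_tracker.assignments)
```

In this run the first tracker fills both of its slots and the second takes the remaining task, leaving it with one free slot to report on its next update.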

3. Explain the steps involved in Hadoop MapReduce Process.

A MapReduce job involves multiple steps to achieve fully parallel execution. The job executes the following steps in sequential order.

  • Mapper.
  • Shuffle and Sorting.
  • Reducer.
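
The three steps above can be sketched end to end in plain Python with a word count. This runs in a single process and only imitates the data flow; the function names `map_phase`, `shuffle_and_sort`, and `reduce_phase` are illustrative and not Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_and_sort(pairs):
    """Group values by key, then order the groups by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single result."""
    return key, sum(values)


lines = ["the quick brown fox", "the lazy dog", "the fox"]
intermediate = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle_and_sort(intermediate)
result = [reduce_phase(key, values) for key, values in grouped]
print(result)
# [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]
```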

4. Explain Mapper in Hadoop MapReduce.

In MapReduce, each TaskTracker runs Mapper tasks. Running many of these tasks at the same time, each on a different part of the input, is what gives a MapReduce job its parallelism. The output of a mapper is a set of intermediate <key, value> pairs.

The mapper output is not stored in HDFS. Instead, it is written to the local file system of the node that ran the task, where the MapReduce framework reads it for the shuffle and sort of the mapper output.
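
A minimal sketch of the mapper side, assuming a made-up input format of "path status" per line: each map task handles one input split, and a mapper may emit zero or more pairs per record. A thread pool stands in for tasks running on different nodes, and the in-memory lists stand in for output kept on each node's local disk. All names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def mapper(line):
    """Emit (status_code, 1) for a well-formed line, nothing otherwise."""
    parts = line.split()
    if len(parts) == 2 and parts[1].isdigit():
        yield parts[1], 1


def run_map_task(split):
    """One map task processes one input split and keeps its own output."""
    return [pair for line in split for pair in mapper(line)]


splits = [
    ["/index.html 200", "/missing 404"],
    ["garbage", "/about.html 200"],
]

# One task per split; the tasks are independent, so they can run concurrently.
with ThreadPoolExecutor(max_workers=len(splits)) as pool:
    map_outputs = list(pool.map(run_map_task, splits))

print(map_outputs)  # [[('200', 1), ('404', 1)], [('200', 1)]]
```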

5. What is Shuffle and Sorting in Hadoop MapReduce?

Shuffle and sorting are the intermediate steps between the Mapper and Reducer steps of a MapReduce job.

The shuffle process transfers the mapper output to the reducers and groups it by key, so that all values for a given key reach the same reducer.

The sorting process orders the data by key, using the natural ordering of the key type.
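
The following sketch shows one way to picture these two steps. Each pair is routed to a reducer by hashing its key modulo the number of reducers, and each reducer's input is then sorted and grouped by key. The use of `zlib.crc32` is an assumption made to get a stable hash in Python; Hadoop's default partitioner uses the key's own hash code instead.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def partition(key, num_reducers):
    """Pick a reducer for a key; the same key always maps to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Route every (key, value) pair from every mapper to its reducer's partition."""
    partitions = [[] for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            partitions[partition(key, num_reducers)].append((key, value))
    return partitions


def sort_and_group(pairs):
    """Sort a partition by key and collect the values of each key into a list."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


map_outputs = [[("b", 1), ("a", 1)], [("a", 1), ("c", 1)]]
reducer_inputs = [sort_and_group(p) for p in shuffle(map_outputs, num_reducers=2)]
for index, grouped in enumerate(reducer_inputs):
    print(index, grouped)
```

Which reducer receives which key depends on the hash, but both values for "a" always end up together in one partition.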

6. What is Reducer in Hadoop MapReduce?

After the shuffle and sort, the grouped map output is sent to the Reducer, which generates the final output.

The Reducer aggregates the data according to the logic provided in the Reducer class, then consolidates and stores the final output in HDFS.
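
As a small sketch of the reducer side: the reducer is called once per key with all of that key's values, and each result is written as a tab-separated line. The `io.StringIO` buffer is a stand-in for an output file in HDFS, and the function names are illustrative only.

```python
import io


def reducer(key, values):
    """Aggregate all values for one key; here the logic is a simple sum."""
    yield key, sum(values)


def run_reduce_task(grouped_input, out):
    """Call the reducer once per key and write tab-separated output lines."""
    for key, values in grouped_input:
        for out_key, out_value in reducer(key, values):
            out.write(f"{out_key}\t{out_value}\n")


grouped_input = [("200", [1, 1]), ("404", [1])]
out = io.StringIO()  # stands in for an output file in HDFS
run_reduce_task(grouped_input, out)
print(out.getvalue(), end="")
```

Swapping `sum` for `max`, a count, or an average changes the aggregation without touching the rest of the job.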

7. Explain Hadoop streaming.

The Hadoop distribution includes a generic utility for writing Map and Reduce jobs in any preferred programming language, such as Java, Python, Perl, or Ruby. This is referred to as Hadoop Streaming. Users can create and run jobs with any shell script or executable as the Mapper or Reducer; the program reads its input from standard input and writes its output to standard output.

The Hadoop Streaming utility ships as part of the Hadoop distribution.
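
A streaming word count might look like the sketch below. The mapper and reducer only read lines from one text stream and write lines to another, which is all Hadoop Streaming asks of a program. Unlike a Java reducer, a streaming reducer receives sorted lines rather than grouped values, so it has to detect key boundaries itself. The `simulate` helper is a hypothetical local stand-in for the framework's sort step.

```python
import io
from itertools import groupby


def stream_mapper(stdin, stdout):
    """Write one 'word<TAB>1' line per word read from the input stream."""
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")


def stream_reducer(stdin, stdout):
    """Sum the counts of consecutive lines that share the same key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        stdout.write(f"{word}\t{total}\n")


def simulate(text):
    """Run mapper, sort, and reducer locally, imitating the streaming data flow."""
    mapped = io.StringIO()
    stream_mapper(io.StringIO(text), mapped)
    sorted_lines = sorted(mapped.getvalue().splitlines(keepends=True))
    reduced = io.StringIO()
    stream_reducer(io.StringIO("".join(sorted_lines)), reduced)
    return reduced.getvalue()


print(simulate("hello hadoop\nhello world\n"), end="")
```

In a real job, the two functions would live in separate scripts that read `sys.stdin` and write `sys.stdout`, and the scripts would be passed to the streaming jar through its `-mapper` and `-reducer` options alongside `-input` and `-output`. The location of the jar varies between distributions.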
