Hadoop MapReduce Interview Questions & Answers (2026)
The most commonly asked Hadoop MapReduce questions at data engineer interviews — with clear answers, real examples, and interview tips to crack your next Big Data role.
📋 What You'll Learn
Hadoop remains a foundational technology in Big Data pipelines. Even with Spark dominating batch processing, interviewers still test your Hadoop fundamentals — especially for senior and lead data engineer roles. Here are the questions you must know.
1. MapReduce Core Concepts
1. Input Split — Input data is divided into logical splits, typically one per HDFS block (128 MB by default).
2. Map Phase — Each split is processed by a Mapper, which emits (key, value) pairs.
3. Shuffle & Sort — All values for the same key are grouped and sorted before reaching the Reducer.
4. Reduce Phase — The Reducer processes each key with its list of values and writes the final output.
5. Output — Results are written to HDFS.
Interview tip: Always mention the Shuffle & Sort phase — many candidates forget it, yet it's the most expensive step.
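The five phases above can be demonstrated with a minimal plain-Java simulation of word count (no Hadoop dependency; the class and method names here are illustrative, not Hadoop APIs):

```java
import java.util.*;

// Plain-Java sketch of the MapReduce phases for word count.
public class MapReduceFlow {

    // Map phase: each "split" (here, one line of text) emits (word, 1) pairs.
    static List<Map.Entry<String, Integer>> mapPhase(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : split.split("\\s+")) {
            out.add(Map.entry(word, 1));
        }
        return out;
    }

    // Shuffle & Sort: group all values by key, with keys in sorted order.
    static SortedMap<String, List<Integer>> shuffleSort(List<Map.Entry<String, Integer>> mapOutput) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapOutput) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the value list for each key.
    static Map<String, Integer> reducePhase(SortedMap<String, List<Integer>> grouped) {
        Map<String, Integer> result = new LinkedHashMap<>();
        grouped.forEach((k, vs) -> result.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String split : new String[]{"deer bear river", "car car river"}) {
            mapped.addAll(mapPhase(split));
        }
        // {bear=1, car=2, deer=1, river=2}
        System.out.println(reducePhase(shuffleSort(mapped)));
    }
}
```

Note how the grouping and sorting happens between the map and reduce calls: that is exactly the Shuffle & Sort step candidates forget.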
A Combiner is a mini-Reducer that runs on the Map side to shrink the data sent over the network.
When to use it: Only when the operation is both commutative (order doesn't matter) and associative (grouping doesn't matter) — like SUM, MIN, MAX, COUNT.
When NOT to use it: For AVERAGE — a local average of averages is not the same as the global average. Emit SUM and COUNT separately instead, and divide once in the Reducer.
# Word Count — Combiner is safe here
Map output:      (word, 1), (word, 1), (word, 1)
After Combiner:  (word, 3)      ← less data over the network
Reducer gets:    (word, [3])    ← instead of (word, [1, 1, 1])
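The AVERAGE pitfall is easy to demonstrate in plain Java (class and method names are illustrative):

```java
// Why a Combiner must not compute AVERAGE directly: averaging per-mapper
// averages gives the wrong global answer, while carrying partial (sum, count)
// pairs to the Reducer stays correct.
public class CombinerAverage {

    static double avg(int[] xs) {
        long sum = 0;
        for (int v : xs) sum += v;
        return (double) sum / xs.length;
    }

    // WRONG: average of the two local averages.
    static double averageOfAverages(int[] split1, int[] split2) {
        return (avg(split1) + avg(split2)) / 2.0;
    }

    // RIGHT: combine partial sums and counts, divide once at the end.
    static double sumCountAverage(int[] split1, int[] split2) {
        long sum = 0, count = 0;
        for (int v : split1) { sum += v; count++; }
        for (int v : split2) { sum += v; count++; }
        return (double) sum / count;
    }

    public static void main(String[] args) {
        int[] a = {10, 20};   // local average = 15
        int[] b = {30};       // local average = 30
        System.out.println(averageOfAverages(a, b)); // 22.5 -- wrong
        System.out.println(sumCountAverage(a, b));   // 20.0 -- correct
    }
}
```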
The default HashPartitioner uses hash(key) % numReducers to distribute keys evenly.
Custom Partitioner use case: If you want all records for the same region or category to go to the same Reducer (for sorted output or reporting), you write a custom Partitioner.
public class RegionPartitioner extends Partitioner&lt;Text, IntWritable&gt; {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Assumes the job runs with 3 Reducers (job.setNumReduceTasks(3)),
        // since the returned partition number must be < numReduceTasks.
        if (key.toString().startsWith("US")) return 0;
        if (key.toString().startsWith("EU")) return 1;
        return 2; // everything else
    }
}
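For comparison, the default hash-based behavior can be sketched in plain Java; the sign-bit mask keeps a negative hashCode() from producing a negative partition number:

```java
// Plain-Java sketch of default hash partitioning: mask the sign bit of
// hashCode(), then take the result modulo the number of Reducers.
public class HashPartitionDemo {

    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 3;
        for (String key : new String[]{"US-east", "EU-west", "APAC"}) {
            System.out.println(key + " -> reducer " + getPartition(key, reducers));
        }
    }
}
```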
InputFormat — Defines how input files are divided into InputSplits and supplies the RecordReader for each split.
RecordReader — Reads the actual records from a split and converts them into key-value pairs for the Mapper.
Common InputFormats: TextInputFormat (default, one line = one record), SequenceFileInputFormat, KeyValueTextInputFormat, NLineInputFormat.
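With TextInputFormat, the record key is the byte offset of each line and the value is the line text. A rough plain-Java sketch of that behavior (illustrative only; it ignores split boundaries and multi-byte encodings):

```java
import java.util.*;

// Sketch of what a line-oriented RecordReader produces: for each line,
// key = byte offset of the line start, value = the line's text.
public class LineRecordReaderDemo {

    static List<Map.Entry<Long, String>> readRecords(String splitContents) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        long offset = 0;
        for (String line : splitContents.split("\n", -1)) {
            records.add(Map.entry(offset, line));
            offset += line.getBytes().length + 1; // +1 for the newline byte
        }
        return records;
    }

    public static void main(String[] args) {
        readRecords("first line\nsecond line").forEach(r ->
            System.out.println(r.getKey() + " -> " + r.getValue()));
        // 0 -> first line
        // 11 -> second line
    }
}
```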
1. Partitions map output by key (using the Partitioner).
2. Spills data to local disk when the in-memory buffer (100 MB default) fills up.
3. Merges multiple spill files into sorted, partitioned files.
4. Copies the relevant partition to each Reducer node over the network (this is the actual "shuffle").
5. Sorts all received data by key before passing to the Reducer.
The shuffle phase is network-intensive and is often the main bottleneck in MapReduce jobs.
2. HDFS Architecture Questions
NameNode (Master): Stores the filesystem metadata: the file-to-block mapping, block locations, and permissions. It does NOT store the file data itself.
DataNode (Worker): Stores the actual data blocks. Reports block health to the NameNode via heartbeats every 3 seconds.
Single Point of Failure: In Hadoop 1.x, NameNode failure meant total cluster failure. Hadoop 2.x introduced HA NameNode with Active/Standby setup using ZooKeeper.
• 1st replica — same node as the writer
• 2nd replica — different rack
• 3rd replica — same rack as 2nd but different node
This balances fault tolerance with network efficiency. If one DataNode fails, data is still available on 2 other nodes.
Why so large? To minimize seek time as a proportion of transfer time. With large files, disk seek time is negligible compared to transfer time. Large blocks also mean fewer metadata entries in the NameNode, reducing memory pressure on the master node.
Gotcha question: "What if a file is 50 MB?" — It still takes only one block. HDFS does NOT waste the remaining 78 MB on disk. The block only uses the actual file size.
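The block arithmetic behind that gotcha answer is worth being able to do on a whiteboard; a small sketch (the 128 MB constant is the HDFS default, configurable via dfs.blocksize):

```java
// Block-count arithmetic: a file occupies ceil(size / blockSize) blocks,
// and the last block only consumes the actual remaining bytes on disk.
public class BlockMath {

    static final long BLOCK_SIZE_MB = 128;

    static long numBlocks(long fileSizeMb) {
        return (fileSizeMb + BLOCK_SIZE_MB - 1) / BLOCK_SIZE_MB; // ceiling division
    }

    static long lastBlockSizeMb(long fileSizeMb) {
        long rem = fileSizeMb % BLOCK_SIZE_MB;
        return rem == 0 ? BLOCK_SIZE_MB : rem;
    }

    public static void main(String[] args) {
        System.out.println(numBlocks(50));        // 1 block
        System.out.println(lastBlockSizeMb(50));  // 50 MB actually used, not 128
        System.out.println(numBlocks(300));       // 3 blocks (128 + 128 + 44)
    }
}
```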
3. YARN & Resource Management
ResourceManager (RM): The cluster-wide master. Arbitrates resources among all applications and tracks the NodeManagers.
NodeManager (NM): Runs on each worker node. Reports resource usage (CPU, memory) to the ResourceManager and manages containers on its node.
ApplicationMaster (AM): One per application. Negotiates resources with the ResourceManager and works with NodeManagers to run and monitor tasks.
Container: The unit of resource allocation — a bundle of CPU + memory on a specific node.
| Scheduler | Best For | Key Feature |
|---|---|---|
| FIFO | Dev/test clusters | Simple queue, first-come-first-served |
| Capacity | Multi-tenant orgs | Dedicated % of cluster per team/queue |
| Fair | Mixed workloads | Dynamic sharing — idle resources lent to others |
4. Performance & Optimization Questions
Data skew means a few Reducers receive far more data than the rest and become stragglers.
Common causes: a few very frequent keys (e.g., NULL, popular categories) and poor partitioning logic.
Fixes:
• Use a Salting technique — append a random suffix to keys before Map, then strip it in Reduce to re-group.
• Use a custom Partitioner to redistribute heavy keys.
• Use a Combiner to reduce data volume before shuffle.
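The salting fix can be sketched in plain Java (class and method names are illustrative; the `#` separator and SALTS value are arbitrary choices):

```java
import java.util.*;

// Salting for data skew: a hot key like "NULL" is spread over SALTS salted
// keys at map time so several Reducers share the load; a second pass strips
// the salt and merges the partial aggregates back together.
public class SaltingDemo {

    static final int SALTS = 4;
    static final Random RAND = new Random();

    // Map side of pass 1: append a random salt so one hot key fans out.
    static String saltKey(String key) {
        return key + "#" + RAND.nextInt(SALTS);
    }

    // Pass 2: strip the salt and merge the partial counts per original key.
    static Map<String, Integer> unsaltAndMerge(Map<String, Integer> saltedCounts) {
        Map<String, Integer> merged = new HashMap<>();
        saltedCounts.forEach((saltedKey, count) -> {
            String original = saltedKey.substring(0, saltedKey.lastIndexOf('#'));
            merged.merge(original, count, Integer::sum);
        });
        return merged;
    }

    public static void main(String[] args) {
        // Pass 1 produced partial counts under salted keys:
        Map<String, Integer> partial =
            Map.of("NULL#0", 250, "NULL#1", 300, "NULL#2", 200, "US#0", 40);
        System.out.println(unsaltAndMerge(partial)); // NULL=750, US=40
    }
}
```

The trade-off: salting requires a second aggregation pass, so it only pays off when the skew is severe.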
Speculative execution launches a duplicate copy of a slow task on another node and keeps whichever finishes first.
When to DISABLE it:
• When tasks write to external systems (databases, APIs) — duplicated writes cause data corruption.
• When tasks are intentionally slow (large I/O, machine learning training).
• On heterogeneous clusters where slower nodes are expected.
mapreduce.map.speculative=false
mapreduce.reduce.speculative=false
1. Reduce input data — Use compression (Snappy, LZO) and columnar formats (ORC, Parquet).
2. Add a Combiner — Reduce shuffle data volume.
3. Increase block size — Fewer splits = fewer Map tasks = less scheduling overhead.
4. Tune memory settings — mapreduce.map.memory.mb, mapreduce.reduce.memory.mb.
5. Compress intermediate data — mapreduce.map.output.compress=true.
6. Fix data skew — Salting or a custom Partitioner.
7. Reduce the number of Reducers — Each Reducer requires a merge and sort; fewer is sometimes faster.
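Items 4 and 5 are set through standard MapReduce configuration properties. A sketch of the relevant mapred-site.xml entries (the values shown are illustrative; tune them to your cluster):

```xml
<!-- mapred-site.xml (illustrative values) -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>4096</value>
</property>
```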
5. Scenario-Based Questions
Job 1 — Count hits per URL:
• Mapper: parse each log line, emit (url, 1).
• Combiner: locally sum counts.
• Reducer: sum all counts per URL → emit (url, total_count).
Job 2 — Find Top 10:
• Mapper: swap key and value → emit (total_count, url).
• Set 1 Reducer with a descending sort comparator.
• Reducer: output the first 10 records.
Bonus point: Mention that Spark would do this in a single job with reduceByKey + top(10), making it faster for this use case.
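The two-job flow can be simulated in plain Java to check your logic before the interview (no Hadoop dependency; class and method names are illustrative):

```java
import java.util.*;
import java.util.stream.*;

// Plain-Java sketch of the two-job pattern: Job 1 aggregates hit counts per
// URL; Job 2 sorts by count descending and keeps the first 10 rows.
public class TopUrls {

    // Job 1: (url, 1) pairs reduced to (url, total_count).
    static Map<String, Long> countHits(List<String> urls) {
        return urls.stream()
                   .collect(Collectors.groupingBy(u -> u, Collectors.counting()));
    }

    // Job 2: sort by total_count descending, take the first 10.
    static List<Map.Entry<String, Long>> top10(Map<String, Long> counts) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(10)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> log = List.of("/home", "/home", "/cart", "/home", "/cart", "/about");
        top10(countHits(log)).forEach(e -> System.out.println(e.getKey() + " " + e.getValue()));
        // /home 3
        // /cart 2
        // /about 1
    }
}
```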
1. Check YARN ResourceManager UI → find the stuck task → check which node it's on.
2. Check if it's a data skew issue — one Reducer handling a very large key.
3. Check for disk space issues on the Reducer node (spill files filling /tmp).
4. Check GC pressure — excessive Java garbage collection on that task.
5. Check for network issues — shuffle copy stalling.
Solution options: Kill the task and let speculative execution take over, or fix the skew and rerun.
🎯 Ready to Practice 300+ Big Data Questions?
Take our free interactive quizzes on Hadoop, DSA, SQL, PySpark and Networking — no signup required. Then grab the full question bank for offline study.
Take the Free Quiz → Get the 300Q PDF Bundle