
Hadoop MapReduce Interview Questions & Answers (2026)

The most commonly asked Hadoop MapReduce questions in data engineering interviews — with clear answers, real examples, and tips to help you crack your next Big Data interview.

📅 Updated April 2026  |  ⏱ 12 min read  |  🎯 All Levels

Hadoop remains a foundational technology in Big Data pipelines. Even with Spark dominating batch processing, interviewers still test your Hadoop fundamentals — especially for senior and lead data engineer roles. Here are the questions you must know.

1. MapReduce Core Concepts

Q1. Explain the MapReduce execution flow end-to-end.
The MapReduce flow has 5 key phases:

1. Input Split — Input data is divided into fixed-size chunks (default 128 MB each).
2. Map Phase — Each split is processed by a Mapper which emits (key, value) pairs.
3. Shuffle & Sort — All values for the same key are grouped and sorted before reaching the Reducer.
4. Reduce Phase — Reducer processes each key with its list of values and writes final output.
5. Output — Results are written to HDFS.

Interview tip: Always mention the Shuffle & Sort phase — many candidates forget it, yet it's the most expensive step.
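For reference, here is a minimal word count sketch of the Map and Reduce phases (class names are illustrative; the signatures follow the org.apache.hadoop.mapreduce API):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: called once per line of the split, emits (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (token.isEmpty()) continue;           // skip leading-whitespace artifact
      word.set(token);
      context.write(word, ONE);                // emit (word, 1)
    }
  }
}

// Reduce phase: after Shuffle & Sort it receives (word, [1, 1, ...]) and sums.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) sum += c.get();
    context.write(word, new IntWritable(sum)); // emit (word, total)
  }
}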
Q2. What is a Combiner and when should you use it?
A Combiner is a local mini-reducer that runs on the Mapper node before data is transferred across the network. It reduces the volume of data sent to Reducers, saving significant network bandwidth and time.

When to use it: Only when the operation is both commutative (order doesn't matter) and associative (grouping doesn't matter) — like SUM, MIN, MAX, COUNT.

When NOT to use it: For AVERAGE — a local average of averages is not the same as the global average. Use SUM + COUNT separately instead.
# Word Count — Combiner is safe here
Map output:   (word, 1), (word, 1), (word, 1)
After Combiner: (word, 3)   ← less data over the network
Reducer gets: (word, [3])   ← instead of (word, [1, 1, 1])
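In the Java API, the combiner for a sum-style job is usually the reducer class itself. A minimal driver sketch, assuming the word count classes from Q1:

job.setMapperClass(WordCountMapper.class);
job.setCombinerClass(WordCountReducer.class);   // safe: SUM is commutative and associative
job.setReducerClass(WordCountReducer.class);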
Q3. What is a Partitioner in MapReduce? Can you write a custom one?
A Partitioner controls which Reducer receives which key-value pair from the Mapper output. The default HashPartitioner uses hash(key) % numReducers to distribute keys evenly.

Custom Partitioner use case: If you want all records for the same region or category to go to the same Reducer (for sorted output or reporting), you write a custom Partitioner.
public class RegionPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Route US-prefixed keys to reducer 0 and EU-prefixed keys to reducer 1;
    // everything else lands on reducer 2. The job must be configured with
    // job.setNumReduceTasks(3) for these partition numbers to be valid.
    if (key.toString().startsWith("US")) return 0;
    if (key.toString().startsWith("EU")) return 1;
    return 2; // rest of the world
  }
}
Register it in the driver with job.setPartitionerClass(RegionPartitioner.class).
Q4. What is the difference between InputFormat and RecordReader?
InputFormat — Decides how input files are split (getSplits) and which RecordReader to use.
RecordReader — Reads the actual records from a split and converts them into key-value pairs for the Mapper.

Common InputFormats: TextInputFormat (default, one line = one record), SequenceFileInputFormat, KeyValueTextInputFormat, NLineInputFormat.
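Switching formats is a one-line driver change. A hedged sketch (the 1000-line figure is arbitrary):

// Treat each tab-separated line as (key, value) instead of (byte offset, line):
job.setInputFormatClass(KeyValueTextInputFormat.class);

// Or give every Mapper exactly 1000 input lines, regardless of block size:
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 1000);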
Q5. What happens during the Shuffle and Sort phase?
After Mappers complete, the framework:

1. Partitions map output by key (using the Partitioner).
2. Spills data to local disk when the in-memory sort buffer (100 MB by default) passes its spill threshold (80% full).
3. Merges multiple spill files into sorted, partitioned files.
4. Copies the relevant partition to each Reducer node over the network (this is the actual "shuffle").
5. Sorts all received data by key before passing to the Reducer.

The shuffle phase is network-intensive and is often the main bottleneck in MapReduce jobs.
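Shuffle behavior is controlled by a handful of well-known properties; the values below are illustrative, not recommendations:

mapreduce.task.io.sort.mb=256                 # map-side sort buffer (default 100)
mapreduce.map.sort.spill.percent=0.80         # buffer fill ratio that triggers a spill
mapreduce.task.io.sort.factor=50              # spill files merged at once (default 10)
mapreduce.reduce.shuffle.parallelcopies=10    # parallel fetch threads per reducer (default 5)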

2. HDFS Architecture Questions

💡 Why HDFS Questions Matter: HDFS is the storage backbone of Hadoop. Interviewers frequently ask about NameNode, DataNode, replication, and fault tolerance — especially for senior roles.
Q6. What is the role of NameNode vs DataNode in HDFS?
NameNode (Master): Stores the filesystem metadata — directory tree, file names, block locations, permissions. It does NOT store actual data. It runs on a dedicated, high-memory machine.

DataNode (Worker): Stores the actual data blocks. Sends a heartbeat to the NameNode every 3 seconds to signal liveness, plus periodic block reports describing the blocks it holds.

Single Point of Failure: In Hadoop 1.x, NameNode failure meant total cluster failure. Hadoop 2.x introduced HA NameNode with Active/Standby setup using ZooKeeper.
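On an HA cluster you can check which NameNode is currently active from the shell (nn1 and nn2 are whatever service IDs your hdfs-site.xml defines):

hdfs haadmin -getServiceState nn1   # prints "active" or "standby"
hdfs haadmin -getServiceState nn2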
Q7. What is HDFS replication and what is the default replication factor?
HDFS stores each block on 3 different DataNodes by default (replication factor = 3). The placement follows the Rack Awareness policy:

• 1st replica — same node as the writer
• 2nd replica — different rack
• 3rd replica — same rack as 2nd but different node

This balances fault tolerance with network efficiency. If one DataNode fails, data is still available on 2 other nodes.
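The replication factor can also be changed at runtime per file or directory (the path below is hypothetical):

hdfs dfs -setrep -w 2 /data/archive/   # -w waits until re-replication finishes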
Q8. What is the default HDFS block size and why is it so large?
Default block size is 128 MB (Hadoop 2.x+), up from 64 MB in Hadoop 1.x.

Why so large? To minimize seek time as a proportion of transfer time. With large files, disk seek time is negligible compared to transfer time. Large blocks also mean fewer metadata entries in the NameNode, reducing memory pressure on the master node.

Gotcha question: "What if a file is 50 MB?" — It still takes only one block. HDFS does NOT waste the remaining 78 MB on disk. The block only uses the actual file size.
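Block size is a per-file property set at write time; the cluster default lives in hdfs-site.xml (the 256 MB value is just an example):

dfs.blocksize=268435456                                  # 256 MB default for new files, in bytes
hdfs dfs -D dfs.blocksize=268435456 -put big.log /data/  # one-off override at write time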

3. YARN & Resource Management

Q9. What are the main components of YARN?
ResourceManager (RM): The cluster master. Manages all resources and schedules jobs. Has two parts: Scheduler (resource allocation) and ApplicationsManager (tracks running applications).

NodeManager (NM): Runs on each worker node. Reports resource usage (CPU, memory) to the ResourceManager and manages containers on its node.

ApplicationMaster (AM): One per application. Negotiates resources with the ResourceManager and works with NodeManagers to run and monitor tasks.

Container: The unit of resource allocation — a bundle of CPU + memory on a specific node.
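You can watch these components from the YARN CLI (the application ID is a placeholder):

yarn node -list                             # NodeManagers and running container counts
yarn application -list                      # applications and their ApplicationMaster state
yarn application -status <application-id>   # resources held by one application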
Q10. What are the YARN schedulers and when would you choose each?

Scheduler | Best For          | Key Feature
FIFO      | Dev/test clusters | Simple queue, first-come-first-served
Capacity  | Multi-tenant orgs | Dedicated % of cluster per team/queue
Fair      | Mixed workloads   | Dynamic sharing — idle resources lent to others

Production clusters almost always use the Capacity Scheduler (the default in Apache Hadoop and HDP) or the Fair Scheduler (the default in CDH).
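The scheduler is chosen in yarn-site.xml; the class names below are the stock Hadoop implementations:

yarn.resourcemanager.scheduler.class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
# or, for the Fair Scheduler:
yarn.resourcemanager.scheduler.class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler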

4. Performance & Optimization Questions

Q11. What causes data skew in MapReduce and how do you fix it?
Data skew happens when some Reducers get far more data than others — causing some tasks to finish in minutes while others take hours, blocking the entire job.

Common causes: Highly frequent keys (e.g., NULL, popular categories), poor partitioning logic.

Fixes:
• Use a Salting technique — append a random suffix to keys before Map, then strip it in Reduce to re-group (see the sketch after this list).
• Use a custom Partitioner to redistribute heavy keys.
• Use a Combiner to reduce data volume before shuffle.
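A minimal sketch of the salting idea on the Map side. The SALT_BUCKETS value and the tab-separated key parsing are assumptions for illustration; a follow-up job (or the Reducer) must strip the suffix to re-aggregate:

import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Spread one hot key across several reducers by appending a random bucket id.
public class SaltingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final int SALT_BUCKETS = 10;   // tune to the observed skew
  private static final IntWritable ONE = new IntWritable(1);
  private final Random random = new Random();
  private final Text saltedKey = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Hypothetical parsing: first tab-separated field is the (possibly hot) key.
    String key = line.toString().split("\t")[0];
    // "NULL" becomes "NULL_0" .. "NULL_9", fanning out across up to 10 reducers.
    saltedKey.set(key + "_" + random.nextInt(SALT_BUCKETS));
    context.write(saltedKey, ONE);
  }
}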
Q12. What is speculative execution and when would you disable it?
Speculative execution detects slow (straggler) tasks and launches duplicate instances on other nodes, using whichever finishes first. It prevents one slow machine from delaying the entire job.

When to DISABLE it:
• When tasks write to external systems (databases, APIs) — duplicated writes cause data corruption.
• When tasks are intentionally slow (large I/O, machine learning training).
• On heterogeneous clusters where slower nodes are expected.

mapreduce.map.speculative=false
mapreduce.reduce.speculative=false
Q13. How do you optimize a MapReduce job that is too slow?
A structured approach interviewers love:

1. Reduce input data — Use compression (Snappy, LZO), columnar formats (ORC, Parquet).
2. Add a Combiner — Reduce shuffle data volume.
3. Increase block size — Fewer splits = fewer Map tasks overhead.
4. Tune memory settings — mapreduce.map.memory.mb, mapreduce.reduce.memory.mb.
5. Use compression on intermediate data — mapreduce.map.output.compress=true (see the snippet after this list).
6. Fix data skew — Salting or custom Partitioner.
7. Reduce the number of Reducers — each Reducer requires a merge+sort; fewer is sometimes faster.
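For steps 4–5, the relevant properties look like this (values are illustrative; Snappy requires the native codec installed on the cluster):

mapreduce.map.memory.mb=2048
mapreduce.reduce.memory.mb=4096
mapreduce.map.output.compress=true
mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec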

5. Scenario-Based Questions

⚠️ Interview Reality: Senior interviews almost always include at least one scenario question. These test whether you can apply concepts, not just recite them.
Q14. You have a 2 TB log file in HDFS. How would you find the top 10 most-visited URLs using MapReduce?
Job 1 — Count visits per URL:
• Mapper: parse each log line, emit (url, 1).
• Combiner: locally sum counts.
• Reducer: sum all counts per URL → emit (url, total_count).

Job 2 — Find Top 10:
• Mapper: swap key and value → emit (total_count, url).
• Set a single Reducer with a descending sort comparator so the largest counts arrive first (see the sketch below).
• Reducer: output first 10 records.
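A hedged sketch of the Job 2 sort trick (the class name is illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Sorts IntWritable keys high-to-low, so the Reducer sees the largest counts first.
public class DescendingCountComparator extends WritableComparator {
  protected DescendingCountComparator() {
    super(IntWritable.class, true);   // true = instantiate keys for compare()
  }

  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    return -super.compare(a, b);      // invert the natural ascending order
  }
}

// In the Job 2 driver:
//   job.setSortComparatorClass(DescendingCountComparator.class);
//   job.setNumReduceTasks(1);        // one Reducer sees a globally sorted stream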

Bonus point: Mention that Spark would do this in a single job with reduceByKey + top(10), making it faster for this use case.
Q15. Your MapReduce job completes 99% and then hangs. What would you investigate?
This is a classic straggler/reducer problem. Steps to investigate:

1. Check YARN ResourceManager UI → find the stuck task → check which node it's on.
2. Check if it's a data skew issue — one Reducer handling a very large key.
3. Check for disk space issues on the Reducer node (spill files filling /tmp).
4. Check GC pressure — excessive Java garbage collection on that task.
5. Check for network issues — shuffle copy stalling.
Solution options: Kill the task and let speculative execution take over, or fix the skew and rerun.
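Most of steps 1–4 start from the aggregated task logs (the application ID is a placeholder):

yarn logs -applicationId <application-id> > app.log   # then search for the stuck task attempt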

🎯 Ready to Practice 300+ Big Data Questions?

Take our free interactive quizzes on Hadoop, DSA, SQL, PySpark and Networking — no signup required. Then grab the full question bank for offline study.

