Thursday, April 2, 2026

Apache Spark Performance Tuning Interview Questions and Answers (2026)

Performance tuning is where Spark interviews separate junior from senior candidates. While most engineers know the basics of DataFrames and SQL, interviewers at data-intensive companies probe whether you can diagnose a slow job, identify the root cause, and apply the right fix. This guide covers the most important Apache Spark performance tuning interview questions with detailed answers for 2026.

Before reading this, make sure you are solid on Spark fundamentals. Our Top 100 Apache Spark Interview Questions covers the full breadth of Spark concepts tested in interviews.


Why Performance Tuning Questions Matter in Spark Interviews

In production data engineering roles, you will regularly encounter pipelines that run for hours when they should finish in minutes. Interviewers want to know that you can look at a Spark UI, identify a bottleneck, and apply a targeted fix rather than blindly throwing more resources at the problem. Performance tuning knowledge signals real-world experience, not just textbook learning.


Core Apache Spark Performance Tuning Interview Questions

1. Walk me through how you would diagnose a slow Spark job.

This is the most open-ended and revealing performance question. A strong answer follows a systematic approach:

  1. Check the Spark Web UI (port 4040) — Look at the Jobs, Stages, and Tasks tabs. Find the stage with the longest duration.
  2. Identify the bottleneck stage — Is it a shuffle-heavy stage? Are there a few tasks running far longer than others (skew)?
  3. Check task metrics — Look at shuffle read/write bytes, spill to disk, and GC time per task.
  4. Look at partition sizes — Too many tiny partitions create scheduling overhead. Too few large partitions cause memory pressure.
  5. Check for skew — If the median task takes 2 seconds but one task takes 20 minutes, you have data skew on a key.
  6. Review the execution plan — Use df.explain('formatted') to see if predicate pushdown fired, if broadcast joins were used, and if any expensive sorts or shuffles can be avoided.
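The skew check in step 5 can be sketched as a simple heuristic over per-task durations (a hypothetical helper, not a Spark API — in practice the durations come from the Stages tab or the Spark REST API):

```python
from statistics import median

def looks_skewed(task_durations_sec, ratio=10.0):
    """Flag a stage as skewed if the slowest task takes far
    longer than the median task (a common rule of thumb)."""
    if not task_durations_sec:
        return False
    med = median(task_durations_sec)
    return med > 0 and max(task_durations_sec) / med >= ratio

# Median ~2 s, one straggler at 20 min: clearly skewed
durations = [2.1, 1.9, 2.0, 2.2, 1200.0]
print(looks_skewed(durations))  # True
```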

2. What is data skew in Spark and what are your options for fixing it?

Data skew occurs when certain key values appear far more frequently than others. During a shuffle (groupBy, join), all records for a given key are routed to the same task — one task processes millions of rows while others finish in seconds. The stage cannot complete until the last task finishes.

Solutions:

  • AQE Skew Join Optimisation (Spark 3.0+) — Enabled automatically when spark.sql.adaptive.skewJoin.enabled=true. Spark detects skewed partitions at runtime and splits them into smaller sub-partitions, duplicating the matching partition from the other side.
  • Salting — Manually append a random number (0 to N-1) to the skewed key. Replicate the non-skewed side N times with each suffix. After joining, strip the salt. This spreads one hot key across N tasks.
  • Broadcast join — If the non-skewed table is small enough, broadcast it to avoid the shuffle entirely.
  • Repartition before joining — Use repartition(n, 'key') with a higher n to spread the load, though this does not fix true key skew: all rows for a single hot key still hash to one partition.
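The salting arithmetic above can be illustrated without a cluster. This pure-Python sketch spreads one hot key across N salted keys and replicates the small side once per salt value so the join still matches (a DataFrame version would do the same with rand() and concat on the key column):

```python
import random

N = 4  # number of salt buckets

# Skewed side: append a random salt 0..N-1 to each key
skewed = [("hot", v) for v in range(8)]
salted_left = [(f"{k}_{random.randrange(N)}", v) for k, v in skewed]

# Non-skewed side: replicate each row N times, once per salt value
small = [("hot", "dim_row")]
salted_right = [(f"{k}_{s}", d) for k, d in small for s in range(N)]

# Join on the salted key, then strip the salt
right_map = dict(salted_right)
joined = [(k.rsplit("_", 1)[0], v, right_map[k])
          for k, v in salted_left if k in right_map]

print(len(joined))  # every left row still finds its match: 8
```

Each hot-key row now lands in one of N tasks instead of all in one, at the cost of replicating the small side N times.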

3. What is Adaptive Query Execution (AQE) and what problems does it solve?

AQE (spark.sql.adaptive.enabled=true, the default since Spark 3.2) re-optimises query execution plans at runtime using actual statistics collected after each shuffle. The static Catalyst optimizer must estimate data sizes before execution; those estimates are often wrong. AQE corrects them dynamically.

Three key AQE capabilities:

  1. Coalescing shuffle partitions — After a shuffle, AQE merges many small partitions into fewer larger ones, reducing the overhead of scheduling thousands of tiny tasks. This makes the default of 200 shuffle partitions far less critical to tune manually.
  2. Converting sort-merge joins to broadcast joins — If runtime statistics show one side of a join is smaller than the broadcast threshold, AQE switches to a broadcast hash join on the fly — no shuffle needed for the large side.
  3. Skew join optimisation — AQE detects skewed partitions post-shuffle and splits them automatically, as described above.
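In Spark 3.2+ these behaviours are on by default, but the relevant settings can be pinned explicitly on an existing session. A configuration sketch (the size value is illustrative, not a recommendation):

```python
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# Advisory size AQE targets when coalescing post-shuffle partitions
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128m")
```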

4. What is the difference between a broadcast join and a sort-merge join?

Broadcast Hash Join: The smaller table is serialised and sent to every executor as a hash map in memory. The larger table's partitions are then probed locally against that hash map. No shuffle of the larger table is needed. Best when one side is small (default threshold: 10 MB).

Sort-Merge Join: Both sides are shuffled by the join key, sorted, and then merged. Requires a full shuffle of both tables. This is Spark's default for large-large joins. It scales to any data size but is expensive due to the shuffle.

Force a broadcast join with the hint: df1.join(broadcast(df2), 'key'), where broadcast is imported from pyspark.sql.functions. You can also increase the auto-broadcast threshold: spark.conf.set('spark.sql.autoBroadcastJoinThreshold', '50m').


5. What is the impact of spark.sql.shuffle.partitions and how do you tune it?

This configuration (default: 200) controls the number of partitions produced after any shuffle operation (join, groupBy, distinct). It is one of the most impactful Spark settings.

Too low (e.g., 10): Each post-shuffle partition holds too much data, causing memory pressure, GC pauses, and potential OOM errors.

Too high (e.g., 2000 for a small dataset): Each partition is tiny, scheduling overhead dominates, and writing to storage creates thousands of small files.

Rule of thumb: Target 100–200 MB per partition. Formula: total_shuffle_data_size / target_partition_size. With AQE enabled (Spark 3.2+), you can set this high and let AQE coalesce small partitions automatically — greatly reducing the need for manual tuning.
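The rule of thumb above is simple arithmetic. A sketch (in practice the shuffle size would be read from the Spark UI's shuffle write metric):

```python
import math

def shuffle_partitions(total_shuffle_bytes, target_bytes=128 * 1024**2):
    """Partition count targeting ~128 MB per post-shuffle partition."""
    return max(1, math.ceil(total_shuffle_bytes / target_bytes))

# 50 GB of shuffle data at a 128 MB target -> 400 partitions
print(shuffle_partitions(50 * 1024**3))  # 400
```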


6. When and how should you use caching in Spark?

Cache a DataFrame when it is used more than once in a pipeline. Without caching, Spark recomputes the entire lineage from source for every action. With caching, the result is stored in memory (or disk) after the first computation and reused for all subsequent actions.

# Cache when reused multiple times
df_filtered = df.filter(col('status') == 'active').cache()

count = df_filtered.count()           # triggers computation, stores in cache
df_filtered.limit(10).show()          # reads from cache, fast (show() prints; it returns None)

df_filtered.unpersist()               # free memory when done

Do not cache: DataFrames used only once, very large DataFrames that fill memory (evicts other cached data), or streaming DataFrames.

Persist levels: MEMORY_ONLY (default for cache()), MEMORY_AND_DISK (spills to disk if memory is full), DISK_ONLY, and OFF_HEAP (avoids GC).


7. What causes Out of Memory errors in Spark executors and how do you resolve them?

OOM errors are common in production Spark jobs. Common causes and fixes:

  • Partition too large: Increase partition count with repartition() to reduce data per task.
  • Data skew: One task processes the entire hot key. Apply salting or AQE skew join.
  • Broadcast variable too large: If a broadcast variable exceeds executor memory, reduce spark.sql.autoBroadcastJoinThreshold or avoid broadcasting that table.
  • Excessive caching: Too much cached data competes with execution memory. Unpersist unused DataFrames.
  • GC overhead: Enable off-heap memory (spark.memory.offHeap.enabled=true) or tune the JVM GC algorithm (G1GC is generally best for Spark).
  • Increase executor memory: As a last resort, raise spark.executor.memory — but always fix the root cause first.
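Off-heap memory and executor sizing must be configured before the session starts, not changed mid-job. A builder sketch (the sizes are illustrative, not recommendations):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("oom-tuning-sketch")
    .config("spark.executor.memory", "8g")
    .config("spark.memory.offHeap.enabled", "true")
    .config("spark.memory.offHeap.size", "2g")  # off-heap pool, outside GC
    .config("spark.sql.autoBroadcastJoinThreshold", "10m")
    .getOrCreate()
)
```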

8. What is the small files problem and how do you prevent it?

When a DataFrame with many partitions is written to storage, it produces one file per partition. A 10,000-partition DataFrame creates 10,000 files. Object stores like S3 and ADLS perform poorly with millions of small files because each file open is a separate metadata API call.

Prevention:

  • Use df.coalesce(n).write... to reduce partitions before writing (no full shuffle).
  • Use df.repartition(n).write... when you need evenly sized output files (full shuffle).
  • Use Delta Lake OPTIMIZE to compact small files into 1 GB target files with a single command: OPTIMIZE table_name ZORDER BY (column).

9. What is predicate pushdown and how do you verify it is working?

Predicate pushdown moves filter conditions as close to the data source as possible — ideally down into the file reader. For Parquet files, Spark passes filter conditions to the Parquet reader, which uses row group min/max statistics to skip entire row groups without reading them.

Verify with: df.explain('formatted'). Look for PushedFilters in the FileScan node. If your filter is listed there, pushdown is working. If not, the filter may be using a UDF or function that blocks pushdown — rewrite using native Spark functions.
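The row-group skipping that pushdown enables can be illustrated in plain Python: if a row group's min/max statistics cannot satisfy the predicate, the reader never touches it (a simplified model of what the Parquet reader does, not the actual API):

```python
# Each row group carries (min, max) statistics for a column
row_groups = [
    {"min": 1,   "max": 100, "rows": 10_000},
    {"min": 101, "max": 500, "rows": 10_000},
    {"min": 501, "max": 900, "rows": 10_000},
]

def groups_to_read(groups, predicate_value):
    """Keep only groups whose [min, max] range could contain
    rows matching `col > predicate_value`."""
    return [g for g in groups if g["max"] > predicate_value]

# Filter col > 450: the first group (max 100) is skipped entirely
print(len(groups_to_read(row_groups, 450)))  # 2
```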


Quick Reference: Spark Performance Tuning Checklist

  • Enable AQE: spark.sql.adaptive.enabled=true
  • Set shuffle partitions correctly or let AQE auto-tune
  • Use broadcast joins for small tables
  • Cache DataFrames that are reused; unpersist when done
  • Avoid UDFs — use built-in Spark functions instead
  • Use Parquet or Delta Lake, not CSV or JSON
  • Monitor the Spark UI — stages, tasks, shuffle metrics
  • Coalesce output files to avoid the small files problem
  • Apply salting or AQE for data skew
  • Use df.explain() to verify execution plan optimisations

Frequently Asked Questions

Q: What is the most impactful single change for Spark performance?

Enabling Adaptive Query Execution (AQE) in Spark 3.2+ often provides the largest single improvement because it automatically handles shuffle partition sizing, skew joins, and broadcast join conversions at runtime.

Q: Is Kryo serialization still relevant in Spark 3.x?

Yes, for RDD-based operations where data is serialised between stages. For DataFrame-heavy workloads, Tungsten's binary format handles serialisation internally, so Kryo has less impact. It is still worth enabling for mixed workloads.

Q: What is the difference between coalesce and repartition for output?

Coalesce merges partitions without a full shuffle — faster but may produce uneven file sizes. Repartition triggers a full shuffle — slower but produces evenly distributed files. Use coalesce for reducing output file count; use repartition when even file sizes matter.

Q: How do you monitor a Spark job in production?

Use the Spark Web UI (port 4040), Spark History Server for completed jobs, and integrate with monitoring tools like Prometheus, Grafana, or cloud-native solutions like AWS CloudWatch or Azure Monitor. Databricks provides a built-in Ganglia UI and cluster metrics dashboard.

Q: Where can I find more Apache Spark interview questions?

Visit our complete guide: Top 100 Apache Spark Interview Questions and Answers (2026) covering all Spark topics from RDDs to Delta Lake.


Conclusion

Apache Spark performance tuning is a skill built through real-world debugging, not memorisation. The engineers who ace these interview questions are those who have sat with the Spark UI open, tracked down a data skew problem at 2am, and learned which knob to turn. Study these concepts, practise with real datasets, and you will be well prepared for any Spark performance question in your 2026 data engineering interview.

For the complete Spark interview preparation, see our Top 100 Apache Spark Interview Questions (2026).

© 2026 InterviewQuestionsToLearn.com