Friday, April 3, 2026

PySpark Interview Questions and Answers for Data Engineers (2026)

PySpark has become the dominant language for data engineering in 2026. While Scala remains popular for performance-critical Spark applications, the majority of new data pipeline code is written in Python using PySpark. Interviewers at companies from early-stage startups to FAANG now expect data engineers to write fluent PySpark code, understand its internals, and know when Python's limitations require special handling.

This guide covers the top PySpark interview questions for data engineers, with detailed answers and real code examples. Pair this with our Top 100 Apache Spark Interview Questions (2026) for complete coverage.


PySpark Fundamentals Interview Questions

1. How does PySpark communicate with Apache Spark's JVM?

This internals question separates candidates who understand PySpark's architecture from those who only know the surface API.

PySpark uses Py4J — a library that enables Python programs to dynamically access Java objects running in a JVM. When you call a PySpark DataFrame method, Python sends the instruction over a socket to a JVM Gateway Server. The Spark JVM executes the operation and returns the result.

The implication: Python code does not run on the executors for DataFrame operations. The DataFrame API is just a Python wrapper that constructs a JVM execution plan. Data stays in the JVM, stored and processed in Tungsten's binary format. This is why DataFrame operations are fast even in Python — Python is only the driver-side orchestrator.
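
A minimal sketch of this behaviour, using a small hypothetical DataFrame: every transformation below only sends plan-building instructions to the JVM via Py4J, and explain() prints the plan Catalyst constructed; nothing has run in Python on the executors.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("py4j-demo").getOrCreate()

# Hypothetical sample data, just for illustration
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Each call only builds up a plan on the JVM side over the Py4J gateway;
# no rows are moved into Python and nothing executes on the cluster yet
result = df.filter(col("id") > 1).select("name")

# Prints the JVM-side physical plan built from the calls above
result.explain()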

The exception: Python UDFs. When you register a Python UDF, each executor must launch a Python process, serialize each row from JVM to Python via pickle, execute your Python function, and serialize results back to JVM. This is the serialisation overhead that makes Python UDFs slow.


2. What is the difference between a PySpark UDF and a Pandas UDF?

This is one of the most common advanced PySpark questions in senior data engineer interviews.

Regular Python UDF:

  • Processes one row at a time
  • Each row is serialised from JVM → Python via pickle, processed, then serialised back
  • Catalyst cannot optimise inside the UDF (black box)
  • Typically 10–100x slower than built-in Spark functions
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def clean_name(name):
    return name.strip().title() if name else None

df = df.withColumn('clean_name', clean_name('raw_name'))

Pandas UDF (Vectorized UDF):

  • Processes a batch of rows as a Pandas Series
  • Uses Apache Arrow for columnar batch transfer — no row-by-row serialisation
  • Still a black box to Catalyst, but data transfer is dramatically more efficient
  • Typically 10–100x faster than row UDFs for the same logic
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType
import pandas as pd

@pandas_udf(StringType())
def clean_name_vectorized(names: pd.Series) -> pd.Series:
    return names.str.strip().str.title()

df = df.withColumn('clean_name', clean_name_vectorized('raw_name'))

Rule: Always prefer built-in Spark functions (from pyspark.sql.functions). If you must use Python logic, prefer Pandas UDFs over row UDFs. Use row UDFs only for complex Python logic that cannot be vectorised.
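
For the clean_name example above, a UDF is not actually needed. A minimal sketch of the same cleanup using only built-in functions (trim and initcap), which run in the JVM and stay visible to Catalyst:

from pyspark.sql.functions import col, initcap, trim

# Same logic as the UDF, but executed in the JVM and optimisable by Catalyst;
# built-in functions propagate NULLs automatically, so no explicit None check is needed
df = df.withColumn('clean_name', initcap(trim(col('raw_name'))))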


3. What is the Pandas API on Spark and when would you use it?

The Pandas API on Spark (import pyspark.pandas as ps, available natively since Spark 3.2) implements the Pandas DataFrame interface on top of Spark's distributed engine. You write Pandas-style code that transparently runs distributed across the cluster.

import pyspark.pandas as ps

# Read a large CSV from HDFS using Pandas syntax
psdf = ps.read_csv('hdfs:///data/large_file.csv')

# Use familiar Pandas operations
result = psdf.groupby('category')['revenue'].sum().reset_index()
result.to_csv('hdfs:///output/summary.csv')

When to use it: When you have existing Pandas code that needs to scale to larger-than-memory datasets. The migration effort is minimal — mostly replacing import pandas as pd with import pyspark.pandas as ps.

When NOT to use it: For new code, the native PySpark DataFrame API is preferred — it is more performant, better optimised by Catalyst, and the intent is clearer. The Pandas API on Spark has API coverage gaps and some operations trigger expensive full data collects.
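
The two APIs can also be mixed in one pipeline. A short sketch, assuming a recent Spark release where to_spark() and pandas_api() are the conversion methods:

import pyspark.pandas as ps

# Start in the Pandas-style API
psdf = ps.read_csv('hdfs:///data/large_file.csv')

# Drop down to the native PySpark DataFrame API for heavy transformations
sdf = psdf.to_spark()

# ...and convert back to the Pandas-style API when convenient
psdf2 = sdf.pandas_api()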


4. How do you handle missing values in PySpark?

Missing value handling is a standard data engineering interview question. PySpark's DataFrameNaFunctions (accessed via df.na) provides clean methods:

# Drop rows with any NULL values
df_clean = df.na.drop()

# Drop rows where specific columns are NULL
df_clean = df.na.drop(subset=['customer_id', 'order_date'])

# Fill all NULLs with a single value (type must match)
df_filled = df.na.fill(0)

# Fill NULLs column-by-column with different values
df_filled = df.na.fill({
    'age': 0,
    'city': 'Unknown',
    'revenue': 0.0
})

# Replace specific values (useful for sentinel values like -1 or 'N/A')
df_replaced = df.na.replace(-1, None, subset=['age'])

For more complex imputation, use withColumn with coalesce() or when().otherwise():

from pyspark.sql.functions import coalesce, lit, when, avg, col

# Replace NULL age with the column average
avg_age = df.select(avg('age')).first()[0]
df = df.withColumn('age', coalesce(col('age'), lit(avg_age)))
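
The same imputation can also be written with when().otherwise(), which additionally supports conditional logic beyond simple NULL replacement. A sketch reusing the imports and the avg_age value from the snippet above:

# Keep non-NULL ages, substitute the column average otherwise
df = df.withColumn(
    'age',
    when(col('age').isNull(), lit(avg_age)).otherwise(col('age'))
)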

5. Explain the difference between toPandas() with and without Arrow optimisation.

Converting a PySpark DataFrame to a Pandas DataFrame is a common operation for final reporting, visualisation, or small-scale analysis. The performance difference is significant:

Without Arrow (the default in Spark 3.x unless explicitly enabled): each row is serialised from Spark's internal binary format into Python objects one at a time. For a million rows that means a million individual row conversions — very slow.

With Arrow enabled: Data is transferred as columnar Arrow batches. An entire column transfers in one operation with near-zero copying. Typically 10–50x faster for large DataFrames.

# Enable Arrow optimisation
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Now toPandas() uses Arrow batches
pandas_df = spark_df.toPandas()

# Also speeds up createDataFrame from Pandas
spark_df = spark.createDataFrame(pandas_df)

Warning: toPandas() collects ALL data to the driver. Only use it after filtering to a manageable size. Collecting terabytes to the driver causes OOM errors.
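
A sketch of the safe pattern, using a hypothetical event_year column: reduce the data in Spark first, then convert only the small result.

from pyspark.sql.functions import col

# Filter and cap the row count in Spark before collecting to the driver
pandas_df = (
    spark_df
    .filter(col("event_year") == 2025)
    .limit(100_000)
    .toPandas()
)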


6. How do you read and write different file formats in PySpark?

# Read CSV with schema inference off (use an explicit schema in production)
# mySchema is a predefined StructType (see the schema sketch under Best practice below)
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "false") \
    .schema(mySchema) \
    .csv("s3://bucket/data/*.csv")

# Read Parquet (schema embedded, no inference needed)
df = spark.read.parquet("s3://bucket/data/")

# Read Delta Lake
df = spark.read.format("delta").load("s3://bucket/delta_table/")

# Read JSON (schema inference available but slow)
df = spark.read.json("s3://bucket/data/*.json")

# Write Parquet with partitioning
df.write \
    .partitionBy("year", "month") \
    .mode("overwrite") \
    .parquet("s3://bucket/output/")

# Write Delta Lake (append mode)
df.write.format("delta").mode("append").save("s3://bucket/delta_table/")

Best practice: Always define schemas explicitly in production code. Schema inference requires a full file scan and can infer types incorrectly (integers read as strings, nulls in numeric columns cause type errors).
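
A minimal sketch of what an explicit schema such as the mySchema referenced above could look like (the field names here are hypothetical):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, DateType

# Explicit schema: no inference scan, and column types are guaranteed
mySchema = StructType([
    StructField("customer_id", StringType(), nullable=False),
    StructField("order_date", DateType(), nullable=True),
    StructField("quantity", IntegerType(), nullable=True),
    StructField("revenue", DoubleType(), nullable=True),
])

df = spark.read.option("header", "true").schema(mySchema).csv("s3://bucket/data/*.csv")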


7. How do you write a PySpark Structured Streaming job that reads from Kafka?

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder \
    .appName("KafkaStreamProcessor") \
    .getOrCreate()

# Define schema for the JSON value in Kafka messages
schema = StructType() \
    .add("user_id", StringType()) \
    .add("event_type", StringType()) \
    .add("event_time", TimestampType())

# Read from Kafka
raw_stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("subscribe", "user_events") \
    .option("startingOffsets", "latest") \
    .load()

# Parse the JSON value column
parsed_stream = raw_stream \
    .select(from_json(col("value").cast("string"), schema).alias("data")) \
    .select("data.*")

# Write to Delta Lake with checkpointing
query = parsed_stream.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "s3://bucket/checkpoints/events") \
    .start("s3://bucket/delta/events")

query.awaitTermination()
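
Note that the Kafka source is not bundled with the core Spark distribution: the job needs the spark-sql-kafka-0-10 connector package (matched to your Spark and Scala versions) on the classpath, typically supplied via the --packages option of spark-submit.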

8. What are some common PySpark performance mistakes and how do you avoid them?

  1. Using Python UDFs when built-in functions exist — Built-in functions (from pyspark.sql.functions) run in the JVM and benefit from Catalyst optimisation. Always check if a built-in function exists before writing a UDF.
  2. Calling toPandas() on large DataFrames — Collects all data to the driver. Filter to a small size first.
  3. Not caching reused DataFrames — If a DataFrame is used in two places in a pipeline without caching, Spark recomputes it from source twice.
  4. Using collect() in a loop — Each collect() triggers a full job. Batch all operations and call collect() once.
  5. Chaining withColumn() in a loop — Calling withColumn() 100 times creates a deeply nested query plan. Use a single select() with all transformations instead (see the sketch after this list).
  6. Not specifying schemas when reading CSV/JSON — Schema inference scans the entire file. Always provide an explicit StructType schema in production.
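
To illustrate mistake 5, a sketch comparing the withColumn() loop anti-pattern with a single select(), using hypothetical column names:

from pyspark.sql.functions import col

# Anti-pattern: every withColumn() call in the loop adds another projection to the plan
# for c in numeric_cols:
#     df = df.withColumn(c, col(c) * 100)

# Better: build all expressions once and apply them in a single select()
numeric_cols = ['price', 'tax', 'discount']   # hypothetical column names
scaled = [(col(c) * 100).alias(c) for c in numeric_cols]
untouched = [c for c in df.columns if c not in numeric_cols]
df = df.select(*untouched, *scaled)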

Frequently Asked Questions

Q: Is PySpark slower than Scala Spark?

For DataFrame and Spark SQL operations, PySpark and Scala Spark run at identical speeds — both compile to the same JVM execution plan. The performance difference arises only when using Python UDFs, which require data serialisation between JVM and Python. Use Pandas UDFs to close most of this gap.

Q: What Python version is required for PySpark 3.5?

PySpark 3.5 requires Python 3.8 or higher. Python 3.11 and 3.12 are supported and recommended for best performance with the latest optimisations.

Q: What is the difference between SparkContext and SparkSession in PySpark?

SparkContext was the original entry point (Spark 1.x) for creating RDDs. SparkSession (Spark 2.0+) is the unified entry point for DataFrames, Datasets, and SQL. SparkSession contains a SparkContext internally. All new code should use SparkSession.
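
A small sketch of the relationship:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('demo').getOrCreate()

# The session wraps a SparkContext; reach for it only when an RDD-level API is needed
sc = spark.sparkContext
rdd = sc.parallelize([1, 2, 3])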

Q: How do you run PySpark locally for development?

Set master to local: SparkSession.builder.master('local[*]').getOrCreate(). The [*] uses all available CPU cores. For a single thread: local[1]. For a specific number: local[4].

Q: Where can I find more Apache Spark and PySpark interview questions?

Our Top 100 Apache Spark Interview Questions (2026) covers all Spark topics including PySpark, RDDs, DataFrames, streaming, MLlib, and Delta Lake.


Conclusion

PySpark interviews in 2026 test both your Python fluency and your understanding of Spark's distributed execution model. The strongest candidates know when Python's overhead matters (UDFs), how to use Arrow and Pandas UDFs to minimise it, and how to write clean, performant PySpark pipelines that work at terabyte scale. Master these concepts and practise writing real PySpark code — the best preparation is always hands-on experience with real data.

Continue your preparation with our comprehensive Top 100 Apache Spark Interview Questions (2026).
