Spark OOM Debugging: The Complete Guide to Fixing Out of Memory Errors

22 min read
Cazpian Engineering
Platform Engineering Team

The job fails. The log says OutOfMemoryError. Someone doubles spark.executor.memory from 8 GB to 16 GB. The job passes. Nobody asks why. The team moves on, carrying twice the compute cost forever.

Three months later, data volume grows. The same job fails again. They double it to 32 GB. Then 64 GB. Then they hit the maximum instance type and cannot scale further. Only then does someone ask: "What is actually consuming all this memory?"

This is one of the most expensive patterns in data engineering. Teams treat Spark memory as a single dial and OOM errors as a signal to turn it up. They do not distinguish between driver OOM and executor OOM. They do not check whether the problem is a single skewed partition, an oversized broadcast, a missing unpersist(), or a collect() call buried in a utility function. They do not know how to read GC logs, check spill metrics, or use the Spark UI to pinpoint the memory bottleneck.
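To see why executor memory is not a single dial, it helps to split one spark.executor.memory value into the regions Spark carves out of it. The sketch below is plain Python, not a Spark API. The function name executor_memory_breakdown and the dictionary keys are made up for illustration. The constants are the documented defaults for the unified memory manager as we understand them (300 MB reserved, spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5, overhead of 10% with a 384 MB floor), so check them against your Spark version and cluster configuration.

```python
# Rough breakdown of one executor's memory under Spark's unified memory manager.
# All constants are assumed defaults; verify them for your Spark version.
RESERVED_MB = 300          # heap reserved for Spark internals
MEMORY_FRACTION = 0.6      # spark.memory.fraction
STORAGE_FRACTION = 0.5     # spark.memory.storageFraction
OVERHEAD_FACTOR = 0.10     # spark.executor.memoryOverheadFactor
MIN_OVERHEAD_MB = 384      # floor applied to the off-heap overhead


def executor_memory_breakdown(executor_memory_mb: float) -> dict:
    """Split spark.executor.memory (in MB) into approximate regions.

    Hypothetical helper for illustration. The real JVM reports slightly
    less usable heap than -Xmx, so treat the numbers as estimates.
    """
    usable = executor_memory_mb - RESERVED_MB
    unified = usable * MEMORY_FRACTION       # execution + storage, shared
    storage = unified * STORAGE_FRACTION     # soft boundary, not a hard cap
    execution = unified - storage
    user = usable - unified                  # user data structures, UDF objects
    overhead = max(MIN_OVERHEAD_MB, executor_memory_mb * OVERHEAD_FACTOR)
    return {
        "reserved_mb": RESERVED_MB,
        "unified_mb": unified,
        "storage_mb": storage,
        "execution_mb": execution,
        "user_mb": user,
        "overhead_mb": overhead,
        "container_total_mb": executor_memory_mb + overhead,
    }


if __name__ == "__main__":
    for name, value in executor_memory_breakdown(8 * 1024).items():
        print(f"{name:>20}: {value:,.1f}")
```

Under these assumptions, an 8 GB executor has roughly 4,735 MB of unified memory shared between execution and storage, and the container request is larger than 8 GB once overhead is added. Doubling spark.executor.memory grows every region proportionally, whether or not that region is the one that ran out.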

This post gives you the complete debugging toolkit. We cover every type of OOM error, how to tell driver from executor, a step-by-step debugging workflow, GC tuning for G1GC and ZGC, memory observability with Prometheus and Grafana, the most common anti-patterns, and how Cazpian eliminates the guesswork.