3 posts tagged with "Spark Memory"

Spark OOM Debugging: The Complete Guide to Fixing Out of Memory Errors

· 22 min read
Cazpian Engineering
Platform Engineering Team

The job fails. The log says OutOfMemoryError. Someone doubles spark.executor.memory from 8 GB to 16 GB. The job passes. Nobody asks why. The team moves on, carrying twice the compute cost forever.

Three months later, data volume grows. The same job fails again. They double it to 32 GB. Then 64 GB. Then they hit the maximum instance type and cannot scale further. Only then does someone ask: "What is actually consuming all this memory?"

This is the most expensive pattern in data engineering. Teams treat Spark memory as a single dial and OOM errors as a signal to turn it up. They do not distinguish between driver OOM and executor OOM. They do not check whether the problem is a single skewed partition, an oversized broadcast, a missing unpersist(), or a collect() call buried in a utility function. They do not know how to read GC logs, check spill metrics, or use the Spark UI to pinpoint the memory bottleneck.

This post gives you the complete debugging toolkit. We cover every type of OOM error, how to tell driver from executor, a step-by-step debugging workflow, GC tuning for G1GC and ZGC, memory observability with Prometheus and Grafana, the most common anti-patterns, and how Cazpian eliminates the guesswork.

Spark Memory Architecture: The Complete Guide to the Unified Memory Model

· 17 min read
Cazpian Engineering
Platform Engineering Team

A Spark job fails with OutOfMemoryError. The team doubles spark.executor.memory from 8 GB to 16 GB. The job passes. Nobody investigates why. Three months later, the same job fails again on a larger dataset. They double it again to 32 GB. The cluster bill doubles with it.

This is the most common pattern in Spark operations — and the most expensive. Teams treat memory as a single knob to turn up. They do not know that Spark splits memory into distinct regions with different purposes, different eviction rules, and different failure modes. They do not know that their 16 GB executor only gives Spark 9.42 GB of usable unified memory. They do not know that their driver OOM has nothing to do with executor memory. They do not know that half their container memory is overhead they never configured.
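The 9.42 GB figure falls out of the unified memory formula: usable = (heap − 300 MB reserved) × spark.memory.fraction. A minimal sketch of that arithmetic, assuming Spark 3.x on-heap defaults (fraction 0.6, storage fraction 0.5):

```python
# Unified-memory arithmetic, assuming Spark 3.x on-heap defaults.
RESERVED_MB = 300        # fixed reserved memory
MEMORY_FRACTION = 0.6    # spark.memory.fraction default
STORAGE_FRACTION = 0.5   # spark.memory.storageFraction default

def unified_memory_mb(executor_memory_gb):
    """Usable unified (execution + storage) memory for a given heap size."""
    heap_mb = executor_memory_gb * 1024
    return (heap_mb - RESERVED_MB) * MEMORY_FRACTION

usable = unified_memory_mb(16)
print(f"usable unified memory: {usable / 1024:.2f} GB")  # 9.42 GB
print(f"initial storage pool:  {usable * STORAGE_FRACTION / 1024:.2f} GB")
```

Note this ignores off-heap memory and container overhead, both of which change the totals further; the formulas here cover only the on-heap unified region.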

This post explains exactly how Spark manages memory. We cover the unified memory model with exact formulas, the difference between execution and storage memory, driver vs executor memory architecture, off-heap memory with Tungsten, container memory for YARN and Kubernetes, PySpark-specific memory, and how to calculate every region from your configuration. The companion post covers OOM debugging, GC tuning, and observability.

Spark Caching and Persistence: The Complete Guide for Iceberg and Cazpian

· 30 min read
Cazpian Engineering
Platform Engineering Team

You are running the same 500 GB join three times in a single pipeline — once for a daily summary, once for a top-products report, once for customer segmentation. Each query reads from S3, shuffles the data across the network, builds hash maps, and aggregates from scratch. That is three full scans totaling 1.5 TB of object-store reads (two of them redundant), three redundant shuffles, and three redundant sort-merge joins.

Spark caching eliminates this waste. You compute the expensive join once, store the result in executor memory, and every subsequent query reads from that in-memory copy instead of going back to object storage. The improvement is not incremental — it is typically 10-100x faster for repeated access patterns.

But caching does something else that is less obvious and equally powerful: it makes Spark's query optimizer smarter. When a table is cached, Spark knows its exact in-memory size. If that size falls below the broadcast join threshold, the optimizer automatically converts a Sort-Merge Join into a Broadcast Hash Join — eliminating the shuffle entirely, without you writing a single hint.

This post covers every dimension of Spark caching. We start with internals, walk through every storage level, show all the ways to cache, explain the columnar storage format that makes DataFrame caching special, dive into memory management, discuss how much data you should actually cache (spoiler: not terabytes), show you how to read the Spark UI Storage tab, and cover the pitfalls that catch production workloads.