Spark Caching and Persistence: The Complete Guide for Iceberg and Cazpian

Cazpian Engineering · Platform Engineering Team · 30 min read

You are running the same 500 GB join three times in a single pipeline — once for a daily summary, once for a top-products report, once for customer segmentation. Each query reads from S3, shuffles terabytes across the network, builds hash maps, and aggregates from scratch. That is 1.5 TB of redundant I/O, three redundant shuffles, and three redundant sort-merge joins.

Spark caching eliminates this waste. You compute the expensive join once, store the result in executor memory, and every subsequent query reads from that in-memory copy instead of going back to object storage. The improvement is not incremental — it is typically 10-100x faster for repeated access patterns.

But caching does something else that is less obvious and equally powerful: it makes Spark's query optimizer smarter. When a table is cached, Spark knows its exact in-memory size. If that size falls below the broadcast join threshold, the optimizer automatically converts a Sort-Merge Join into a Broadcast Hash Join — eliminating the shuffle entirely, without you writing a single hint.

This post covers every dimension of Spark caching. We start with internals, walk through every storage level, show all the ways to cache, explain the columnar storage format that makes DataFrame caching special, dive into memory management, discuss how much data you should actually cache (spoiler: not terabytes), show you how to read the Spark UI Storage tab, and cover the pitfalls that catch production workloads.