Spark Broadcast Joins: The Complete Guide for Iceberg and Cazpian
Your 500-node Spark cluster is shuffling a 2 TB fact table across the network — serializing every row, hashing it, writing it to disk, sending it over the wire, and deserializing it on the other side — just to join it with a 50 MB dimension table. Every single query. Every single day.
The shuffle is the most expensive operation in distributed computing. It consumes network bandwidth, disk I/O, CPU cycles, and memory on every executor in the cluster. And for joins where one side is small, it is completely unnecessary.
Broadcast join eliminates the shuffle entirely. Instead of redistributing both tables across the cluster, Spark sends the small table to every executor, where it is stored as an in-memory hash map. Each executor then joins its local partitions of the large table against this hash map — no network shuffle, no disk spill, no cross-executor coordination. The result is typically a 5-20x speedup over Sort-Merge Join for eligible queries.
This post goes deep. We cover exactly what happens inside a broadcast join, all five ways to trigger one, how AQE converts joins at runtime, the real memory math that catches people off guard, when broadcast hurts instead of helps, and how to monitor and debug broadcast behavior in production.

