Spark SQL Join Strategy: The Complete Optimization Guide
Your Spark job runs for 45 minutes. You check the Spark UI and find that a single join stage consumed 38 of those minutes — shuffling 800 GB across the network because the optimizer picked SortMergeJoin for a query where one side was 40 MB after filtering. Nobody ran ANALYZE TABLE. No statistics existed. The optimizer had no idea the table was small enough to broadcast.
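That failure mode is avoidable: collecting statistics up front gives the optimizer the size information it needs. A minimal sketch in Spark SQL, using a hypothetical `dim_customers` table:

```sql
-- Hypothetical table name for illustration.
-- Table-level statistics (total size, row count) let the optimizer
-- see when a side fits under the broadcast threshold.
ANALYZE TABLE dim_customers COMPUTE STATISTICS;

-- Column-level statistics additionally improve filter-selectivity
-- estimates, which matter once spark.sql.cbo.enabled is true.
ANALYZE TABLE dim_customers COMPUTE STATISTICS FOR COLUMNS customer_id, region;
```

Without these, Spark falls back on file-size estimates that know nothing about how much a filter prunes.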
The join strategy is the single most impactful decision in a Spark SQL query plan. It determines whether your data shuffles across the network, whether it spills to disk, whether your driver runs out of memory, and whether your query finishes in seconds or hours. Spark offers five distinct join strategies, each with different performance characteristics, memory requirements, and failure modes. The optimizer picks one based on statistics, hints, configuration, and join type — and when it lacks information, it often picks the wrong one.
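When statistics are missing or stale, a join hint can override the optimizer's choice directly. A sketch with hypothetical table names:

```sql
-- Force a broadcast of the known-small dimension side, regardless of
-- what the optimizer's size estimate says.
SELECT /*+ BROADCAST(d) */
       f.order_id, d.region
FROM   fact_orders f
JOIN   dim_customers d
  ON   f.customer_id = d.customer_id;

-- The other join hints Spark recognizes (Spark 3.0+):
--   MERGE (sort-merge join), SHUFFLE_HASH, SHUFFLE_REPLICATE_NL
```

Hints are a blunt instrument, though: a forced broadcast of a side that turns out to be large will fail the job rather than merely slow it down, which is why the rest of this post focuses on understanding how the selection actually works.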
This post covers:

- every join strategy in Spark
- how the JoinSelection decision tree works internally
- how the Catalyst optimizer estimates sizes
- how CBO reorders multi-table joins
- every join hint and when to use it
- how AQE converts strategies at runtime
- the equi vs. non-equi join problem
- how to read physical plans
- the most common anti-patterns
- a real-world decision framework you can use in production