
How to Profile Spark Jobs After Completion: The Complete Guide to Collecting Metrics on Databricks, EMR, and Dataproc

· 22 min read
Cazpian Engineering
Platform Engineering Team

Your Spark job finished. It ran for 2 hours. It used 50 executors. The bill arrived. Now what?

Most teams have no systematic way to answer the obvious follow-up questions: Was 50 executors the right number? Did we waste memory? Was there data skew? Should we use different instance types? They either overprovision "just in case" — burning 2-5x the needed budget — or reactively debug when jobs fail with OOM errors.

The root cause is not a lack of tooling. It is a data collection problem. Spark generates extraordinarily detailed metrics at every level — task, stage, executor, application — but those metrics vanish when the application terminates. If you did not capture them during or immediately after execution, they are gone.
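The simplest way to keep those metrics from vanishing is Spark's built-in event log, which records every task, stage, and executor event to durable storage for later replay by the History Server. A minimal PySpark sketch (the bucket path is a placeholder; substitute storage your History Server can read):

```python
from pyspark.sql import SparkSession

# Enable the event log so metrics survive application shutdown.
# "spark.eventLog.dir" below is a placeholder path; point it at
# durable storage (S3, GCS, ABFS, or HDFS) readable by your History Server.
spark = (
    SparkSession.builder
    .appName("profiled-job")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "s3://your-bucket/spark-events")
    .getOrCreate()
)
```

The same two properties can be passed via `--conf` on `spark-submit` or set cluster-wide in `spark-defaults.conf`; the managed platforms covered below each have their own way of wiring this up.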

This post solves that problem. We cover every method for collecting Spark profiling metrics, walk through the setup on Databricks, EMR, and Dataproc, and then show how to turn those metrics into right-sizing decisions with concrete formulas and thresholds.
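As a taste of where this leads: once a completed application's metrics are persisted, the Spark History Server exposes them over its REST API (`/api/v1/applications/{appId}/executors`), and a few lines of Python are enough to start asking the right-sizing questions. A sketch, assuming a History Server at the default `localhost:18080` (the address and application ID are placeholders):

```python
import json
import urllib.request

# Placeholder History Server address; adjust for your deployment.
HISTORY_SERVER = "http://localhost:18080"

def fetch_executor_metrics(app_id: str) -> list:
    """Fetch per-executor metrics for a completed application
    from the Spark History Server REST API."""
    url = f"{HISTORY_SERVER}/api/v1/applications/{app_id}/executors"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def memory_utilization(memory_used: int, max_memory: int) -> float:
    """Fraction of an executor's available storage memory actually used;
    persistently low values suggest the executors are oversized."""
    return memory_used / max_memory if max_memory else 0.0

# Usage against a real History Server (application ID is a placeholder):
# for ex in fetch_executor_metrics("app-20240101000000-0001"):
#     print(ex["id"], memory_utilization(ex["memoryUsed"], ex["maxMemory"]))
```

The `memoryUsed` and `maxMemory` fields come straight from the executors endpoint's response; later sections build on exactly this kind of query to answer the executor-count and instance-type questions above.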