
Databricks System Tables: The Observability Gap — What They Expose vs What You Actually Need for Cost Control

17 min read
Cazpian Engineering
Platform Engineering Team

Databricks System Tables: The Observability Gap

Databricks system tables look comprehensive on paper. Sixteen tables across ten schemas. Billing, compute, jobs, queries, lineage, audit. When Databricks deprecated Overwatch and pointed everyone to system tables, the message was clear: this is the future of observability.

But if you have ever tried to answer these questions using only system tables, you already know the gap:

  • Why is my job's GC time at 15%? (System tables do not track GC time.)
  • Which stage is spilling to disk? (System tables do not track per-stage spill.)
  • Which executor is the memory bottleneck? (System tables do not track executor-level JVM metrics.)
  • How many files does my Delta table have? Is it healthy? (System tables do not track Delta table physical metrics.)
  • What did my serverless job's infrastructure look like? (System tables do not populate node_timeline for serverless.)

This post documents exactly what Databricks system tables contain — column by column — and exactly what is missing. Every claim is verifiable against Databricks' own documentation.

What System Tables Actually Contain

Databricks exposes system tables under the system catalog in Unity Catalog. Here are the tables relevant to compute observability and cost control, with their exact schemas.

system.compute.node_timeline — Per-Node Hardware Metrics

This is the closest thing to infrastructure monitoring in system tables. It records OS-level metrics at per-node, per-minute granularity.

| Column | Type | Description |
|---|---|---|
| account_id | string | Account ID |
| workspace_id | string | Workspace ID |
| cluster_id | string | Cluster ID |
| instance_id | string | Cloud instance ID (e.g., i-1234a6c12a2681234) |
| start_time | timestamp | Record start (UTC) |
| end_time | timestamp | Record end (UTC) |
| driver | boolean | Whether this is the driver node |
| cpu_user_percent | double | CPU time in userland (%) |
| cpu_system_percent | double | CPU time in kernel (%) |
| cpu_wait_percent | double | CPU time waiting for I/O (%) |
| mem_used_percent | double | OS memory used (%) — includes all processes |
| mem_swap_percent | double | Memory swap usage (%) |
| network_sent_bytes | bigint | Bytes sent over network |
| network_received_bytes | bigint | Bytes received from network |
| disk_free_bytes_per_mount_point | map | Free disk by mount point |
| node_type | string | Cloud instance type |

Retention: 90 days. Scope: All-purpose and jobs compute only. Does NOT include serverless compute or SQL warehouses.

What this tells you: OS-level CPU, memory, network, and disk per node per minute.

What this does NOT tell you: Anything about Spark. No JVM heap. No GC time. No execution memory vs storage memory. No per-executor breakdown. The mem_used_percent includes all OS processes — it cannot distinguish between Spark execution memory, cached data, and background daemons.
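Even with those limits, node_timeline can still answer coarse right-sizing questions, such as whether a cluster's peak OS-level utilization ever justifies its size. A minimal sketch in plain Python, assuming rows shaped like node_timeline records (in practice you would fetch them with spark.sql against system.compute.node_timeline; the 40%/50% thresholds are illustrative, not Databricks guidance):

```python
# Sketch: flag clusters whose peak OS-level utilization stays low across
# every sampled minute. Rows mimic system.compute.node_timeline output.
def underutilized_clusters(rows, cpu_threshold=40.0, mem_threshold=50.0):
    """Return cluster IDs whose peak CPU and memory never reach the thresholds."""
    peaks = {}
    for r in rows:
        cpu = r["cpu_user_percent"] + r["cpu_system_percent"]
        c, m = peaks.get(r["cluster_id"], (0.0, 0.0))
        peaks[r["cluster_id"]] = (max(c, cpu), max(m, r["mem_used_percent"]))
    return sorted(cid for cid, (c, m) in peaks.items()
                  if c < cpu_threshold and m < mem_threshold)

rows = [
    {"cluster_id": "a", "cpu_user_percent": 10.0, "cpu_system_percent": 5.0, "mem_used_percent": 30.0},
    {"cluster_id": "a", "cpu_user_percent": 20.0, "cpu_system_percent": 5.0, "mem_used_percent": 35.0},
    {"cluster_id": "b", "cpu_user_percent": 70.0, "cpu_system_percent": 10.0, "mem_used_percent": 60.0},
]
print(underutilized_clusters(rows))  # ['a']
```

This is a downsizing signal only: because mem_used_percent blends all processes, low values are meaningful, but high values tell you nothing about which memory pool is under pressure.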

system.billing.usage — DBU Consumption

The billing table records hourly DBU consumption with rich metadata about what consumed the DBUs.

| Column | Type | Key Detail |
|---|---|---|
| sku_name | string | SKU name (e.g., STANDARD_ALL_PURPOSE_COMPUTE) |
| usage_quantity | decimal | DBUs consumed |
| usage_start_time / usage_end_time | timestamp | Always 1-hour windows |
| usage_metadata.cluster_id | string | Populated for classic compute. NULL for serverless. |
| usage_metadata.node_type | string | Instance type. NULL for serverless. |
| usage_metadata.job_id | string | For job compute and serverless jobs. NULL for all-purpose compute. |
| usage_metadata.job_run_id | string | Same scope as job_id |
| usage_metadata.warehouse_id | string | For SQL warehouses |
| usage_metadata.notebook_id | string | For serverless notebooks |
| product_features.is_serverless | boolean | Whether this is serverless usage |
| identity_metadata.run_as | string | User who ran the workload |

Retention: Full account history. Granularity: Metered per-second, aggregated into hourly records.

Critical limitations:

  1. No per-query cost. There is no statement_id foreign key. You cannot join billing to query.history to get exact DBU cost per query. The best you can do is proportional allocation within an hourly window — which is an approximation.
  2. No cost attribution for all-purpose compute. When multiple users run notebooks on a shared cluster, usage_metadata.job_id is NULL. There is no way to attribute the hourly DBU cost to individual workloads.
  3. Serverless has no infrastructure visibility. cluster_id and node_type are both NULL. You know how many DBUs were consumed but not what hardware ran your workload.
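The proportional-allocation workaround from point 1 is simple arithmetic. A hedged sketch, assuming you have already matched queries to one hourly billing record and summed total_task_duration_ms per statement from system.query.history — the result is an estimate, not an exact per-query cost:

```python
# Sketch: split one hourly billing record across the queries that ran in
# that hour, weighted by their total task duration. This approximates
# per-query cost; system tables cannot provide the exact figure.
def allocate_hourly_dbus(hour_dbus, task_ms_by_query):
    """Return estimated DBUs per statement_id, proportional to task time."""
    total = sum(task_ms_by_query.values())
    if total == 0:
        return {q: 0.0 for q in task_ms_by_query}
    return {q: hour_dbus * ms / total for q, ms in task_ms_by_query.items()}

est = allocate_hourly_dbus(40.0, {"stmt-1": 600_000, "stmt-2": 200_000, "stmt-3": 200_000})
print(est)  # {'stmt-1': 24.0, 'stmt-2': 8.0, 'stmt-3': 8.0}
```

The weighting key is a design choice: task time charges queries for parallelism they consumed, but it ignores idle warehouse time, which this scheme silently spreads across whoever happened to be running.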

system.lakeflow.job_run_timeline — Job Execution Data

| Column | Type | Description |
|---|---|---|
| job_id | string | Job ID |
| run_id | string | Run ID |
| period_start_time / period_end_time | timestamp | Run or hourly slice |
| result_state | string | SUCCEEDED, FAILED, CANCELLED, etc. |
| setup_duration_seconds | long | Time in setup phase |
| queue_duration_seconds | long | Time queued |
| execution_duration_seconds | long | Time executing |
| cleanup_duration_seconds | long | Time in cleanup |
| trigger_type | string | CRON, CONTINUOUS, ONETIME, etc. |
| compute_ids | array | Cluster IDs used |

Retention: 365 days. Runs exceeding 1 hour are sliced into multiple rows aligned to clock-hour boundaries.

What this tells you: Job-level duration breakdown (setup, queue, execution, cleanup) and success/failure status.

What this does NOT tell you: Anything about Spark stages, tasks, shuffle, spill, or GC within the job. A 2-hour job that spent 90 minutes on a single skewed stage looks identical to a healthy 2-hour job.
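What the table can support is a coarse overhead check: how much of a run's wall clock was spent outside the execution phase. A small sketch, with the run dict standing in for a fetched job_run_timeline row:

```python
# Sketch: fraction of a run's wall clock spent queued, in setup, or in
# cleanup rather than executing. A cheap cost signal even without any
# stage-level data.
def overhead_fraction(run):
    phases = ("setup_duration_seconds", "queue_duration_seconds",
              "execution_duration_seconds", "cleanup_duration_seconds")
    total = sum(run[p] for p in phases)
    if total == 0:
        return 0.0
    return 1.0 - run["execution_duration_seconds"] / total

run = {"setup_duration_seconds": 300, "queue_duration_seconds": 500,
       "execution_duration_seconds": 1100, "cleanup_duration_seconds": 100}
print(round(overhead_fraction(run), 2))  # 0.45
```

High overhead fractions point at cluster startup or queueing problems, which these columns do capture; anything inside the execution phase remains opaque.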

system.query.history — SQL Query Metrics

This is the richest observability table, but it only covers SQL warehouses and serverless compute.

| Column | Type | Description |
|---|---|---|
| statement_id | string | Unique query ID |
| statement_text | string | Full SQL text |
| execution_status | string | FINISHED, FAILED, CANCELED |
| compute.type | string | WAREHOUSE or SERVERLESS_COMPUTE |
| total_duration_ms | bigint | Total execution time |
| execution_duration_ms | bigint | Time executing |
| compilation_duration_ms | bigint | Time optimizing |
| waiting_for_compute_duration_ms | bigint | Time waiting for compute |
| waiting_at_capacity_duration_ms | bigint | Time queued |
| total_task_duration_ms | bigint | Sum of all Spark task durations |
| read_bytes | bigint | Total data read |
| read_rows | bigint | Total rows read |
| read_files | bigint | Files read after pruning |
| pruned_files | bigint | Files eliminated by pruning |
| read_partitions | bigint | Partitions read |
| shuffle_read_bytes | bigint | Shuffle data over network |
| spilled_local_bytes | bigint | Data spilled to disk |
| written_bytes / written_rows / written_files | bigint | Output metrics |
| read_io_cache_percent | int | IO cache hit rate |
| from_result_cache | boolean | Whether result was cached |

Retention: 365 days. Scope: SQL warehouses and serverless compute only. Does NOT cover classic all-purpose or job clusters.

What this tells you: Query-level I/O, timing, shuffle reads, and spill — useful for identifying expensive queries.
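Ranking queries by spilled_local_bytes is a legitimate use of these aggregates. A minimal sketch over rows shaped like query.history records (in practice you would express this directly in SQL against system.query.history):

```python
# Sketch: rank queries by disk spill using the aggregates query.history
# does expose. This finds *which query* spilled -- not which stage.
def top_spillers(rows, n=3):
    """Return the statement_ids of the n queries with the most spill."""
    spillers = [r for r in rows if r["spilled_local_bytes"] > 0]
    spillers.sort(key=lambda r: r["spilled_local_bytes"], reverse=True)
    return [r["statement_id"] for r in spillers[:n]]

rows = [
    {"statement_id": "q1", "spilled_local_bytes": 0},
    {"statement_id": "q2", "spilled_local_bytes": 5 * 2**30},  # 5 GiB
    {"statement_id": "q3", "spilled_local_bytes": 1 * 2**30},  # 1 GiB
]
print(top_spillers(rows))  # ['q2', 'q3']
```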

What this does NOT tell you:

  • No GC time. The single most important metric for memory right-sizing is not captured.
  • No shuffle write bytes. Only shuffle reads are recorded — you cannot measure shuffle amplification.
  • No per-stage breakdown. A query that scans 10 GB but spills 5 GB on one stage looks like a single number. You cannot identify which operator caused the spill.
  • No peak memory per query. You cannot determine whether a query was close to OOM.
  • No execution plan. You cannot detect plan regressions programmatically. There is no system.query.profiles table — the Query Profile is UI-only.
  • No per-query DBU cost. There is no cost column. You must approximate by joining with hourly billing.

The Metrics That Are Completely Absent

Here is the systematic comparison. On the left: what Spark tracks internally via SparkListenerTaskEnd events. On the right: whether it appears in any system table.

Per-Executor JVM Metrics

Spark tracks these per executor via heartbeats and stage completion events:

| Metric | Available in System Tables? |
|---|---|
| JVMHeapMemory (peak heap usage) | No |
| JVMOffHeapMemory (peak off-heap) | No |
| MinorGCCount / MinorGCTime | No |
| MajorGCCount / MajorGCTime | No |
| OnHeapExecutionMemory (peak) | No |
| OffHeapExecutionMemory (peak) | No |
| OnHeapStorageMemory (peak) | No |

node_timeline provides mem_used_percent at the OS level — but this includes all processes, cannot distinguish heap from off-heap, and cannot separate Spark execution memory from storage memory. It is the wrong metric for memory right-sizing.

Per-Task Spark Metrics

Every SparkListenerTaskEnd event contains a TaskMetrics object with:

| Metric | Available in System Tables? |
|---|---|
| executorRunTime (wall clock per task) | No |
| executorCpuTime (CPU time per task) | No |
| jvmGCTime (GC time per task) | No |
| peakExecutionMemory | No |
| diskBytesSpilled | Aggregate only in query.history |
| memoryBytesSpilled | No |
| shuffleReadMetrics.remoteBytesRead | Aggregate only in query.history |
| shuffleReadMetrics.localBytesRead | No |
| shuffleReadMetrics.fetchWaitTime | No |
| shuffleWriteMetrics.bytesWritten | No |
| shuffleWriteMetrics.writeTime | No |
| inputMetrics.bytesRead | Aggregate only in query.history |
| resultSerializationTime | No |
| taskLocality (PROCESS_LOCAL, NODE_LOCAL, etc.) | No |

Every metric in this table is available in Spark event logs and the Spark UI. None of them appear in system tables. The entire TaskMetrics object from SparkListenerTaskEnd is absent from the system.compute namespace.
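To get at TaskMetrics today you must read them from the JSON event log yourself. A hedged sketch: the field names below are as Spark 3.x writes them to its event log (e.g., "Executor Run Time" in milliseconds, "Executor CPU Time" in nanoseconds) — verify them against your own logs before depending on this:

```python
import json

# Sketch: extract a few of the absent metrics from one event-log line.
# Field names follow the Spark 3.x JSON event-log format.
def task_metrics(line):
    """Return selected TaskMetrics from a SparkListenerTaskEnd line, else None."""
    event = json.loads(line)
    if event.get("Event") != "SparkListenerTaskEnd":
        return None
    m = event["Task Metrics"]
    return {
        "run_time_ms": m["Executor Run Time"],
        "cpu_time_ns": m["Executor CPU Time"],
        "gc_time_ms": m["JVM GC Time"],
        "disk_spill_bytes": m["Disk Bytes Spilled"],
    }

sample = json.dumps({
    "Event": "SparkListenerTaskEnd",
    "Task Metrics": {"Executor Run Time": 4000, "Executor CPU Time": 2_500_000_000,
                     "JVM GC Time": 600, "Disk Bytes Spilled": 0},
})
print(task_metrics(sample)["gc_time_ms"])  # 600
```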

What This Means for Cost Control

Without these metrics, you cannot:

  1. Diagnose GC pressure — you do not know if executors are spending 15% of runtime garbage collecting (the #1 signal to increase memory)
  2. Detect data skew — you cannot see per-task duration distributions to find straggler tasks
  3. Identify spill sources — you see aggregate spill per query but not which stage or operator caused it
  4. Right-size executor memory — you do not know peak JVM heap usage, so you cannot calculate recommended_memory = peak_heap * 1.15
  5. Distinguish I/O-bound from CPU-bound tasks — you cannot compute cpuTime / runTime per task
  6. Measure shuffle efficiency — you have shuffle read bytes but not local vs remote ratio or fetch wait time

These are the exact metrics that drive right-sizing formulas. System tables give you the bill. They do not give you the data to reduce it.
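Point 4's formula is plain arithmetic once peak heap is known. A sketch, with the 1.15 headroom factor taken from the rule of thumb above — treat it as a tunable, not a constant:

```python
import math

# Sketch: size executor memory from observed peak JVM heap plus ~15%
# headroom, rounded up to a whole GB. The 1.15 factor is the rule of
# thumb cited in this post, not a universal constant.
def recommended_memory_gb(peak_heap_gb, headroom=1.15):
    return math.ceil(peak_heap_gb * headroom)

print(recommended_memory_gb(11.2))  # 13
```

The input, peak JVM heap, is exactly the number system tables never record: it has to come from event logs or a custom SparkListener.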

The Delta Table Health Blind Spot

This gap is less discussed but equally impactful. Databricks system tables provide zero visibility into Delta table physical health.

What information_schema.TABLES Contains

The INFORMATION_SCHEMA.TABLES view has 15 columns:

TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, TABLE_TYPE,
IS_INSERTABLE_INTO, COMMIT_ACTION, TABLE_OWNER, COMMENT,
CREATED, CREATED_BY, LAST_ALTERED, LAST_ALTERED_BY,
DATA_SOURCE_FORMAT, STORAGE_PATH, STORAGE_SUB_DIRECTORY

Notice what is absent: no table_size, no num_files, no num_rows, no num_partitions, no avg_file_size. The entire physical layer is invisible.

What You Cannot Monitor from System Tables

| Monitoring Need | Available via System Table? | Workaround |
|---|---|---|
| Current file count per table | No | DESCRIBE DETAIL per table |
| Current table size in bytes | No | DESCRIBE DETAIL per table |
| File size distribution | No | Parse _delta_log checkpoint Parquet |
| Per-partition file count | No | Query with _metadata.file_path per table |
| Small file detection | No | Compute sizeInBytes / numFiles per table |
| Tables needing OPTIMIZE | No | Custom logic comparing file counts to thresholds |
| Tables needing VACUUM | No | ANALYZE TABLE COMPUTE STORAGE METRICS (DBR 18+, not persisted) |
| Manual OPTIMIZE/VACUUM history | No | DESCRIBE HISTORY per table |
| Partition skew | No | Per-table _metadata.file_path queries |
| Z-ORDER effectiveness | No | No metrics exposed |
| Data skipping effectiveness | No | No metrics exposed |

The only storage-related system table is system.storage.predictive_optimization_operations_history, which logs automated OPTIMIZE, VACUUM, and ANALYZE operations performed by Predictive Optimization. It does NOT log manual operations, and it does NOT show current table state — only what the automated service has done.

How You Actually Get Delta Table Health Today

You must run per-table commands individually:

-- File count and total size (single table, single point in time)
DESCRIBE DETAIL my_catalog.my_schema.my_table;
-- Returns: numFiles, sizeInBytes, partitionColumns, format, etc.

-- Operation history (single table)
DESCRIBE HISTORY my_catalog.my_schema.my_table;
-- Returns: version, timestamp, operation, operationMetrics
-- OPTIMIZE metrics: numAddedFiles, numRemovedFiles, p50FileSize, etc.

-- Storage breakdown (DBR 18+, NOT persisted)
ANALYZE TABLE my_catalog.my_schema.my_table COMPUTE STORAGE METRICS;
-- Returns: active_bytes, vacuumable_bytes, time_travel_bytes

None of these results are persisted in any system table. To monitor table health across a catalog with hundreds of tables, you must build a custom pipeline that iterates over every table, runs these commands, and stores the results yourself.

Compare This to Apache Iceberg

Iceberg provides SQL-queryable metadata tables for every table:

| Iceberg Metadata Table | What It Exposes | Delta Equivalent |
|---|---|---|
| table.files | Every data file: path, size, row count, column-level min/max/null counts | None — must parse _delta_log |
| table.partitions | Per-partition: row count, file count, total size | None — must query _metadata.file_path |
| table.snapshots | All snapshots with added/removed files and rows | DESCRIBE HISTORY (per table, not a system table) |
| table.manifests | Manifest files with file counts, sizes, partition info | None |
| table.all_data_files | Files across ALL snapshots | None |
| table.entries | Manifest entries with status, snapshot, file info | None |

On Iceberg, checking a table for small files is a single SQL query against its files metadata view. (The files view is scoped per table, so a catalog-wide sweep still loops over tables — but each check is plain SQL, with no log parsing or custom pipeline.)

SELECT COUNT(*) AS file_count,
       AVG(file_size_in_bytes) AS avg_file_size
FROM catalog.db.my_table.files
HAVING AVG(file_size_in_bytes) < 134217728 -- under 128 MB

On Delta, the same task requires iterating over every table with DESCRIBE DETAIL, parsing the results, storing them in a custom table, and building alerting on top. There is no catalog-wide query.

The Serverless Gap: Even Less Visibility

When you move to serverless compute, the observability picture gets significantly worse.

System Tables on Serverless vs Classic

| System Table | Classic Compute | Serverless |
|---|---|---|
| node_timeline (CPU, memory, disk) | Fully populated, per-node per-minute | Not populated at all |
| compute.clusters (config history) | Fully populated | Not populated |
| billing.usage.cluster_id | Populated | NULL |
| billing.usage.node_type | Populated (instance type) | NULL |
| query.history | SQL warehouses only | SQL warehouses + serverless notebooks/jobs |
| lakeflow.job_run_timeline | Populated | Populated |

On serverless, node_timeline is completely empty. There are no user-visible nodes. You cannot see CPU utilization, memory pressure, or network I/O for serverless workloads — the data simply does not exist in system tables.

What Else Disappears on Serverless

| Capability | Classic | Serverless |
|---|---|---|
| Spark UI | Full (stages, executors, DAGs, storage) | Query Profile only (operator-level DAG) |
| Spark event logs | Generated, exportable | Not generated |
| Executor logs | Accessible | Not accessible |
| Init scripts (custom monitoring) | Supported | Not supported |
| Ganglia / compute metrics UI | Available | Not available |
| Hard spending limit | Cluster policies, autoscale caps | Does not exist |

The last point is critical: there is no mechanism to set a hard spending cap on serverless compute. Budget alerts are informational only and can lag by up to 24 hours; budget policies merely apply tags for cost attribution and enforce nothing. A runaway pipeline can accumulate unlimited DBU charges before anyone notices.

Per-Query Cost on Serverless

Calculating per-query cost on serverless requires joining system.query.history with system.billing.usage by timestamp window and warehouse ID. But billing records are hourly aggregates with no statement_id foreign key, and the documentation warns that "multiple records" may be "associated with a given serverless compute workload" due to the distributed architecture.

The result is an approximation at best. Exact per-query DBU attribution is not possible with current system tables.

The Overwatch Deprecation Problem

Before system tables, Databricks Labs maintained Overwatch — an open-source Scala project that parsed Spark event logs into queryable Delta tables. Overwatch provided the metrics that system tables do not.

What Overwatch Had That System Tables Do Not

| Overwatch Gold Table | Key Metrics | System Table Equivalent |
|---|---|---|
| sparktask_gold | Per-task: shuffle bytes, spill bytes, GC time, serialization time, executor ID | None |
| sparkexecutor_gold | Per-executor: lifetime, removal reason, JVM stats | None |
| sparkstage_gold | Per-stage: runtime, accumulator lineage | None |
| sparkjob_gold | Per-Spark-job: stage IDs, job-to-notebook lineage | None |
| jobRunCostPotentialFact | Cluster utilization %, core-hours, cloud VM + DBU cost | Partial (billing.usage has DBU cost but not cloud VM cost) |
| notebookCommands_gold | Per-command: execution time, estimated cost | None |

Overwatch is now officially deprecated. The GitHub repository has 134 open issues and no active development. Databricks directs all users to system tables — but as this post documents, system tables do not cover what Overwatch provided.

The gap is being filled by third-party vendors (Unravel Data, Sync Computing, Datadog) at additional cost. But the native platform observability has regressed.

The Full Gap Matrix

| Metric | node_timeline | billing.usage | query.history | lakeflow.* | Event Logs | Spark UI |
|---|---|---|---|---|---|---|
| GC time per executor | No | No | No | No | Yes | Yes |
| Peak JVM heap per executor | No | No | No | No | Yes | Yes |
| Per-task shuffle read/write | No | No | Aggregate read only | No | Yes | Yes |
| Per-stage disk spill | No | No | Aggregate only | No | Yes | Yes |
| CPU time vs wall time per task | No | No | No | No | Yes | Yes |
| Peak execution memory | No | No | No | No | Yes | Yes |
| Task duration distribution (skew) | No | No | No | No | Yes | Yes |
| Executor count per job | No | No | No | No | Yes | Yes |
| OS CPU/memory per node | Yes | No | No | No | No | No |
| DBU cost per hour | No | Yes | No | No | No | No |
| Query I/O (bytes, rows, files) | No | No | Yes | No | No | Yes |
| Job duration breakdown | No | No | No | Yes | No | No |
| Delta table file count | No | No | No | No | No | No |
| Delta table health | No | No | No | No | No | No |

The pattern is clear: system tables cover platform operations (billing, job scheduling, node hardware). Event logs and Spark UI cover engine performance (GC, shuffle, spill, skew, memory). Delta table health falls through both.

How Cazpian Handles This

On Cazpian, the observability gap described above does not exist.

Every Metric, Collected Automatically

When a Spark job or SQL query completes on Cazpian, every metric covered in this post is immediately available — no event log parsing, no custom SparkListener, no per-table DESCRIBE DETAIL iteration:

  • GC time per executor and per stage
  • Peak JVM heap memory against allocated memory
  • Per-task shuffle read/write with local vs remote breakdown
  • Per-stage disk spill — which stage, which operator, how much
  • CPU time vs wall time per task — I/O-bound vs compute-bound classification
  • Task duration distribution with automatic skew detection
  • Delta/Iceberg table health — file count, file size distribution, partition skew, compaction status
  • Exact cost per query and per job — not an hourly approximation

AI-Driven Recommendations

Cazpian's AI analyzes the collected metrics and surfaces actionable recommendations:

  • Right-sizing: Specific executor memory and core count changes based on actual peak JVM heap usage — not OS-level mem_used_percent
  • Skew remediation: Identifies skewed stages from task duration distributions and recommends salting, AQE tuning, or repartitioning
  • Spill elimination: Traces disk spill to the responsible stage and suggests the minimum memory increase
  • Table health: Identifies tables needing compaction, detects small file accumulation, and flags partition skew — across the entire catalog, not table by table

Full Visibility, Not a Black Box

On Databricks serverless, you lose node_timeline, Spark UI, event logs, and executor logs. The infrastructure is a black box.

On Cazpian, you get the same immediate startup and managed infrastructure — but with full visibility into what the infrastructure is doing. Instance types, executor counts, JVM metrics, and per-task breakdowns are all visible. You do not trade observability for convenience.

What You Can Do Today on Databricks

If you are on Databricks and need better observability, here are the available options. Note that most of these only work on classic all-purpose or job compute — serverless compute blocks event logs, Spark configs, and init scripts entirely.

1. Enable Spark Event Logs (Classic/Job Compute Only)

The most impactful single change — but only available on classic all-purpose and job clusters. Serverless compute does not generate Spark event logs. Configure cluster log delivery to write event logs to cloud storage, then parse them:

spark.eventLog.enabled                  true
spark.eventLog.dir                      s3a://my-bucket/spark-event-logs/
spark.eventLog.logStageExecutorMetrics  true

This gives you access to every metric in the "absent from system tables" list. You need a parser (PySpark or a History Server) to make the data queryable. If you are on serverless, this option is not available — you are limited to system.query.history aggregates.
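As a sketch of what such a parser looks like, here is per-executor GC fraction (the right-sizing signal discussed earlier) computed from event-log lines with only the standard library. Field names are as Spark 3.x emits them in its JSON event log; confirm against your own logs:

```python
import json
from collections import defaultdict

# Sketch: compute the per-executor GC fraction that system tables never
# expose, from raw event-log lines. Field names follow the Spark 3.x
# JSON event-log format.
def gc_fraction_by_executor(lines):
    """Return {executor_id: gc_time / run_time} aggregated over all tasks."""
    run_ms = defaultdict(int)
    gc_ms = defaultdict(int)
    for line in lines:
        event = json.loads(line)
        if event.get("Event") != "SparkListenerTaskEnd":
            continue
        ex = event["Task Info"]["Executor ID"]
        m = event["Task Metrics"]
        run_ms[ex] += m["Executor Run Time"]
        gc_ms[ex] += m["JVM GC Time"]
    return {ex: gc_ms[ex] / run_ms[ex] for ex in run_ms if run_ms[ex]}

lines = [json.dumps({
    "Event": "SparkListenerTaskEnd",
    "Task Info": {"Executor ID": "1"},
    "Task Metrics": {"Executor Run Time": 10_000, "JVM GC Time": 1_500},
})]
print(gc_fraction_by_executor(lines))  # {'1': 0.15}
```

An executor sustaining a fraction above roughly 0.10-0.15 is the classic signal to add memory; that threshold is a rule of thumb, not a Spark constant.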

2. Build a Delta Table Health Pipeline

Iterate over tables in information_schema.TABLES, run DESCRIBE DETAIL on each, and persist results to a metrics table:

from pyspark.sql.functions import current_timestamp

tables = spark.sql("""
    SELECT table_catalog, table_schema, table_name
    FROM system.information_schema.tables
    WHERE data_source_format = 'DELTA'
      AND table_type = 'MANAGED'
""").collect()

results = []
for t in tables:
    fqn = f"{t.table_catalog}.{t.table_schema}.{t.table_name}"
    try:
        detail = spark.sql(f"DESCRIBE DETAIL {fqn}").first()
        results.append((fqn, detail.numFiles, detail.sizeInBytes))
    except Exception:
        pass  # skip tables the job cannot read (permissions, concurrent drops)

health_df = spark.createDataFrame(
    results, ["table_name", "num_files", "size_bytes"]
).withColumn("snapshot_time", current_timestamp())

health_df.write.mode("append").saveAsTable("ops.delta_table_health")

Schedule this daily. Alert when size_bytes / num_files drops below 128 MB (small file problem) or when num_files exceeds your threshold.
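The alert condition fits in a few lines. In this sketch, the 128 MB average-file-size floor and the 10,000-file ceiling are illustrative thresholds, not Databricks guidance — tune them to your workload:

```python
# Sketch of the alerting rule described above, applied to one snapshot
# row from the health table. Thresholds are illustrative defaults.
def health_flags(num_files, size_bytes,
                 min_avg_bytes=128 * 1024 * 1024, max_files=10_000):
    """Return small-file and file-count flags for one table snapshot."""
    avg = size_bytes / num_files if num_files else 0
    return {
        "small_files": num_files > 1 and avg < min_avg_bytes,
        "too_many_files": num_files > max_files,
    }

print(health_flags(num_files=5_000, size_bytes=50 * 2**30))
# {'small_files': True, 'too_many_files': False}
```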

3. Use Third-Party Tools

Unravel Data, Sync Computing, and Datadog all provide deeper Databricks observability than native system tables. They fill the gap at additional cost.

4. Accept the Limitations

For teams that primarily run SQL warehouse queries and only need query-level aggregate metrics, system.query.history may be sufficient. The gap mainly affects teams that need engine-level diagnostics — memory right-sizing, shuffle optimization, skew detection, and table health monitoring.

What's Next

This post documented what Databricks system tables contain and what they are missing. For the complete guide to collecting the metrics that system tables do not provide, see: