16 posts tagged with "Cazpian"

Apache Polaris: How Policy-Managed Table Maintenance Eliminates Iceberg Operational Overhead

· 12 min read
Cazpian Engineering
Platform Engineering Team

In our previous post, we covered how to control Iceberg file sizes at write time and how to fix small file problems with Iceberg's table maintenance procedures. The conclusion was clear: the tools are excellent, but manually scheduling and managing maintenance across dozens or hundreds of tables does not scale.

This post is about the layer that solves that problem: Apache Polaris — the open-source Iceberg catalog that introduces policy-based table maintenance, letting you define optimization rules once and have them applied automatically across your entire lakehouse.

One Engine, Two Access Paths: How Arrow Flight SQL Makes a Single-Engine Lakehouse Possible

· 14 min read
Cazpian Engineering
Platform Engineering Team

In our previous post, we broke down the five hidden costs of running two compute engines in your lakehouse — the infrastructure duplication, the cost opacity, the metadata sync bugs, the skills fragmentation, and the governance headaches. We showed that this dual-engine tax can run $40,000+ per year for a mid-size data team.

The obvious question: why not just use Spark for everything?

The honest answer has always been: because Spark cannot deliver query results to BI tools fast enough. Not because Spark cannot execute the query — it usually can — but because the last mile of data delivery through traditional JDBC/ODBC protocols is painfully slow.

Arrow Flight SQL eliminates that bottleneck. And with it, the primary architectural reason for running a second query engine disappears.

Databricks vs. EMR vs. Cazpian: The 2026 Compute Cost Showdown

· 13 min read
Cazpian Engineering
Platform Engineering Team

"Which platform is cheapest for Spark?" is one of the most common questions data teams ask — and one of the most misleading. The honest answer is: it depends entirely on your workload shape.

A platform that saves you thousands on large nightly batch jobs might quietly waste thousands on your fleet of small ETL runs. The billing model that looks transparent at first glance might hide costs in cold starts, minimum increments, or idle compute you never asked for.

In this post — Part 3 of our compute cost series — we compare Databricks, Amazon EMR, and Cazpian across three realistic workload scenarios. No hypotheticals. Real pricing. Real math.

Zero Cold Starts: How Cazpian Compute Pools Cut Your Spark Bills in Half

· 11 min read
Cazpian Engineering
Platform Engineering Team

In Part 1 of this series, we exposed the Small Job Tax — the hidden cost of cold starts, overprovisioned clusters, and per-job infrastructure overhead that silently drains data budgets. We showed that for many teams, more than half of their Spark compute spend goes to infrastructure bootstrapping, not actual data processing.

The natural follow-up question: what if you could eliminate that overhead entirely?

That is exactly what Cazpian Compute Pools are built to do.

Introducing Cazpian: An AWS-first Lakehouse Platform

· 1 min read

We are excited to announce Cazpian, a new kind of data platform built from the ground up for AWS.

Data teams face a constant struggle: managing massive amounts of data without getting bogged down in infrastructure complexity. Cazpian addresses this by combining Apache Iceberg and Apache Spark into a single managed experience.