17 posts tagged with "Architecture"

Schema Evolution in Apache Iceberg: The Feature That Saves Data Teams Thousands of Hours

· 10 min read
Cazpian Engineering
Platform Engineering Team

Every data engineer has lived this nightmare: a product team needs a new field in the events table. In a traditional data warehouse, this means a migration ticket, a maintenance window, potentially hours of data rewriting, and a prayer that no downstream pipeline breaks. In a Hive-based data lake, it is even worse — you add the column, but old Parquet files do not have it, partition metadata gets confused, and three different teams spend a week debugging null values.

Apache Iceberg eliminates this entire class of problems. Schema evolution in Iceberg is a metadata-only operation. No data rewrites. No downtime. No broken queries. And the mechanism that makes this possible is both simple and elegant.
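The excerpt does not spell out the mechanism. One way to picture it, assuming the commonly described design in which Iceberg tracks every column by a stable field ID rather than by name or position, is the toy sketch below. `Field` and `read_row` are hypothetical names used only for illustration; they are not Iceberg APIs, and real readers do this resolution inside the file-format layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """A column as recorded in schema metadata (hypothetical, simplified)."""
    field_id: int
    name: str


def read_row(file_fields, file_row, table_schema):
    """Project a row from an older data file onto the current table schema.

    Values are matched by field ID, so a rename only changes the output key
    and a column the file never had resolves to None.
    """
    by_id = {f.field_id: value for f, value in zip(file_fields, file_row)}
    return {f.name: by_id.get(f.field_id) for f in table_schema}


# A data file written before the schema change.
old_file_fields = [Field(1, "user_id"), Field(2, "event")]
old_row = (42, "click")

# Current schema: "event" was renamed to "event_type" and "region" was added.
current_schema = [Field(1, "user_id"), Field(2, "event_type"), Field(3, "region")]

result = read_row(old_file_fields, old_row, current_schema)
print(result)
```

In this sketch the old file is never touched: only the schema list changes, which is the sense in which the operation is metadata-only.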

Apache Polaris: How Policy-Managed Table Maintenance Eliminates Iceberg Operational Overhead

· 12 min read
Cazpian Engineering
Platform Engineering Team

In our previous post, we covered how to control Iceberg file sizes at write time and how to fix small file problems with Iceberg's table maintenance procedures. The conclusion was clear: the tools are excellent, but manually scheduling and managing maintenance across dozens or hundreds of tables does not scale.

This post is about the layer that solves that problem: Apache Polaris — the open-source Iceberg catalog that introduces policy-based table maintenance, letting you define optimization rules once and have them applied automatically across your entire lakehouse.

Mastering Iceberg File Sizes: How Spark Write Controls and Table Optimization Prevent the Small File Nightmare

· 13 min read
Cazpian Engineering
Platform Engineering Team

Every data engineer who has worked with Apache Iceberg at scale has hit the same wall: query performance that mysteriously degrades over time. The dashboards that used to load in two seconds now take twenty. The Spark jobs that processed in minutes now crawl for an hour. The root cause, almost always, is the same — thousands of tiny files have silently accumulated in your Iceberg tables.

The small file problem is not unique to Iceberg. But Iceberg gives you an unusually powerful set of tools to prevent it at the write layer and fix it at the maintenance layer. The catch is that most teams never configure these controls properly — or do not even know they exist.

Why Every Data Company Is Betting on Apache Iceberg — And What It Means for AI

· 13 min read
Cazpian Engineering
Platform Engineering Team

Something unusual is happening in the data industry. Companies that have spent years — and billions of dollars — building proprietary storage formats are now rallying behind an open-source table format created at Netflix. Snowflake, Databricks, Dremio, Starburst, Teradata, Google BigQuery, AWS — the list keeps growing. They are not just adding Iceberg as a checkbox feature. They are making it central to their platform strategy.

If you are a data engineer, you have almost certainly heard of Apache Iceberg by now. But the more interesting question is not what Iceberg is — it is why every major vendor has decided that their own proprietary format is no longer enough.

One Engine, Two Access Paths: How Arrow Flight SQL Makes a Single-Engine Lakehouse Possible

· 14 min read
Cazpian Engineering
Platform Engineering Team

In our previous post, we broke down the five hidden costs of running two compute engines in your lakehouse — the infrastructure duplication, the cost opacity, the metadata sync bugs, the skills fragmentation, and the governance headaches. We showed that this dual-engine tax can run $40,000+ per year for a mid-size data team.

The obvious question: why not just use Spark for everything?

The honest answer has always been: because Spark cannot deliver query results to BI tools fast enough. Not because Spark cannot execute the query — it usually can — but because the last mile of data delivery through traditional JDBC/ODBC protocols is painfully slow.

Arrow Flight SQL eliminates that bottleneck. And with it, the primary architectural reason for running a second query engine disappears.

Why Your Data Platform Runs Two Engines — And Why That's Costing You

· 11 min read
Cazpian Engineering
Platform Engineering Team

Take an honest look at your data platform architecture. If you are running a lakehouse on AWS, there is a good chance it looks something like this: Spark clusters for ETL and data engineering, plus Trino (or Dremio, or Presto) clusters for analytics and BI queries. Two engines, two teams, two bills — all pointed at the same data.

This dual-runtime pattern has become the default architecture for most modern data platforms. And on the surface, it makes sense. Spark is great at processing data. Trino is great at querying it. Each engine solves a real problem.

But running two engines has hidden costs that most organizations never quantify — and once you add them up, the number is hard to ignore.