Mastering Iceberg File Sizes: How Spark Write Controls and Table Optimization Prevent the Small File Nightmare
Every data engineer who has worked with Apache Iceberg at scale has hit the same wall: query performance that mysteriously degrades over time. The dashboards that used to load in two seconds now take twenty. The Spark jobs that used to finish in minutes now crawl for an hour. The root cause is almost always the same: thousands of tiny files have silently accumulated in your Iceberg tables.
The small file problem is not unique to Iceberg. But Iceberg gives you an unusually powerful set of tools to prevent it at the write layer and fix it at the maintenance layer. The catch is that most teams never configure these controls properly — or do not even know they exist.
