Apache Iceberg is an open table format for managing large-scale analytic datasets on object storage (S3, ADLS, GCS). Originally developed at Netflix to address Hive limitations, Iceberg brings ACID transactions, schema evolution, and time travel to data lakes, enabling reliable lakehouse architectures. Unlike file formats (Parquet, Avro, ORC), Iceberg defines how data files are organized into logical tables with consistent point-in-time snapshots. The key insight: Iceberg replaces expensive directory listings with a three-layer metadata tree (metadata JSON → manifest lists → manifest files → data files), enabling massive scalability—production deployments manage petabyte-scale tables with tens of millions of files. What distinguishes Iceberg is hidden partitioning (users query raw values, transforms happen transparently), partition evolution (change partitioning without rewriting data), and vendor-neutral governance under the Apache Software Foundation with the broadest multi-engine support across Spark, Flink, Trino, Snowflake, BigQuery, and DuckDB.
Share this article