Apache Iceberg is an open-source table format designed for large analytic datasets on cloud object storage (S3, ADLS, GCS), originally developed at Netflix and now an Apache Top-Level Project. It brings ACID transactions, snapshot isolation, schema evolution, and time travel to data lakes by layering a metadata-driven transactional model over immutable data files. Iceberg decouples the physical layout (Parquet/ORC/Avro files) from the logical table structure, enabling features like hidden partitioning, partition evolution, and multi-engine interoperability (Spark, Flink, Trino, Snowflake, BigQuery, Hive, Presto, Dremio, Athena, EMR). One key architectural principle: every write creates a new snapshot, an immutable point-in-time view of the table captured in a manifest list, which enables time travel, versioning, and zero-downtime concurrent writes.
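
The snapshot-per-write principle can be illustrated with a small toy model. This is a sketch in plain Python, not the Iceberg API: `ToyTable`, `Snapshot`, `append`, `scan`, and `rollback` are hypothetical names chosen for illustration, and the sequential snapshot IDs are an assumption made for readability.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Immutable point-in-time view: the set of data files visible at commit."""
    snapshot_id: int
    parent_id: Optional[int]
    data_files: Tuple[str, ...]


@dataclass
class ToyTable:
    """Hypothetical stand-in for a table; only the current pointer ever changes."""
    snapshots: Dict[int, Snapshot] = field(default_factory=dict)
    current_id: Optional[int] = None

    def append(self, new_files: List[str]) -> int:
        # A write never mutates an existing snapshot; it commits a new one.
        parent = self.snapshots.get(self.current_id) if self.current_id else None
        existing = parent.data_files if parent else ()
        snap = Snapshot(
            snapshot_id=len(self.snapshots) + 1,  # assumption: sequential IDs for readability
            parent_id=self.current_id,
            data_files=existing + tuple(new_files),
        )
        self.snapshots[snap.snapshot_id] = snap
        self.current_id = snap.snapshot_id
        return snap.snapshot_id

    def scan(self, snapshot_id: Optional[int] = None) -> List[str]:
        # Time travel: read any retained snapshot instead of the current one.
        target = self.current_id if snapshot_id is None else snapshot_id
        if target is None:
            return []
        return list(self.snapshots[target].data_files)

    def rollback(self, snapshot_id: int) -> None:
        # Rollback only moves the current pointer; no data files are rewritten.
        if snapshot_id not in self.snapshots:
            raise KeyError(f"unknown snapshot {snapshot_id}")
        self.current_id = snapshot_id


table = ToyTable()
first = table.append(["data-001.parquet"])
second = table.append(["data-002.parquet"])

current_files = table.scan()        # both files are visible
old_files = table.scan(first)       # time travel to the first snapshot
table.rollback(first)
after_rollback = table.scan()       # back to the first snapshot's view
```

Because old snapshots are never modified, readers holding an earlier snapshot keep a consistent view while new commits land, which is the intuition behind snapshot isolation in this model.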
What This Cheat Sheet Covers
This topic spans 16 focused tables and 102 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Core Concepts and Architecture
| Concept | Example | Description |
|---|---|---|
| Table format | Iceberg is a format spec defining how to organize data files, metadata files, manifest lists, manifests, and snapshots into a logical table | Not a storage engine or query engine, but a metadata layer that sits on top of Parquet/ORC/Avro files stored in object storage; provides table semantics, schema, partition layout, and consistent snapshots |
| Snapshot | Each write creates a snapshot such as 5237498123985123 with manifest list `s3://bucket/snap-5237498123985123.avro` | Immutable point-in-time view of a table; captures the state of all data files at commit time; every transaction produces a new snapshot, enabling time travel, rollback, and ACID guarantees |
| Manifest list | Avro file `snap-123.avro` references manifests: `manifest-1.avro`, `manifest-2.avro`, ... | Top-level metadata file for a snapshot; lists every manifest file along with per-manifest summaries (added/existing/deleted file and row counts, partition value bounds); lets the planner skip whole manifests during partition pruning without opening them |
| Manifest file | `manifest-1.avro` tracks 50 data files: `data-001.parquet`, `data-002.parquet`, ... with per-column stats (lower/upper bounds, value counts, null counts) | Tracks individual data files and their file-level statistics (bounds, null counts, row counts); reused across snapshots to avoid rewriting unchanged metadata; critical for predicate pushdown and file pruning |
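
The two pruning levels described in the table (manifest list first, then manifest files) can be sketched as a toy scan planner. This is a simplified illustration in plain Python, not Iceberg's planner or any library API: `DataFile`, `Manifest`, `ManifestList`, and `plan_scan` are hypothetical names, and the single date partition and single integer column are assumptions made to keep the example short.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DataFile:
    path: str
    lower: int  # lower bound of the filtered column within this file
    upper: int  # upper bound of the filtered column within this file


@dataclass(frozen=True)
class Manifest:
    path: str
    partition_lower: str  # smallest partition value covered (ISO date string)
    partition_upper: str  # largest partition value covered (ISO date string)
    files: Tuple[DataFile, ...]


@dataclass(frozen=True)
class ManifestList:
    snapshot_id: int
    manifests: Tuple[Manifest, ...]


def plan_scan(mlist: ManifestList, partition: str, value: int) -> Tuple[List[str], List[str]]:
    """Return (manifests opened, data files selected) for an equality filter."""
    opened, selected = [], []
    for m in mlist.manifests:
        # Level 1: the partition summary in the manifest list rules out
        # whole manifests, so they are never opened.
        if not (m.partition_lower <= partition <= m.partition_upper):
            continue
        opened.append(m.path)
        for f in m.files:
            # Level 2: per-file column bounds rule out individual data files.
            if f.lower <= value <= f.upper:
                selected.append(f.path)
    return opened, selected


snapshot = ManifestList(
    snapshot_id=5237498123985123,
    manifests=(
        Manifest("manifest-1.avro", "2024-01-01", "2024-01-31", (
            DataFile("data-001.parquet", 1, 100),
            DataFile("data-002.parquet", 101, 200),
        )),
        Manifest("manifest-2.avro", "2024-02-01", "2024-02-29", (
            DataFile("data-003.parquet", 1, 150),
        )),
    ),
)

opened, selected = plan_scan(snapshot, partition="2024-01-15", value=120)
print(opened)    # ['manifest-1.avro']
print(selected)  # ['data-002.parquet']
```

Only one of the two manifests is opened and only one of the three data files is scheduled for reading, which shows why these statistics matter for planning cost on object storage.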