Apache Iceberg is an open table format for managing large-scale analytic datasets on object storage (S3, ADLS, GCS). Originally developed at Netflix to address Hive limitations, Iceberg brings ACID transactions, schema evolution, and time travel to data lakes, enabling reliable lakehouse architectures. Unlike file formats (Parquet, Avro, ORC), Iceberg defines how data files are organized into logical tables with consistent point-in-time snapshots. The key insight: Iceberg replaces expensive directory listings with a three-layer metadata tree (metadata JSON → manifest lists → manifest files) that points to the data files, so query planning reads metadata instead of listing storage. This is what lets production deployments manage petabyte-scale tables with tens of millions of files. What distinguishes Iceberg is hidden partitioning (users query raw values, transforms happen transparently), partition evolution (change partitioning without rewriting existing data), and vendor-neutral governance under the Apache Software Foundation, with broad multi-engine support across Spark, Flink, Trino, Snowflake, BigQuery, and DuckDB.
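The metadata tree described above can be illustrated with a toy in-memory model. This is a minimal sketch, not the Iceberg library or its API: the class names, the `plan_scan` helper, the integer `day` partition value, and the file paths are all hypothetical, chosen only to show how a planner could walk from table metadata to a snapshot, then to manifests, then to data files, pruning on per-manifest partition ranges without listing any directories.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    path: str
    day: int            # hypothetical partition value for this file
    record_count: int


@dataclass
class Manifest:
    path: str
    data_files: list

    # Partition range summary, the kind of information a manifest list
    # keeps per manifest so whole manifests can be skipped.
    @property
    def day_min(self) -> int:
        return min(f.day for f in self.data_files)

    @property
    def day_max(self) -> int:
        return max(f.day for f in self.data_files)


@dataclass
class Snapshot:
    snapshot_id: int
    manifests: list     # stands in for the manifest list file


@dataclass
class TableMetadata:
    current_snapshot_id: int
    snapshots: dict     # snapshot_id -> Snapshot


def plan_scan(table, day_lo, day_hi, snapshot_id=None):
    """Return data file paths whose partition day falls in [day_lo, day_hi]."""
    if snapshot_id is None:
        snapshot_id = table.current_snapshot_id
    snapshot = table.snapshots[snapshot_id]
    selected = []
    for manifest in snapshot.manifests:
        # Layer 2: skip a whole manifest if its range cannot match.
        if manifest.day_max < day_lo or manifest.day_min > day_hi:
            continue
        # Layer 3: check individual file entries inside the manifest.
        for f in manifest.data_files:
            if day_lo <= f.day <= day_hi:
                selected.append(f.path)
    return selected


# Two commits: snapshot 1 sees only m0, snapshot 2 sees m0 and m1.
m0 = Manifest("m0.avro", [DataFile("data/day=1/a.parquet", 1, 100),
                          DataFile("data/day=2/b.parquet", 2, 120)])
m1 = Manifest("m1.avro", [DataFile("data/day=10/c.parquet", 10, 90),
                          DataFile("data/day=11/d.parquet", 11, 80)])
table = TableMetadata(
    current_snapshot_id=2,
    snapshots={1: Snapshot(1, [m0]), 2: Snapshot(2, [m0, m1])},
)

current_files = plan_scan(table, 10, 10)
old_files = plan_scan(table, 1, 11, snapshot_id=1)
```

Here `current_files` is planned against the latest snapshot and never opens `m0`, while `old_files` reads the table as of snapshot 1, which is the idea behind time travel: older snapshots stay addressable because commits add new metadata rather than mutating existing files.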
What This Cheat Sheet Covers
This cheat sheet spans 23 focused tables and 179 indexed concepts. Below is a complete table-by-table outline, from foundational concepts through advanced details.
Table 1: Core Table Format Concepts
| Concept | Example | Description |
|---|---|---|
| Table format | Defines schema, partitioning, snapshots for data files | • Open specification for organizing raw data files (Parquet/Avro/ORC) into logical tables with ACID semantics • separates metadata from data storage |
| Metadata layers | `metadata/v1.metadata.json` | Three-layer architecture: metadata JSON (schema, snapshots) → manifest lists (snapshot metadata) → manifest files (file statistics) → data files |
| Snapshot | `snapshot_id=8744736658442914487` | • Immutable point-in-time view of table • created on every commit • enables time travel and rollback |
| Manifest list | `snap-8744736658442914487-1-abc123.avro` | • Avro file containing references to all manifest files for a snapshot • tracks partition ranges and file counts per manifest |
| Manifest file | `abc123-m0.avro` | Avro file tracking a subset of data files with per-file statistics (path, partition, record count, min/max, null counts) |