Apache Iceberg Open Table Format Cheat Sheet

Updated 2026-05-15

Apache Iceberg is an open-source table format designed for large analytic datasets on cloud object storage (S3, ADLS, GCS), developed originally at Netflix and now an Apache Top-Level Project. It brings ACID transactions, snapshot isolation, schema evolution, and time travel to data lakes by layering a metadata-driven transactional model over immutable data files. Iceberg decouples the physical layout (Parquet/ORC/Avro files) from the logical table structure, enabling features like hidden partitioning, partition evolution, and multi-engine interoperability (Spark, Flink, Trino, Snowflake, BigQuery, Hive, Presto, Dremio, Athena, EMR). One key architectural principle: every write creates a new snapshot — an immutable point-in-time view of the table captured in a manifest list, enabling time travel, versioning, and zero-downtime concurrent writes.

What This Cheat Sheet Covers

This cheat sheet spans 16 focused tables and 102 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.

Table 1: Core Concepts and Architecture
Table 2: Hidden Partitioning and Partition Transforms
Table 3: Schema Evolution
Table 4: Time Travel and Snapshots
Table 5: Row-Level Operations and Deletes
Table 6: Branching, Tagging, and WAP
Table 7: Catalogs and Metadata Management
Table 8: Compaction and Maintenance
Table 9: Format Versions and Spec Evolution
Table 10: Query Optimization and Performance
Table 11: Engine Integrations
Table 12: Write Modes and Distribution
Table 13: Stored Procedures and Actions
Table 14: Python and Ecosystem Libraries
Table 15: Metadata Tables and Introspection
Table 16: Puffin Statistics and Advanced Metadata

Table 1: Core Concepts and Architecture

To understand Iceberg you have to understand its layered metadata, because that layer is what turns a pile of Parquet files in object storage into a real ACID table. These terms — snapshot, manifest list, manifest, metadata file, and catalog — name each rung of that hierarchy, and the optimistic-concurrency model that lets writers and readers work without ever blocking each other falls out of it.
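The optimistic-concurrency idea can be sketched in a few lines of plain Python. This is a toy model, not Iceberg's API: `TableMetadata`, `InMemoryCatalog`, `compare_and_swap`, and `commit_append` are hypothetical names, and a thread lock stands in for whatever atomic swap a real catalog provides. Each writer builds new metadata from the version it read, then asks the catalog to swap the pointer only if that version is still current; a writer that loses the race rereads and retries.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class TableMetadata:
    """Hypothetical stand-in for one immutable metadata file."""
    version: int
    snapshot_ids: tuple = ()


class InMemoryCatalog:
    """Toy catalog: holds a single pointer to the current metadata."""

    def __init__(self):
        self._current = TableMetadata(version=0)
        self._lock = threading.Lock()  # assumption: stands in for the catalog's atomic swap

    def load(self):
        return self._current

    def compare_and_swap(self, expected, new):
        # The swap succeeds only if nobody committed since `expected` was read.
        with self._lock:
            if self._current is not expected:
                return False
            self._current = new
            return True


def commit_append(catalog, snapshot_id, max_retries=5):
    """Add one snapshot, retrying on fresh metadata when the swap loses a race."""
    for _ in range(max_retries):
        base = catalog.load()
        new = TableMetadata(base.version + 1, base.snapshot_ids + (snapshot_id,))
        if catalog.compare_and_swap(base, new):
            return new
    raise RuntimeError("commit failed after retries")


catalog = InMemoryCatalog()
reader_view = catalog.load()  # a reader pins the metadata it started with

threads = [threading.Thread(target=commit_append, args=(catalog, i)) for i in range(1, 5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

final = catalog.load()
print(final.version, sorted(final.snapshot_ids), reader_view.version)
```

Because metadata objects are never modified in place, the reader that loaded version 0 keeps a consistent view while the four writers commit, which is the non-blocking behavior described above.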

Concept	Example	Description
Table Format	Iceberg is a format spec defining how to organize data files, metadata files, manifest lists, manifests, and snapshots into a logical table	• Not a storage engine or query engine, but a metadata layer that sits on top of Parquet/ORC/Avro files stored in object storage • provides table semantics, schema, partition layout, and consistent snapshots
Snapshot	Each write creates snapshot `5237498123985123` with manifest list `s3://bucket/snap-5237498123985123.avro`	• Immutable point-in-time view of a table • captures the state of all data files at commit time • every transaction produces a new snapshot, enabling time travel, rollback, and ACID guarantees
Manifest List	Avro file `snap-123.avro` references manifests: `manifest-1.avro`, `manifest-2.avro`, ...	• Top-level metadata file for a snapshot • lists every manifest file in the snapshot together with per-manifest summaries (added/existing/deleted file and row counts, partition value bounds) • lets the planner skip whole manifests that cannot match a partition filter, without opening them
Manifest File	`manifest-1.avro` tracks 50 data files: `data-001.parquet`, `data-002.parquet`, ... with column stats (lower/upper bounds, null counts, value counts)	• Tracks individual data files and their file-level statistics (bounds, null counts, row counts) • reused across snapshots to avoid rewriting unchanged metadata • critical for predicate pushdown and file pruning
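
The two levels of pruning described in the table can be illustrated with a small, self-contained Python model. It is a simplified sketch rather than the real planner: the classes, the `plan_files` helper, and the `event_day` column are hypothetical, and it assumes the table is partitioned on that same column so that the manifest-level summary and the file-level bounds describe the same values. The planner first checks each manifest's summary range and only opens manifests that might match, then checks per-file bounds inside them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    record_count: int
    lower: int  # lower bound of the hypothetical `event_day` column in this file
    upper: int  # upper bound of the same column


@dataclass(frozen=True)
class Manifest:
    path: str
    files: tuple  # DataFile entries tracked by this manifest

    def bounds(self):
        # Summary range a manifest list would record for this manifest.
        return min(f.lower for f in self.files), max(f.upper for f in self.files)


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    manifests: tuple  # stands in for the manifest list


def overlaps(lower, upper, lo, hi):
    """True when [lower, upper] and [lo, hi] share at least one value."""
    return lower <= hi and upper >= lo


def plan_files(snapshot, lo, hi):
    """Return (paths of files that may hold lo <= event_day <= hi, manifests opened)."""
    selected, manifests_opened = [], 0
    for manifest in snapshot.manifests:
        m_lower, m_upper = manifest.bounds()
        if not overlaps(m_lower, m_upper, lo, hi):
            continue  # whole manifest skipped without reading its entries
        manifests_opened += 1
        for f in manifest.files:
            if overlaps(f.lower, f.upper, lo, hi):
                selected.append(f.path)
    return selected, manifests_opened


m1 = Manifest("manifest-1.avro", (
    DataFile("data-001.parquet", 1000, 1, 10),
    DataFile("data-002.parquet", 1200, 11, 20),
))
m2 = Manifest("manifest-2.avro", (
    DataFile("data-003.parquet", 900, 21, 30),
))
snap = Snapshot(5237498123985123, (m1, m2))

files, opened = plan_files(snap, 12, 15)
print(files, opened)  # ['data-002.parquet'] 1
```

For the filter `12 <= event_day <= 15`, only `manifest-1.avro` is opened and only `data-002.parquet` is scheduled for reading; the other two files are eliminated from metadata alone.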
