Apache Iceberg Cheat Sheet

Updated 2026-05-28

Apache Iceberg is an open table format for managing large-scale analytic datasets on object storage (S3, ADLS, GCS). Originally developed at Netflix to address Hive limitations, Iceberg brings ACID transactions, schema evolution, and time travel to data lakes, enabling reliable lakehouse architectures. Unlike file formats (Parquet, Avro, ORC), Iceberg defines how data files are organized into logical tables with consistent point-in-time snapshots.

The key insight: Iceberg replaces expensive directory listings with a metadata tree of three layers above the data files (metadata JSON → manifest lists → manifest files → data files). Query planning reads this tree instead of listing directories, which is what lets production deployments manage petabyte-scale tables with tens of millions of files.

What distinguishes Iceberg is hidden partitioning (users query raw values, transforms happen transparently), partition evolution (change partitioning without rewriting data), and vendor-neutral governance under the Apache Software Foundation, with broad multi-engine support across Spark, Flink, Trino, Snowflake, BigQuery, and DuckDB. Apache Iceberg 1.11.0 (released May 2026) marked the general availability (GA) of format version 3 (V3), delivering deletion vectors, the variant type, nanosecond timestamps, geospatial types, row lineage, and the pluggable File Format API.
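
To make hidden partitioning more concrete, here is a minimal sketch in plain Python of how a day-style transform could map raw timestamps to partition values, and how a filter on the raw column could be translated into a set of partitions to read. This is a toy model and not Iceberg's API: the names `day_transform`, `group_by_day`, `prune`, and the `event_ts` column are hypothetical, and the sketch assumes timezone-aware UTC timestamps.

```python
from collections import defaultdict
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def day_transform(ts: datetime) -> int:
    """Map a timezone-aware timestamp to whole days since 1970-01-01 UTC."""
    return (ts - EPOCH).days


def group_by_day(rows):
    """Writer side (hypothetical): derive the partition value from the raw column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[day_transform(row["event_ts"])].append(row)
    return dict(partitions)


def prune(partitions, start: datetime, end: datetime):
    """Reader side (hypothetical): turn a filter on raw timestamps
    (start <= event_ts <= end) into the partition values worth scanning."""
    lo, hi = day_transform(start), day_transform(end)
    return [day for day in sorted(partitions) if lo <= day <= hi]


utc = timezone.utc
rows = [
    {"id": 1, "event_ts": datetime(2024, 1, 1, 23, 59, tzinfo=utc)},
    {"id": 2, "event_ts": datetime(2024, 1, 2, 0, 1, tzinfo=utc)},
    {"id": 3, "event_ts": datetime(2024, 1, 5, 12, 0, tzinfo=utc)},
]
partitions = group_by_day(rows)

# The query only mentions raw timestamps; the partition values stay hidden.
selected = prune(
    partitions,
    datetime(2024, 1, 2, tzinfo=utc),
    datetime(2024, 1, 3, tzinfo=utc),
)
print(selected)
```

The point of the sketch is that neither the writer nor the reader ever names a partition column: both work with `event_ts`, and the transform is applied on their behalf.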

What This Cheat Sheet Covers

This topic spans 23 focused tables and 201 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Core Table Format Concepts
Table 2: ACID Transaction Semantics
Table 3: Partition Transforms (Hidden Partitioning)
Table 4: Schema Evolution Operations
Table 5: Time Travel and Snapshots
Table 6: Row-Level Modifications (V2+)
Table 7: Catalog Types
Table 8: Engine Integrations
Table 9: Table Maintenance Procedures
Table 10: Performance Optimization
Table 11: Branching and Tagging
Table 12: Data Types
Table 13: Format Versions (V1, V2, V3, V4)
Table 14: Metadata System Tables
Table 15: Spark SQL Procedures
Table 16: Physical File Formats and Storage
Table 17: Streaming and CDC Patterns
Table 18: Security and Access Control
Table 19: Cloud Platform Configurations
Table 20: Python and Language APIs
Table 21: Comparison with Other Table Formats
Table 22: Production Best Practices
Table 23: Anti-Patterns to Avoid

Table 1: Core Table Format Concepts

Iceberg's metadata-driven architecture is its defining advantage over directory-based formats like Hive. Understanding the three-layer metadata tree—and how atomic commits, snapshots, and hidden partitioning interact—is the foundation for everything else.

Concept	Example	Description
Table Format	Defines schema, partitioning, snapshots for data files	• Open specification for organizing raw data files (Parquet/Avro/ORC) into logical tables with ACID semantics • separates metadata from data storage
Metadata Layer	`metadata/v1.metadata.json`	Three metadata layers above the data files: metadata JSON (schema, snapshots) → manifest lists (one per snapshot) → manifest files (per-file statistics) → data files
Snapshot	`snapshot_id=8744736658442914487`	• Immutable point-in-time view of table • created on every commit • enables time travel and rollback
Manifest List	`snap-8744736658442914487-1-abc123.avro`	• Avro file containing references to all manifest files for a snapshot • tracks partition ranges and file counts per manifest
Manifest File	`abc123-m0.avro`	Avro file tracking subset of data files with per-file statistics (path, partition, record count, min/max, null counts)
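
The relationship between the rows above can be sketched as a small in-memory model. This is a toy illustration and not Iceberg's implementation or API: the class names, the `commit_append` and `plan_scan` methods, the file paths, and the single `id` column with min/max bounds are all assumptions made for the example. It shows each commit producing a new immutable snapshot, and scan planning walking snapshot → manifests → data files while skipping files whose statistics rule them out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class DataFile:
    path: str
    record_count: int
    min_id: int  # lower bound of a hypothetical "id" column
    max_id: int  # upper bound of the same column


@dataclass(frozen=True)
class Manifest:
    path: str
    files: Tuple[DataFile, ...]


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    manifest_list: Tuple[Manifest, ...]  # stands in for the manifest list file


@dataclass
class TableMetadata:
    snapshots: Dict[int, Snapshot] = field(default_factory=dict)
    current_snapshot_id: Optional[int] = None

    def commit_append(self, snapshot_id: int, manifest: Manifest) -> None:
        """Create a new snapshot that reuses the parent's manifests and adds one."""
        parent = self.snapshots.get(self.current_snapshot_id)
        inherited = parent.manifest_list if parent else ()
        self.snapshots[snapshot_id] = Snapshot(snapshot_id, inherited + (manifest,))
        self.current_snapshot_id = snapshot_id  # earlier snapshots stay readable

    def plan_scan(self, id_value: int, snapshot_id: Optional[int] = None) -> List[str]:
        """Return data files that may contain id_value, using min/max statistics."""
        chosen = self.current_snapshot_id if snapshot_id is None else snapshot_id
        snapshot = self.snapshots[chosen]
        return [
            f.path
            for manifest in snapshot.manifest_list
            for f in manifest.files
            if f.min_id <= id_value <= f.max_id
        ]


table = TableMetadata()
table.commit_append(1, Manifest("m0.avro", (DataFile("data/a.parquet", 100, 1, 100),)))
table.commit_append(2, Manifest("m1.avro", (DataFile("data/b.parquet", 100, 101, 200),)))

print(table.plan_scan(150))                 # ['data/b.parquet']
print(table.plan_scan(150, snapshot_id=1))  # [] (time travel: b.parquet did not exist yet)
```

Nothing in `plan_scan` lists a directory; it only follows references held in metadata, which is the property the three-layer tree is designed to provide.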
