Apache Iceberg Open Table Format Cheat Sheet


Updated 2026-05-15

Apache Iceberg is an open-source table format for large analytic datasets on cloud object storage (S3, ADLS, GCS). It was originally developed at Netflix and is now an Apache Top-Level Project. Iceberg brings ACID transactions, snapshot isolation, schema evolution, and time travel to data lakes by layering a metadata-driven transactional model over immutable data files. It decouples the physical layout (Parquet/ORC/Avro files) from the logical table structure, which enables hidden partitioning, partition evolution, and multi-engine interoperability (Spark, Flink, Trino, Snowflake, BigQuery, Hive, Presto, Dremio, Athena, EMR). One key architectural principle: every committed write creates a new snapshot, an immutable point-in-time view of the table described by a manifest list. Snapshots are what make time travel, versioning, and concurrent writes that do not block readers possible.

What This Cheat Sheet Covers

This topic spans 16 focused tables and 102 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Core Concepts and Architecture
Table 2: Hidden Partitioning and Partition Transforms
Table 3: Schema Evolution
Table 4: Time Travel and Snapshots
Table 5: Row-Level Operations and Deletes
Table 6: Branching, Tagging, and WAP
Table 7: Catalogs and Metadata Management
Table 8: Compaction and Maintenance
Table 9: Format Versions and Spec Evolution
Table 10: Query Optimization and Performance
Table 11: Engine Integrations
Table 12: Write Modes and Distribution
Table 13: Stored Procedures and Actions
Table 14: Python and Ecosystem Libraries
Table 15: Metadata Tables and Introspection
Table 16: Puffin Statistics and Advanced Metadata

Table 1: Core Concepts and Architecture

To understand Iceberg you have to understand its layered metadata, because that layer is what turns a pile of Parquet files in object storage into a real ACID table. The terms snapshot, manifest list, manifest, metadata file, and catalog each name one rung of that hierarchy. The optimistic-concurrency model follows from it: a writer prepares new immutable metadata and then atomically swaps the table's current pointer in the catalog, so readers keep using the snapshot they started with and are never blocked.
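The commit model described above can be sketched in a few lines of plain Python. This is a toy model, not the Iceberg API: `ToyCatalog`, `Snapshot`, `CommitConflict`, and `append_files` are hypothetical names, snapshot IDs are small sequential integers for readability, and the compare-and-swap is an ordinary in-memory check rather than an atomic catalog operation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Immutable view of the table: the full set of data files at commit time."""
    snapshot_id: int
    parent_id: Optional[int]
    data_files: Tuple[str, ...]


class CommitConflict(Exception):
    """Raised when the table pointer moved since the writer read it."""


@dataclass
class ToyCatalog:
    """Hypothetical catalog: one 'current snapshot' pointer plus snapshot history."""
    snapshots: Dict[int, Snapshot] = field(default_factory=dict)
    current_id: Optional[int] = None

    def load(self, snapshot_id: Optional[int] = None) -> Optional[Snapshot]:
        # With no argument, return the current snapshot; with an id, time travel.
        sid = self.current_id if snapshot_id is None else snapshot_id
        return None if sid is None else self.snapshots[sid]

    def commit(self, expected_id: Optional[int], new: Snapshot) -> None:
        # Compare-and-swap: succeed only if nobody committed since we read.
        if self.current_id != expected_id:
            raise CommitConflict(f"expected {expected_id}, found {self.current_id}")
        self.snapshots[new.snapshot_id] = new
        self.current_id = new.snapshot_id


def append_files(catalog: ToyCatalog, new_files: List[str], max_retries: int = 3) -> Snapshot:
    """Optimistic append: build a new snapshot, try to swap it in, retry on conflict."""
    for _ in range(max_retries):
        base = catalog.load()
        base_id = base.snapshot_id if base else None
        base_files = base.data_files if base else ()
        candidate = Snapshot(
            snapshot_id=(base_id or 0) + 1,  # toy id scheme, assumed for readability
            parent_id=base_id,
            data_files=base_files + tuple(new_files),
        )
        try:
            catalog.commit(base_id, candidate)
            return candidate
        except CommitConflict:
            continue  # re-read the latest state and try again
    raise CommitConflict("gave up after retries")


catalog = ToyCatalog()
first = append_files(catalog, ["data-001.parquet"])
pinned = catalog.load()  # a reader pins the current snapshot
second = append_files(catalog, ["data-002.parquet"])

print(pinned.data_files)                           # the pinned reader's view
print(catalog.load().data_files)                   # what a new reader sees
print(catalog.load(first.snapshot_id).data_files)  # time travel by snapshot id
```

Because a snapshot is never modified after commit, the pinned reader's view stays stable while later writes land, and a writer holding a stale base ID fails the swap instead of overwriting someone else's commit.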

Concept: Table Format
Example: Iceberg is a format spec defining how to organize data files, metadata files, manifest lists, manifests, and snapshots into a logical table
Description:
• Not a storage engine or query engine, but a metadata layer that sits on top of Parquet/ORC/Avro files stored in object storage
• Provides table semantics, schema, partition layout, and consistent snapshots

Concept: Snapshot
Example: Each write creates snapshot 5237498123985123 with manifest list s3://bucket/snap-5237498123985123.avro
Description:
• Immutable point-in-time view of a table
• Captures the state of all data files at commit time
• Every transaction produces a new snapshot, enabling time travel, rollback, and ACID guarantees

Concept: Manifest List
Example: Avro file snap-123.avro references manifests: manifest-1.avro, manifest-2.avro, ...
Description:
• Top-level metadata file for a snapshot
• Lists every manifest file in the snapshot, with per-manifest file and row counts and partition value bounds
• Lets the planner skip whole manifests whose partition ranges cannot match the query, without opening them

Concept: Manifest File
Example: manifest-1.avro tracks 50 data files: data-001.parquet, data-002.parquet, ... with column stats (min, max, null count, value count)
Description:
• Tracks individual data files and their file-level statistics (bounds, null counts, row counts)
• Reused across snapshots to avoid rewriting unchanged metadata
• Critical for predicate pushdown and file pruning
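The two pruning levels in the table can be illustrated with a small, self-contained sketch. It is a simplified model rather than Iceberg's planner or file layout: the `day` partition column, the `amount` column, and the `DataFile`, `ManifestEntry`, and `plan_scan` names are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DataFile:
    path: str
    day: int         # partition value (hypothetical identity partition on `day`)
    amount_min: int  # lower bound of a hypothetical `amount` column
    amount_max: int  # upper bound of the same column


@dataclass(frozen=True)
class ManifestEntry:
    """One row of a manifest list: a manifest path plus its partition summary."""
    path: str
    day_lower: int
    day_upper: int
    files: Tuple[DataFile, ...]  # stands in for the contents of the manifest file


def plan_scan(manifest_list: List[ManifestEntry], day: int, min_amount: int):
    """Find files that may hold rows where day == `day` and amount >= `min_amount`."""
    opened, selected = [], []
    for manifest in manifest_list:
        # Level 1: the manifest list alone rules out manifests by partition range.
        if not (manifest.day_lower <= day <= manifest.day_upper):
            continue
        opened.append(manifest.path)
        for f in manifest.files:
            # Level 2: per-file partition value and column bounds rule out files.
            if f.day == day and f.amount_max >= min_amount:
                selected.append(f.path)
    return opened, selected


manifest_list = [
    ManifestEntry("manifest-1.avro", 1, 2, (
        DataFile("data-001.parquet", 1, 5, 50),
        DataFile("data-002.parquet", 2, 10, 90),
    )),
    ManifestEntry("manifest-2.avro", 3, 4, (
        DataFile("data-003.parquet", 3, 1, 20),
        DataFile("data-004.parquet", 3, 60, 200),
        DataFile("data-005.parquet", 4, 70, 80),
    )),
]

opened, selected = plan_scan(manifest_list, day=3, min_amount=50)
print(opened)    # manifests that had to be read
print(selected)  # data files that survive both pruning levels
```

In this toy run only one of the two manifests is opened and only one of five data files is selected, which is the effect the manifest-list summaries and manifest column statistics are meant to have on a real scan.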
