Apache Iceberg Open Table Format Cheat Sheet

Updated 2026-05-15

Apache Iceberg is an open-source table format designed for large analytic datasets on cloud object storage (S3, ADLS, GCS), developed originally at Netflix and now an Apache Top-Level Project. It brings ACID transactions, snapshot isolation, schema evolution, and time travel to data lakes by layering a metadata-driven transactional model over immutable data files. Iceberg decouples the physical layout (Parquet/ORC/Avro files) from the logical table structure, enabling features like hidden partitioning, partition evolution, and multi-engine interoperability (Spark, Flink, Trino, Snowflake, BigQuery, Hive, Presto, Dremio, Athena, EMR). One key architectural principle: every write creates a new snapshot, an immutable point-in-time view of the table captured in a manifest list, enabling time travel, versioning, and zero-downtime concurrent writes.
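The snapshot principle above can be illustrated with a small in-memory model. This is only a sketch under simplifying assumptions: `ToyTable`, `Snapshot`, and their methods are hypothetical names invented for illustration, not the Iceberg API, and real snapshots point to a manifest list rather than holding file paths directly.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Immutable point-in-time view (toy stand-in for an Iceberg snapshot)."""
    snapshot_id: int
    parent_id: Optional[int]
    data_files: Tuple[str, ...]  # every live data file at commit time


class ToyTable:
    """Hypothetical model: each write adds a snapshot, none is modified."""

    def __init__(self) -> None:
        self.snapshots: Dict[int, Snapshot] = {}
        self.current_id: Optional[int] = None

    def _commit(self, files: Tuple[str, ...]) -> Snapshot:
        # Snapshots are never removed here, so the count gives a fresh id.
        new_id = len(self.snapshots) + 1
        snap = Snapshot(new_id, self.current_id, files)
        self.snapshots[new_id] = snap
        self.current_id = new_id
        return snap

    def scan(self, snapshot_id: Optional[int] = None) -> Tuple[str, ...]:
        """Return the files visible at a snapshot (default: current)."""
        sid = self.current_id if snapshot_id is None else snapshot_id
        return () if sid is None else self.snapshots[sid].data_files

    def append(self, *new_files: str) -> Snapshot:
        return self._commit(self.scan() + tuple(new_files))

    def delete(self, *old_files: str) -> Snapshot:
        remaining = tuple(f for f in self.scan() if f not in old_files)
        return self._commit(remaining)

    def rollback_to(self, snapshot_id: int) -> None:
        """Move the current pointer back; older snapshots stay intact."""
        if snapshot_id not in self.snapshots:
            raise KeyError(snapshot_id)
        self.current_id = snapshot_id


table = ToyTable()
s1 = table.append("data-001.parquet")
s2 = table.append("data-002.parquet")
s3 = table.delete("data-001.parquet")

current_view = table.scan()               # only data-002.parquet
first_view = table.scan(s1.snapshot_id)   # time travel: only data-001.parquet
```

Because earlier snapshots are never mutated, reading an old `snapshot_id` or rolling back only changes which snapshot is consulted, which is the property the time travel and rollback features rely on.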

What This Cheat Sheet Covers

This topic spans 16 focused tables and 102 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Core Concepts and Architecture
Table 2: Hidden Partitioning and Partition Transforms
Table 3: Schema Evolution
Table 4: Time Travel and Snapshots
Table 5: Row-Level Operations and Deletes
Table 6: Branching, Tagging, and WAP
Table 7: Catalogs and Metadata Management
Table 8: Compaction and Maintenance
Table 9: Format Versions and Spec Evolution
Table 10: Query Optimization and Performance
Table 11: Engine Integrations
Table 12: Write Modes and Distribution
Table 13: Stored Procedures and Actions
Table 14: Python and Ecosystem Libraries
Table 15: Metadata Tables and Introspection
Table 16: Puffin Statistics and Advanced Metadata

Table 1: Core Concepts and Architecture

| Concept | Example | Description |
| --- | --- | --- |
| Table Format | Iceberg is a format spec defining how to organize data files, metadata files, manifest lists, manifests, and snapshots into a logical table | Not a storage engine or query engine, but a metadata layer that sits on top of Parquet/ORC/Avro files in object storage; provides table semantics, schema, partition layout, and consistent snapshots |
| Snapshot | Each write creates a snapshot, e.g. 5237498123985123, with manifest list s3://bucket/snap-5237498123985123.avro | Immutable point-in-time view of a table; captures the state of all data files at commit time; every transaction produces a new snapshot, enabling time travel, rollback, and ACID guarantees |
| Manifest List | Avro file snap-123.avro references manifests: manifest-1.avro, manifest-2.avro, ... | Top-level metadata file for a snapshot; lists every manifest file with per-manifest summaries (file counts, row counts, partition value bounds); lets the planner skip whole manifests at planning time without opening them |
| Manifest File | manifest-1.avro tracks 50 data files: data-001.parquet, data-002.parquet, ... with column stats (lower/upper bounds, null counts, value counts) | Tracks individual data files and their file-level statistics (bounds, null counts, row counts); reused across snapshots to avoid rewriting unchanged metadata; critical for predicate pushdown and file pruning |
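The two pruning levels described in the Manifest List and Manifest File rows can be sketched as follows. This is a minimal illustration, not Iceberg's planner: the class names, the `plan_files` function, and the integer column `event_day` are all hypothetical, and for simplicity the same integer range stands in for both the partition summary in the manifest list and the column bounds in the manifest.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DataFile:
    """Toy manifest entry: one data file plus bounds for `event_day`."""
    path: str
    lower: int
    upper: int


@dataclass(frozen=True)
class ManifestSummary:
    """Toy manifest-list entry: a manifest path plus its value range."""
    manifest_path: str
    lower: int
    upper: int
    files: Tuple[DataFile, ...]  # stands in for the manifest's contents


def overlaps(a_lo: int, a_hi: int, b_lo: int, b_hi: int) -> bool:
    """True when the closed ranges [a_lo, a_hi] and [b_lo, b_hi] intersect."""
    return a_lo <= b_hi and b_lo <= a_hi


def plan_files(
    manifest_list: List[ManifestSummary], lo: int, hi: int
) -> Tuple[List[str], int]:
    """Return (candidate data files, manifests opened) for lo <= event_day <= hi."""
    selected: List[str] = []
    opened = 0
    for summary in manifest_list:
        # Level 1: skip the whole manifest using only manifest-list summaries.
        if not overlaps(summary.lower, summary.upper, lo, hi):
            continue
        opened += 1
        # Level 2: skip individual files using per-file bounds in the manifest.
        for data_file in summary.files:
            if overlaps(data_file.lower, data_file.upper, lo, hi):
                selected.append(data_file.path)
    return selected, opened


MANIFEST_LIST = [
    ManifestSummary("manifest-1.avro", 1, 10, (
        DataFile("data-001.parquet", 1, 5),
        DataFile("data-002.parquet", 6, 10),
    )),
    ManifestSummary("manifest-2.avro", 11, 20, (
        DataFile("data-003.parquet", 11, 15),
        DataFile("data-004.parquet", 16, 20),
    )),
]

files, manifests_opened = plan_files(MANIFEST_LIST, 12, 14)
# files == ["data-003.parquet"], manifests_opened == 1
```

In this toy run the query range 12 to 14 never touches manifest-1.avro, and within manifest-2.avro only one of the two files survives, which mirrors how bounds in metadata cut down the files an engine has to read.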
