Data Lakehouse Cheat Sheet

Updated 2026-04-21


A data lakehouse is a data architecture that combines the scalability of data lakes with the reliability of data warehouses by layering open table formats (Apache Iceberg, Delta Lake, Apache Hudi, Apache Paimon) on top of low-cost cloud object storage. The architecture enforces ACID transactions, schema evolution, and governance while keeping compute and storage decoupled, so SQL analytics, real-time streaming, and ML workloads can operate on a single copy of data without duplication. By 2026, the lakehouse model has matured from experimental to mainstream: the Iceberg REST Catalog has become the vendor-neutral standard, Iceberg V3 adds deletion vectors and row lineage, Delta Lake 4.0 brings Liquid Clustering and Coordinated Commits, and newer entrants are emerging, with DuckLake challenging traditional metadata architectures and Lance serving AI-native workloads.

What This Cheat Sheet Covers

This topic spans 19 focused tables and 158 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Core Concepts
Table 2: Open Table Formats
Table 3: File Formats
Table 4: Metadata Catalogs
Table 5: Architecture Patterns
Table 6: Storage Optimization Techniques
Table 7: ACID Transaction Features
Table 8: Schema Management
Table 9: Data Ingestion Patterns
Table 10: Streaming Integration
Table 11: Query Engines
Table 12: Query Optimization
Table 13: Table Maintenance Operations
Table 14: Governance & Security
Table 15: Cross-Format Interoperability
Table 16: Lakehouse Platforms
Table 17: Python & Lakehouse Tools
Table 18: Use Cases & Workloads
Table 19: Benefits & Challenges

Table 1: Core Concepts

Concept	Example	Description
Data Lakehouse	Databricks Lakehouse Platform	Unified architecture combining data lake flexibility with data warehouse reliability — supports all data types, ACID transactions, and BI/ML workloads on one platform.
Open Table Format	Iceberg, Delta Lake, Hudi, Paimon	Metadata layer atop object storage providing database-like capabilities — transforms raw files into transactional, versioned, queryable tables.
Compute-Storage Separation	S3 storage + Spark/Trino compute	Decoupling storage (cheap object store) from compute (elastic engines) — multiple engines query the same data independently without duplication.
File-Level Tracking	Manifests listing every data file	Modern table formats track individual files in metadata rather than scanning directories — enables atomic commits, fast planning, and time travel.
ACID Transactions	`MERGE INTO users USING updates ...`	Atomicity, consistency, isolation, durability guarantees for concurrent reads/writes — implemented via transaction logs and optimistic concurrency.
Medallion Architecture	Bronze → Silver → Gold	Data design pattern organizing a lakehouse into Bronze (raw), Silver (cleansed), and Gold (curated) layers; data quality improves incrementally from ingestion to analytics.
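
The File-Level Tracking and ACID Transactions rows above can be illustrated with a toy model. The sketch below is plain Python and is not the API of Iceberg, Delta Lake, Hudi, or Paimon; `ToyTable`, `Snapshot`, `CommitConflict`, and the file names are hypothetical names chosen for illustration. It assumes a simplified rule in which any commit based on a stale version is rejected, which is stricter than the conflict checks real table formats perform.

```python
from dataclasses import dataclass, field


class CommitConflict(Exception):
    """Raised when another writer committed after this writer read the table."""


@dataclass(frozen=True)
class Snapshot:
    version: int
    files: frozenset  # every data file visible at this version


@dataclass
class ToyTable:
    # Append-only log of snapshots; list index equals the version number.
    log: list = field(default_factory=lambda: [Snapshot(0, frozenset())])

    def current(self) -> Snapshot:
        return self.log[-1]

    def snapshot_at(self, version: int) -> Snapshot:
        # Time travel: old snapshots are never mutated, only superseded.
        return self.log[version]

    def commit(self, base_version: int, added=(), removed=()) -> Snapshot:
        head = self.current()
        # Optimistic concurrency: succeed only if nobody committed since
        # this writer read the table (simplified: any stale base fails).
        if head.version != base_version:
            raise CommitConflict(
                f"based on v{base_version}, but table is at v{head.version}"
            )
        files = (head.files - frozenset(removed)) | frozenset(added)
        snap = Snapshot(head.version + 1, files)
        self.log.append(snap)  # stands in for an atomic metadata swap
        return snap


table = ToyTable()
table.commit(0, added=["part-000.parquet", "part-001.parquet"])

# Two writers both read version 1, then race to commit.
base = table.current().version
table.commit(base, added=["part-002.parquet"])  # writer A commits first

try:
    table.commit(base, removed=["part-000.parquet"])  # writer B is stale
    conflict = False
except CommitConflict:
    conflict = True
    # Writer B re-reads the latest version and retries.
    table.commit(table.current().version, removed=["part-000.parquet"])

print(sorted(table.current().files))
print(sorted(table.snapshot_at(1).files))
```

Because each snapshot lists its files explicitly, readers never need to scan a directory, and a reader pinned to an older version keeps seeing a consistent file set while writers commit new ones. In a real lakehouse the `log.append` step would be an atomic operation provided by the catalog or the object store rather than an in-memory list.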
