A data lakehouse is a modern data architecture that unifies the scalability of data lakes with the reliability of data warehouses by layering open table formats (Apache Iceberg, Delta Lake, Apache Hudi, Apache Paimon) on top of low-cost cloud object storage. The architecture enforces ACID transactions, schema evolution, and governance while keeping compute and storage fully decoupled, so SQL analytics, real-time streaming, and ML workloads can operate on a single copy of data without duplication. By 2026, the lakehouse model has matured from experimental to mainstream: the Iceberg REST Catalog has become the vendor-neutral standard, Iceberg V3 adds deletion vectors and row lineage, Delta Lake 4.0 brings Liquid Clustering and Coordinated Commits, and newer entrants like DuckLake and Lance are challenging traditional metadata architectures and serving AI-native workloads.
What This Cheat Sheet Covers
This topic spans 19 focused tables and 158 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.
Table 1: Core Concepts
| Concept | Example | Description |
|---|---|---|
| Data Lakehouse | Databricks Lakehouse Platform | Unified architecture combining data lake flexibility with data warehouse reliability; supports all data types, ACID transactions, and BI/ML workloads on one platform. |
| Open Table Format | Iceberg, Delta Lake, Hudi, Paimon | Metadata layer atop object storage that provides database-like capabilities; turns raw files into transactional, versioned, queryable tables. |
| Storage-Compute Separation | S3 storage + Spark/Trino compute | Decouples storage (cheap object store) from compute (elastic engines); multiple engines query the same data independently without duplication. |
| File-Level Tracking | Manifests listing every data file | Modern table formats track individual files in metadata rather than scanning directories; this enables atomic commits, fast planning, and time travel. |
| ACID Transactions | `MERGE INTO users USING updates ...` | Atomicity, consistency, isolation, and durability guarantees for concurrent reads and writes; implemented via transaction logs and optimistic concurrency. |
| Medallion Architecture | Bronze → Silver → Gold | Design pattern organizing the lakehouse into Bronze (raw), Silver (cleansed), and Gold (curated) layers; data quality improves incrementally from ingestion to analytics. |
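
The file-level tracking and optimistic-concurrency rows above can be illustrated with a small in-memory model. This is a minimal sketch, not how Iceberg, Delta Lake, or any other format is actually implemented: `ToyTable`, `Snapshot`, and `CommitConflict` are hypothetical names invented for this example, and the Python list stands in for a transaction log that real formats persist in object storage and advance with an atomic operation.

```python
from dataclasses import dataclass, field


class CommitConflict(Exception):
    """Raised when another writer committed after this writer's base version."""


@dataclass(frozen=True)
class Snapshot:
    version: int
    files: frozenset  # paths of the data files visible in this version


@dataclass
class ToyTable:
    # Append-only log of immutable snapshots; list index == version number.
    # (Hypothetical stand-in for a transaction log kept in object storage.)
    log: list = field(default_factory=lambda: [Snapshot(0, frozenset())])

    def current(self) -> Snapshot:
        return self.log[-1]

    def snapshot_at(self, version: int) -> Snapshot:
        # Time travel: old snapshots stay readable because they are never mutated.
        return self.log[version]

    def commit(self, base_version: int, added=(), removed=()) -> Snapshot:
        # Optimistic concurrency: the writer prepared its changes against
        # base_version without taking a lock. The commit is accepted only if
        # no other commit has landed since then; otherwise the writer retries.
        latest = self.current()
        if base_version != latest.version:
            raise CommitConflict(
                f"based on v{base_version}, but table is at v{latest.version}"
            )
        files = (latest.files - frozenset(removed)) | frozenset(added)
        snapshot = Snapshot(latest.version + 1, files)
        self.log.append(snapshot)  # the single "atomic" step in this toy model
        return snapshot


if __name__ == "__main__":
    table = ToyTable()
    table.commit(0, added=["part-0001.parquet"])

    # Two writers both start from the same base version.
    base = table.current().version
    table.commit(base, added=["part-0002.parquet"])  # first writer wins
    try:
        table.commit(base, added=["part-0003.parquet"])  # second writer is stale
    except CommitConflict:
        # Retry against the latest version instead of overwriting it.
        table.commit(table.current().version, added=["part-0003.parquet"])

    print(sorted(table.current().files))
    print(sorted(table.snapshot_at(1).files))
```

Because every snapshot lists its files explicitly, a reader never has to scan a directory to plan a query, and a failed commit leaves no partially visible state: the stale writer simply retries on top of the newest version.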