Delta Lake is an open-source storage framework that brings ACID transactions, scalable metadata handling, and time travel to cloud data lakes. Built on top of Parquet, it provides a transactional layer through an append-only commit log (_delta_log) that records every change, enabling reliable concurrent writes and schema evolution without sacrificing performance. Originally developed by Databricks and now a Linux Foundation project, Delta Lake has reached version 4.2.0 (on Apache Spark 4.1.0) and serves as the foundation for modern lakehouse architectures across AWS S3, Azure ADLS, and Google Cloud Storage.
What This Cheat Sheet Covers
This cheat sheet spans 17 focused tables and 129 indexed concepts. Below is a complete table-by-table outline, from foundational concepts through advanced details.
Table 1: Core Concepts
| Concept | Example | Description |
|---|---|---|
| Transaction log | _delta_log/00000000000000000000.json | • Append-only JSON log that records every table change • Each commit creates a new, sequentially numbered log file, enabling ACID guarantees and time travel |
| ACID transactions | Multiple writers commit simultaneously | • Atomicity, Consistency, Isolation, Durability via optimistic concurrency control • Failed transactions roll back without affecting committed data |
| Parquet data files | part-00000-<uuid>.snappy.parquet | • Columnar storage format containing the actual data • Delta adds a metadata layer on top for transactions and versioning |
| Checkpoints | _delta_log/00000000000000000010.checkpoint.parquet | • Parquet snapshot of table state written every 10 commits (default) • Accelerates metadata reads by avoiding replay of thousands of JSON log entries • v2 checkpoints available via delta.checkpointPolicy = 'v2' |
| Protocol versions | minReaderVersion=3, minWriterVersion=7 | • Protocol defines the minimum client capabilities required to read or write a table • Higher versions unlock features; the table features model supports granular opt-in |
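
The log and checkpoint rows above can be made concrete with a small sketch. It assumes only the file naming shown in the examples (20-digit, zero-padded version numbers, with classic single-file checkpoints) and works out which files a reader would need to rebuild a given table version. The helper names `commit_file`, `checkpoint_file`, and `files_to_read` are hypothetical, not part of any Delta Lake API, and real readers also handle details this sketch ignores, such as multi-part and v2 checkpoints.

```python
import re
from typing import List

# File name patterns taken from the table's examples (classic checkpoints only).
COMMIT_RE = re.compile(r"^(\d{20})\.json$")
CHECKPOINT_RE = re.compile(r"^(\d{20})\.checkpoint\.parquet$")


def commit_file(version: int) -> str:
    """Hypothetical helper: JSON commit file name for a table version."""
    return f"{version:020d}.json"


def checkpoint_file(version: int) -> str:
    """Hypothetical helper: classic checkpoint file name for a table version."""
    return f"{version:020d}.checkpoint.parquet"


def files_to_read(log_files: List[str], target_version: int) -> List[str]:
    """Return the log files needed to rebuild `target_version`.

    Strategy: start from the newest checkpoint at or below the target,
    then replay every JSON commit after it, in order.
    """
    commits = set()
    checkpoints = set()
    for name in log_files:
        commit = COMMIT_RE.match(name)
        checkpoint = CHECKPOINT_RE.match(name)
        if commit:
            commits.add(int(commit.group(1)))
        elif checkpoint:
            checkpoints.add(int(checkpoint.group(1)))

    usable = [v for v in checkpoints if v <= target_version]
    start = max(usable) if usable else None

    needed = [] if start is None else [checkpoint_file(start)]
    first_commit = 0 if start is None else start + 1
    for version in range(first_commit, target_version + 1):
        if version not in commits:
            # A gap in the sequence means the version cannot be reconstructed.
            raise ValueError(f"missing commit file for version {version}")
        needed.append(commit_file(version))
    return needed
```

With commits 0 through 12 and a checkpoint at version 10, reading version 12 needs only the checkpoint plus commits 11 and 12, rather than all thirteen JSON files. This is the saving the checkpoint row describes.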