Apache Hudi (Hadoop Upserts Deletes and Incrementals) is an open-source data lakehouse table format that provides ACID transactions, record-level updates/deletes, incremental processing, and streaming ingestion on distributed file systems (HDFS, S3, GCS, Azure ADLS). It sits atop Parquet/Avro files and integrates with Spark, Flink, Trino, Hive, Presto, and cloud-native catalog services (Glue, Unity Catalog, Polaris). This cheat sheet covers all core concepts from table types and write operations through indexing, compaction, clustering, schema evolution, and multi-engine integrations.
What This Cheat Sheet Covers
This topic spans 17 focused tables and 284 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.
Table 1: Core Architecture Concepts
Hudi organises data into file groups within partitions. Each file group contains ordered file slices consisting of a base file (Parquet) and optional delta/log files (Avro or Parquet). The timeline, a log of all table actions stored in .hoodie/, provides MVCC snapshot isolation so readers never see partial writes. The metadata table (a hidden Hudi table itself) replaces expensive file-listing calls with indexed lookups.
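
The relationship between file groups, file slices, and the timeline can be sketched as a toy model. This is an illustration only, not Hudi's API: the class names `FileGroup` and `FileSlice`, the `snapshot` method, and the short instant strings are all hypothetical, and the real timeline tracks more states than the single "completed" set assumed here.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class FileSlice:
    """One version of a file group: a base file plus the log files appended after it."""
    base_instant: str                      # instant at which the base file was written
    base_file: str                         # Parquet base file name
    log_files: List[Tuple[str, str]] = field(default_factory=list)  # (instant, log file name)


@dataclass
class FileGroup:
    file_id: str
    slices: List[FileSlice] = field(default_factory=list)

    def snapshot(self, as_of: str, completed: Set[str]) -> Optional[FileSlice]:
        """Return the latest slice visible at `as_of`, ignoring uncommitted instants.

        Instants are assumed to be equal-length, sortable strings, so plain
        string comparison orders them chronologically.
        """
        visible = [
            s for s in self.slices
            if s.base_instant <= as_of and s.base_instant in completed
        ]
        if not visible:
            return None
        latest = max(visible, key=lambda s: s.base_instant)
        # Only log files from completed instants are merged into the view.
        logs = [
            (t, name) for t, name in latest.log_files
            if t <= as_of and t in completed
        ]
        return FileSlice(latest.base_instant, latest.base_file, logs)


# Instant "005" is still in flight, so its log file must stay invisible.
completed = {"001", "002", "003", "004"}
group = FileGroup("fg-1", [
    FileSlice("001", "fg-1_001.parquet", [("002", ".fg-1_001.log.1")]),
    FileSlice("003", "fg-1_003.parquet",
              [("004", ".fg-1_003.log.1"), ("005", ".fg-1_003.log.2")]),
])
current = group.snapshot(as_of="005", completed=completed)
earlier = group.snapshot(as_of="002", completed=completed)
```

In this sketch `current` resolves to the slice based at instant 003 with only the 004 log file attached, while `earlier` resolves to the older slice. That is the sense in which readers never see partial writes: visibility is decided by the timeline, not by which files happen to exist on storage.
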
| Concept | Example | Description |
|---|---|---|
| File group | `partition/.hoodie/<fileId>/` | • Logical unit for a set of records sharing the same fileId • each file group maps to one record-key range |
| File slice | base file + 0..N log files at one instant | • Versioned snapshot of a file group at a commit instant • latest slice = current data |
| Base file | `part-0000_<fileId>_<commitTime>.parquet` | • Columnar Parquet file containing full row data • written at compaction (MOR) or every commit (COW) |
| Log file | `.<fileId>_<commitTime>.log.1` | • Append-only file storing delta records (inserts, updates, deletes) for MOR tables • read-merged at query time |
| Metadata table | `.hoodie/metadata/` | • Hidden Hudi MOR table storing files, column_stats, partition_stats, bloom_filters, rli indexes • eliminates O(N) file listings |
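
As a rough illustration of how the metadata table's indexes are switched on, the sketch below builds a dictionary of writer options. The helper `metadata_table_options` is hypothetical. The option keys are the ones I understand Hudi's configuration reference to list for 0.14-era releases; names and defaults have changed between versions, so treat them as assumptions and check them against the release in use.

```python
from typing import Dict


def metadata_table_options(column_stats: bool = True,
                           bloom_filters: bool = False,
                           record_index: bool = False) -> Dict[str, str]:
    """Build writer options that enable the metadata table and selected indexes.

    Hypothetical helper; the keys are assumed from Hudi's configuration
    reference and may differ in other releases.
    """
    def flag(on: bool) -> str:
        return "true" if on else "false"

    return {
        # The metadata table itself (the `files` index replaces file listings).
        "hoodie.metadata.enable": "true",
        # Optional index partitions inside the metadata table.
        "hoodie.metadata.index.column.stats.enable": flag(column_stats),
        "hoodie.metadata.index.bloom.filter.enable": flag(bloom_filters),
        "hoodie.metadata.record.index.enable": flag(record_index),
    }


opts = metadata_table_options(record_index=True)
# With PySpark and a Hudi bundle on the classpath, these options would
# typically be passed to a write such as:
#   df.write.format("hudi").options(**opts).mode("append").save(base_path)
# (not executed here; `df` and `base_path` are placeholders)
```

Keeping the options in one helper makes it easier to see which index partitions a given writer is expected to maintain, since every enabled index adds work to each commit.
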