Apache Hudi Cheat Sheet

Apache Hudi (Hadoop Upserts Deletes and Incrementals) is an open-source data lakehouse table format that provides ACID transactions, record-level updates/deletes, incremental processing, and streaming ingestion on distributed file systems (HDFS, S3, GCS, Azure ADLS). It sits atop Parquet/Avro files and integrates with Spark, Flink, Trino, Hive, Presto, and cloud-native catalog services (Glue, Unity Catalog, Polaris). This cheat sheet covers all core concepts from table types and write operations through indexing, compaction, clustering, schema evolution, and multi-engine integrations.

What This Cheat Sheet Covers

This topic spans 17 focused tables and 284 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Core Architecture Concepts
Table 2: Table Types (Copy-on-Write vs Merge-on-Read)
Table 3: Query Types
Table 4: Write Operations
Table 5: Record Keys and Key Generators
Table 6: Indexing Options
Table 7: Timeline and Instant States
Table 8: Compaction
Table 9: Clustering
Table 10: Cleaning
Table 11: Schema Evolution
Table 12: Concurrency Control
Table 13: Spark Integration
Table 14: Flink Integration
Table 15: Trino, Glue, and EMR Integration
Table 16: Hudi SQL DDL Reference
References

Table 1: Core Architecture Concepts

Hudi organises data into file groups within partitions. Each file group contains an ordered sequence of file slices, and each file slice consists of a base file (Parquet) plus optional delta/log files (Avro or Parquet data blocks). The timeline, a log of all table actions stored under `.hoodie/`, provides MVCC-based snapshot isolation, so readers never see partial writes. The metadata table (itself a hidden Hudi table) replaces expensive file-listing calls with indexed lookups.
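
The layout described above can be pictured with a small in-memory model. This is a minimal sketch, not Hudi's actual implementation: the class names `FileSlice` and `FileGroup`, their methods, and the sample file names are hypothetical, and it assumes instant times are fixed-width timestamp strings that sort lexicographically. It also skips the timeline check for completed instants that a real reader performs.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FileSlice:
    """One version of a file group: a base file plus the logs written on top of it."""
    base_instant: str                      # instant time that started this slice
    base_file: Optional[str] = None        # Parquet base file (None if only logs exist)
    log_files: List[str] = field(default_factory=list)


@dataclass
class FileGroup:
    """All file slices that share one fileId inside one partition."""
    partition: str
    file_id: str
    slices: Dict[str, FileSlice] = field(default_factory=dict)

    def _slice(self, base_instant: str) -> FileSlice:
        # Create the slice on first use, then reuse it.
        return self.slices.setdefault(base_instant, FileSlice(base_instant))

    def add_base_file(self, instant: str, name: str) -> None:
        self._slice(instant).base_file = name

    def add_log_file(self, base_instant: str, name: str) -> None:
        self._slice(base_instant).log_files.append(name)

    def latest_slice(self, as_of: Optional[str] = None) -> Optional[FileSlice]:
        # Snapshot read: pick the newest slice whose base instant is <= as_of.
        visible = [t for t in self.slices if as_of is None or t <= as_of]
        return self.slices[max(visible)] if visible else None


# Hypothetical sample data: two slices in one file group.
fg = FileGroup(partition="2024/01/01", file_id="f1")
fg.add_base_file("20240101000000000", "base-v1.parquet")
fg.add_log_file("20240101000000000", "delta-1.log")
fg.add_base_file("20240102000000000", "base-v2.parquet")

current = fg.latest_slice()
earlier = fg.latest_slice(as_of="20240101120000000")
print(current.base_file, earlier.base_file, earlier.log_files)
```

A reader pinned to the earlier instant still sees `base-v1.parquet` merged with its log file, while a new reader sees `base-v2.parquet`. That is the intuition behind readers never observing a partially written slice.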

Concept	Example	Description
File Group	all files under `<partition>/` that share one `<fileId>`	• Logical unit for a set of records sharing the same `fileId` • stored inside the data partition directory, not under `.hoodie/` • the index maps each record key to one file group
File Slice	base file + 0..N log files at one instant	• Versioned snapshot of a file group at a commit instant • latest slice = current data
Base File	`<fileId>_<writeToken>_<commitTime>.parquet`	• Columnar Parquet file containing full row data • written at compaction (MOR) or every commit (COW)
Log / Delta File	`.<fileId>_<commitTime>.log.1`	• Append-only file storing delta records (inserts, updates, deletes) for MOR tables • read-merged at query time
Metadata Table	`.hoodie/metadata/`	• Hidden Hudi MOR table storing the `files`, `column_stats`, `partition_stats`, `bloom_filters`, and `record_index` (record-level index, RLI) partitions • eliminates O(N) file listings
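
To illustrate the Metadata Table row, the sketch below collects writer options that switch on the metadata table and some of its index partitions. The helper name `metadata_table_options` and the table name `trips` are hypothetical. The option keys are taken from the Hudi configuration reference as found in recent releases; availability varies by version (the record-level index option, for example, is assumed to need 0.14 or later), so verify them against the release in use.

```python
from typing import Dict


def metadata_table_options(table_name: str) -> Dict[str, str]:
    """Build Hudi writer options that enable the metadata table and its indexes.

    Assumption: keys match a recent Hudi release; older versions may ignore
    or reject some of them.
    """
    return {
        "hoodie.table.name": table_name,
        # Master switch for the metadata table (the `files` partition).
        "hoodie.metadata.enable": "true",
        # Per-file column min/max statistics used for data skipping.
        "hoodie.metadata.index.column.stats.enable": "true",
        # Bloom filters stored centrally instead of in base-file footers.
        "hoodie.metadata.index.bloom.filter.enable": "true",
        # Record-level index: record key -> file group location.
        "hoodie.metadata.record.index.enable": "true",
    }


opts = metadata_table_options("trips")
for key in sorted(opts):
    print(f"{key}={opts[key]}")
```

With a Spark DataFrame, these options would typically be passed alongside the usual record key and partition settings, for example `df.write.format("hudi").options(**opts).mode("append").save(base_path)`, where `df` and `base_path` are placeholders.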
