© 2026 CheatGrid™. All rights reserved.

Apache Hudi Cheat Sheet


Apache Hudi (Hadoop Upserts Deletes and Incrementals) is an open-source data lakehouse table format that provides ACID transactions, record-level updates/deletes, incremental processing, and streaming ingestion on distributed file systems and cloud object stores (HDFS, S3, GCS, ADLS). It sits atop Parquet/Avro files and integrates with Spark, Flink, Trino, Hive, Presto, and cloud-native catalog services (Glue, Unity Catalog, Polaris). This cheat sheet covers the core concepts, from table types and write operations through indexing, compaction, clustering, schema evolution, and multi-engine integrations.


What This Cheat Sheet Covers

This topic spans 17 focused tables and 284 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

  • Table 1: Core Architecture Concepts
  • Table 2: Table Types - Copy-on-Write vs Merge-on-Read
  • Table 3: Query Types
  • Table 4: Write Operations
  • Table 5: Record Keys and Key Generators
  • Table 6: Indexing Options
  • Table 7: Timeline and Instant States
  • Table 8: Compaction
  • Table 9: Clustering
  • Table 10: Cleaning
  • Table 11: Schema Evolution
  • Table 12: Concurrency Control
  • Table 13: Spark Integration
  • Table 14: Flink Integration
  • Table 15: Trino, Glue, and EMR Integration
  • Table 16: Hudi SQL DDL Reference
  • References

Table 1: Core Architecture Concepts

Hudi organises data into file groups within partitions. Each file group contains ordered file slices consisting of a base file (Parquet) and optional delta/log files (Avro or Parquet). The timeline, a log of all table actions stored in .hoodie/, provides MVCC snapshot isolation so readers never see partial writes. The metadata table (a hidden Hudi table itself) replaces expensive file-listing calls with indexed lookups.
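As a rough illustration of the layout described above, the following Python sketch groups the files of one partition into file groups and file slices. It assumes base files are named <fileId>_<writeToken>_<instantTime>.parquet and log files .<fileId>_<baseInstantTime>.log.<version>_<writeToken>, a convention that should be verified against the Hudi version in use. The sample file names and the helper group_file_slices are hypothetical and not part of any Hudi API.

```python
import re
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Assumed naming patterns; verify against your Hudi version.
BASE_RE = re.compile(
    r"^(?P<file_id>[^_.][^_]*)_(?P<token>[^_]+)_(?P<instant>\d+)\.parquet$"
)
LOG_RE = re.compile(
    r"^\.(?P<file_id>[^_]+)_(?P<instant>\d+)\.log\.(?P<version>\d+)(?:_(?P<token>.+))?$"
)


@dataclass
class FileSlice:
    instant: str                       # instant time the slice is based on
    base_file: Optional[str] = None    # at most one base file per slice
    log_files: List[str] = field(default_factory=list)


def log_version(name: str) -> int:
    """Extract the numeric log version so log files sort in write order."""
    return int(LOG_RE.match(name)["version"])


def group_file_slices(file_names: List[str]) -> Dict[str, List[FileSlice]]:
    """Return {file_id: [FileSlice, ...]}, slices ordered oldest to newest."""
    groups: Dict[str, Dict[str, FileSlice]] = defaultdict(dict)
    for name in file_names:
        m = BASE_RE.match(name)
        is_base = m is not None
        if m is None:
            m = LOG_RE.match(name)
        if m is None:
            continue  # not a data file, e.g. .hoodie_partition_metadata
        slices = groups[m["file_id"]]
        fs = slices.setdefault(m["instant"], FileSlice(instant=m["instant"]))
        if is_base:
            fs.base_file = name
        else:
            fs.log_files.append(name)

    result: Dict[str, List[FileSlice]] = {}
    for file_id, slices in groups.items():
        # Instant times are fixed-width timestamps, so string order is time order.
        ordered = [slices[i] for i in sorted(slices)]
        for fs in ordered:
            fs.log_files.sort(key=log_version)
        result[file_id] = ordered
    return result


# Hypothetical listing of one partition directory.
files = [
    ".hoodie_partition_metadata",
    "fg1-0_1-0-1_20240101100000000.parquet",
    ".fg1-0_20240101100000000.log.2_1-0-1",
    ".fg1-0_20240101100000000.log.1_1-0-1",
    "fg1-0_1-0-1_20240102100000000.parquet",
    "fg2-0_2-0-1_20240101100000000.parquet",
]
slices = group_file_slices(files)
latest = slices["fg1-0"][-1]
print(latest.instant, latest.base_file, latest.log_files)
```

The last element of each list is the latest file slice, which is what a reader of the current table state would use. Real Hudi resolves file slices through the timeline rather than by listing names, so this only models the grouping idea.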

| Concept | Example | Description |
|---|---|---|
| File Group | all files in <partition>/ sharing one <fileId> | Logical unit for a set of records sharing the same fileId; the index maps each record key to exactly one file group |
| File Slice | base file + 0..N log files at one instant | Versioned snapshot of a file group at a commit instant; latest slice = current data |
| Base File | <fileId>_<writeToken>_<commitTime>.parquet | Columnar Parquet file containing full row data; written at every commit that touches the file group (COW) or at compaction (MOR) |
| Log / Delta File | .<fileId>_<baseCommitTime>.log.1_<writeToken> | Append-only file storing delta records (inserts, updates, deletes) for MOR tables; merged with the base file at query time |
| Metadata Table | .hoodie/metadata/ | Hidden Hudi MOR table storing the files, column_stats, partition_stats, bloom_filters, and record-level (rli) indexes; eliminates O(N) file listings |
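The read-time merge of an MOR file slice can be sketched in a few lines of Python. This is a simplified model, not Hudi's actual reader: records are plain dicts, each log file is a list of (operation, record) tuples, and the last entry written for a key wins, whereas Hudi resolves conflicts through configurable merge logic. The function merge_file_slice, the key field "id", and the sample data are all hypothetical.

```python
def merge_file_slice(base_records, log_files, key="id"):
    """Apply log entries, oldest first, on top of the base-file records."""
    merged = {rec[key]: rec for rec in base_records}
    for log in log_files:              # log files in write order
        for op, rec in log:
            if op == "delete":
                merged.pop(rec[key], None)
            else:                      # "insert" or "update"
                merged[rec[key]] = rec
    return list(merged.values())


# Hypothetical contents of one file slice: a base file plus two log files.
base = [{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Lima"}]
logs = [
    [("update", {"id": 1, "city": "Bergen"}), ("insert", {"id": 3, "city": "Pune"})],
    [("delete", {"id": 2})],
]

snapshot = merge_file_slice(base, logs)
print(snapshot)
```

In this model, reading only base corresponds to skipping the log files entirely, while snapshot reflects every delta in the slice. Compaction would amount to writing snapshot out as the base file of a new slice.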
