Databricks is a unified lakehouse platform built on Apache Spark that combines data warehousing and data lake capabilities, enabling organizations to process and analyze massive datasets at scale. Optimizing Databricks performance directly impacts query speed, cluster efficiency, and cloud costs β making it essential for production workloads. The key to effective optimization lies in understanding that Databricks provides multiple optimization layers: from Delta Lake file management (OPTIMIZE, Z-ordering, liquid clustering) to Spark query execution (AQE, low shuffle merge, predicate pushdown) to cluster resource tuning (autoscaling, Photon), and each layer compounds the performance gains when applied correctly.
What This Cheat Sheet Covers
This topic spans 9 focused tables and 66 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Delta Lake File Optimization
| Technique | Example | Description |
|---|---|---|
OPTIMIZE my_tableOPTIMIZE my_table ZORDER BY (col1, col2) | β’ Compacts small files into larger ones (targeting 1GB per file) to reduce metadata overhead and improve read performance β’ combines with Z-order for co-location. | |
CREATE TABLE events CLUSTER BY (date, region) | β’ Replaces partitioning and Z-order with adaptive clustering that automatically maintains data layout β’ supports incremental clustering during writes without full table rewrites. | |
OPTIMIZE events ZORDER BY (date, user_id) | Co-locates related data within files using a space-filling curve to maximize data skipping during queries with filters on Z-ordered columns. | |
ALTER TABLE my_table SET TBLPROPERTIES ('delta.autoOptimize.autoCompact'='AUTO') | β’ Automatically runs OPTIMIZE, VACUUM, and statistics collection based on usage patterns β’ enabled by default for Unity Catalog managed tables and being rolled out globally (completing April 2026). | |
ALTER TABLE my_table SET TBLPROPERTIES ('delta.autoOptimize.optimizeWrite'='true') | β’ Automatically compacts small files during writes β’ combines optimizeWrite (reduces file count during write) and autoCompact (post-write compaction). | |
OPTIMIZE my_table FULL | β’ Forces full reclustering of all records in a liquid-clustered table (DBR 16.0+) β’ more expensive than incremental OPTIMIZE β use after changing cluster keys or when data layout has degraded. | |
VACUUM my_table RETAIN 168 HOURS | β’ Removes old data files no longer referenced by the Delta transaction log to reclaim storage β’ default retention is 7 days, critical for time travel support. |