Spark SQL Cheat Sheet

Updated 2026-04-21

Next Topic: Stream Processing Fundamentals Cheat Sheet

Spark SQL is the structured data processing module within Apache Spark, providing a full SQL interface for querying and manipulating distributed datasets alongside DataFrame/Dataset APIs. It leverages the Catalyst optimizer and Tungsten execution engine to generate efficient physical plans from declarative SQL statements. Spark 4.0 introduced ANSI SQL compliance by default, the VARIANT semi-structured data type, PIPE syntax (|>) for readable query chaining, SQL Scripting with procedural control flow, and SQL user-defined functions—all running on JDK 17+ with Scala 2.13. Understanding how Spark SQL translates declarative queries into optimized distributed execution, manages data partitioning across clusters, and selects join strategies is essential for building scalable data pipelines that process terabytes efficiently.

What This Cheat Sheet Covers

This topic spans 20 focused tables and 295 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Core SQL QueriesTable 2: Data Definition (DDL)Table 3: Data Manipulation (DML)Table 4: Views and Temporary TablesTable 5: SQL Scripting and Dynamic SQLTable 6: Aggregation and GroupingTable 7: Join OperationsTable 8: Window FunctionsTable 9: String FunctionsTable 10: Date and Time FunctionsTable 11: Array, Map, and Complex TypesTable 12: Null Handling and Error-Safe FunctionsTable 13: Conditional, Math, and ConversionTable 14: JSON, CSV, and Variant FunctionsTable 15: Data TypesTable 16: Pipe Syntax (Spark 4.0+)Table 17: User-Defined FunctionsTable 18: Join Strategies and Query HintsTable 19: Optimization and Performance TuningTable 20: Configuration and Runtime Properties

Table 1: Core SQL Queries

The everyday building blocks of any query live here — projecting and filtering with SELECT and WHERE, sorting, deduplicating, and combining results with set operations. Spark also adds a few twists worth knowing, like the distinction between global ORDER BY and per-partition SORT BY, plus PIVOT, CTEs, and lateral subqueries for more expressive queries.

Command	Example	Description
SELECT	`SELECT name, age FROM employees`	Projects specific columns from a table or view.
WHERE	`SELECT * FROM employees WHERE age > 21`	Filters rows based on a boolean condition.
ORDER BY	`SELECT * FROM employees ORDER BY salary DESC`	• Sorts the full result set globally • triggers a full shuffle across partitions.
LIMIT	`SELECT * FROM employees LIMIT 10`	Returns only the first N rows from the result.
DISTINCT	`SELECT DISTINCT department FROM employees`	Returns rows with duplicate values removed for the specified columns.
Column Alias (AS)	`SELECT salary * 12 AS annual_salary FROM employees`	Assigns a readable name to a computed expression or column.
CASE WHEN	`SELECT CASE WHEN age < 18 THEN 'minor'` `ELSE 'adult' END AS category FROM t`	Conditional expression — SQL equivalent of if-then-else logic.
UNION / UNION ALL	`SELECT name FROM t1 UNION ALL SELECT name FROM t2`	• `UNION ALL` combines results keeping duplicates • `UNION` removes duplicates.
INTERSECT	`SELECT id FROM t1 INTERSECT SELECT id FROM t2`	Returns rows present in both result sets.
EXCEPT (MINUS)	`SELECT id FROM t1 EXCEPT SELECT id FROM t2`	Returns rows in the first result not present in the second.
Subquery (scalar)	`SELECT name, (SELECT MAX(salary) FROM employees) AS max_sal` `FROM employees`	A query nested inside SELECT, WHERE, or FROM that returns a single value.
EXISTS / NOT EXISTS	`SELECT * FROM orders o` `WHERE EXISTS (SELECT 1 FROM returns r WHERE r.order_id = o.id)`	Tests whether a correlated subquery returns any rows.

Table 1: Core SQL Queries

Command	Example	Description
SELECT	`SELECT name, age FROM employees`	Projects specific columns from a table or view.
WHERE	`SELECT * FROM employees WHERE age > 21`	Filters rows based on a boolean condition.
ORDER BY	`SELECT * FROM employees ORDER BY salary DESC`	• Sorts the full result set globally • triggers a full shuffle across partitions.
LIMIT	`SELECT * FROM employees LIMIT 10`	Returns only the first N rows from the result.
DISTINCT	`SELECT DISTINCT department FROM employees`	Returns rows with duplicate values removed for the specified columns.
Column Alias (AS)	`SELECT salary * 12 AS annual_salary FROM employees`	Assigns a readable name to a computed expression or column.
CASE WHEN	`SELECT CASE WHEN age < 18 THEN 'minor'` `ELSE 'adult' END AS category FROM t`	Conditional expression — SQL equivalent of if-then-else logic.
UNION / UNION ALL	`SELECT name FROM t1 UNION ALL SELECT name FROM t2`	• `UNION ALL` combines results keeping duplicates • `UNION` removes duplicates.
INTERSECT	`SELECT id FROM t1 INTERSECT SELECT id FROM t2`	Returns rows present in both result sets.
EXCEPT (MINUS)	`SELECT id FROM t1 EXCEPT SELECT id FROM t2`	Returns rows in the first result not present in the second.
Subquery (scalar)	`SELECT name, (SELECT MAX(salary) FROM employees) AS max_sal` `FROM employees`	A query nested inside SELECT, WHERE, or FROM that returns a single value.
EXISTS / NOT EXISTS	`SELECT * FROM orders o` `WHERE EXISTS (SELECT 1 FROM returns r WHERE r.order_id = o.id)`	Tests whether a correlated subquery returns any rows.