Spark SQL is the structured data processing module within Apache Spark, providing a full SQL interface for querying and manipulating distributed datasets alongside DataFrame/Dataset APIs. It leverages the Catalyst optimizer and Tungsten execution engine to generate efficient physical plans from declarative SQL statements. Spark 4.0 introduced ANSI SQL compliance by default, the VARIANT semi-structured data type, PIPE syntax (|>) for readable query chaining, SQL Scripting with procedural control flow, and SQL user-defined functionsβall running on JDK 17+ with Scala 2.13. Understanding how Spark SQL translates declarative queries into optimized distributed execution, manages data partitioning across clusters, and selects join strategies is essential for building scalable data pipelines that process terabytes efficiently.
What This Cheat Sheet Covers
This topic spans 20 focused tables and 295 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Core SQL Queries
| Command | Example | Description |
|---|---|---|
SELECT name, age FROM employees | Projects specific columns from a table or view. | |
SELECT * FROM employees WHERE age > 21 | Filters rows based on a boolean condition. | |
SELECT * FROM employees ORDER BY salary DESC | β’ Sorts the full result set globally β’ triggers a full shuffle across partitions. | |
SELECT * FROM employees LIMIT 10 | Returns only the first N rows from the result. | |
SELECT DISTINCT department FROM employees | Returns rows with duplicate values removed for the specified columns. | |
SELECT salary * 12 AS annual_salary FROM employees | Assigns a readable name to a computed expression or column. | |
SELECT CASE WHEN age < 18 THEN 'minor' ELSE 'adult' END AS category FROM t | Conditional expression β SQL equivalent of if-then-else logic. | |
SELECT name FROM t1 UNION ALL SELECT name FROM t2 | β’ UNION ALL combines results keeping duplicatesβ’ UNION removes duplicates. | |
SELECT id FROM t1 INTERSECT SELECT id FROM t2 | Returns rows present in both result sets. | |
SELECT id FROM t1 EXCEPT SELECT id FROM t2 | Returns rows in the first result not present in the second. | |
SELECT name, (SELECT MAX(salary) FROM employees) AS max_sal FROM employees | A query nested inside SELECT, WHERE, or FROM that returns a single value. | |
SELECT * FROM orders o WHERE EXISTS (SELECT 1 FROM returns r WHERE r.order_id = o.id) | Tests whether a correlated subquery returns any rows. |