Skip to main content

Menu

LEVEL 0
0/5 XP
HomeAboutTopicsPricingMy VaultStats

Categories

πŸ€– Artificial Intelligence
☁️ Cloud and Infrastructure
πŸ’Ύ Data and Databases
πŸ’Ό Professional Skills
🎯 Programming and Development
πŸ”’ Security and Networking
πŸ“š Specialized Topics
HomeAboutTopicsPricingMy VaultStats
LEVEL 0
0/5 XP
GitHub
Β© 2026 CheatGridβ„’. All rights reserved.
Privacy PolicyTerms of UseAboutContact

Spark SQL Cheat Sheet

Spark SQL Cheat Sheet

Back to Data Engineering
Updated 2026-04-21
Next Topic: Stream Processing Fundamentals Cheat Sheet

Spark SQL is the structured data processing module within Apache Spark, providing a full SQL interface for querying and manipulating distributed datasets alongside DataFrame/Dataset APIs. It leverages the Catalyst optimizer and Tungsten execution engine to generate efficient physical plans from declarative SQL statements. Spark 4.0 introduced ANSI SQL compliance by default, the VARIANT semi-structured data type, PIPE syntax (|>) for readable query chaining, SQL Scripting with procedural control flow, and SQL user-defined functionsβ€”all running on JDK 17+ with Scala 2.13. Understanding how Spark SQL translates declarative queries into optimized distributed execution, manages data partitioning across clusters, and selects join strategies is essential for building scalable data pipelines that process terabytes efficiently.

What This Cheat Sheet Covers

This topic spans 20 focused tables and 295 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Core SQL QueriesTable 2: Data Definition (DDL)Table 3: Data Manipulation (DML)Table 4: Views and Temporary TablesTable 5: SQL Scripting and Dynamic SQLTable 6: Aggregation and GroupingTable 7: Join OperationsTable 8: Window FunctionsTable 9: String FunctionsTable 10: Date and Time FunctionsTable 11: Array, Map, and Complex TypesTable 12: Null Handling and Error-Safe FunctionsTable 13: Conditional, Math, and ConversionTable 14: JSON, CSV, and Variant FunctionsTable 15: Data TypesTable 16: Pipe Syntax (Spark 4.0+)Table 17: User-Defined FunctionsTable 18: Join Strategies and Query HintsTable 19: Optimization and Performance TuningTable 20: Configuration and Runtime Properties

Table 1: Core SQL Queries

CommandExampleDescription
SELECT
SELECT name, age FROM employees
Projects specific columns from a table or view.
WHERE
SELECT * FROM employees WHERE age > 21
Filters rows based on a boolean condition.
ORDER BY
SELECT * FROM employees ORDER BY salary DESC
β€’ Sorts the full result set globally
β€’ triggers a full shuffle across partitions.
LIMIT
SELECT * FROM employees LIMIT 10
Returns only the first N rows from the result.
DISTINCT
SELECT DISTINCT department FROM employees
Returns rows with duplicate values removed for the specified columns.
Column Alias (AS)
SELECT salary * 12 AS annual_salary FROM employees
Assigns a readable name to a computed expression or column.
CASE WHEN
SELECT CASE WHEN age < 18 THEN 'minor'
ELSE 'adult' END AS category FROM t
Conditional expression β€” SQL equivalent of if-then-else logic.
UNION / UNION ALL
SELECT name FROM t1 UNION ALL SELECT name FROM t2
β€’ UNION ALL combines results keeping duplicates
β€’ UNION removes duplicates.
INTERSECT
SELECT id FROM t1 INTERSECT SELECT id FROM t2
Returns rows present in both result sets.
EXCEPT (MINUS)
SELECT id FROM t1 EXCEPT SELECT id FROM t2
Returns rows in the first result not present in the second.
Subquery (scalar)
SELECT name, (SELECT MAX(salary) FROM employees) AS max_sal
FROM employees
A query nested inside SELECT, WHERE, or FROM that returns a single value.
EXISTS / NOT EXISTS
SELECT * FROM orders o
WHERE EXISTS (SELECT 1 FROM returns r WHERE r.order_id = o.id)
Tests whether a correlated subquery returns any rows.

More in Data Engineering

  • Snowflake Data Cloud Cheat Sheet
  • Stream Processing Fundamentals Cheat Sheet
  • Airbyte Open-Source ELT Cheat Sheet
  • Big Data Storage Formats Cheat Sheet
  • Data Wrangling Cheat Sheet
  • Enterprise Data Governance Cheat Sheet
View all 53 topics in Data Engineering