PySpark Cheat Sheet

Updated 2026-04-20

PySpark is the Python API for Apache Spark, the distributed computing framework designed for large-scale data processing. With the release of Spark 4.0, which brings ANSI SQL mode by default, the VARIANT data type, Python UDTFs, Spark Connect, and over 5,100 improvements, PySpark offers a richer set of tools for data scientists and engineers processing terabytes of data across clusters. Understanding the lazy evaluation model is critical: transformations only build a logical plan, and nothing executes until an action is called, which lets Spark's Catalyst optimizer see the whole plan and choose an efficient physical execution strategy. This approach makes PySpark powerful for big data workloads and accessible to those familiar with pandas-style operations.

What This Cheat Sheet Covers

This topic spans 31 focused tables and 314 indexed concepts. Below is a complete table-by-table outline, from foundational concepts through advanced details.

Table 1: SparkSession Initialization
Table 2: Reading Data Sources
Table 3: DataFrame Creation
Table 4: Schema Definition
Table 5: Basic DataFrame Operations
Table 6: Column Selection and Manipulation
Table 7: Filtering and Conditional Logic
Table 8: Aggregations
Table 9: Sorting and Ranking
Table 10: Joins
Table 11: Set Operations
Table 12: Null Handling
Table 13: String Functions
Table 14: Math Functions
Table 15: Date and Time Functions
Table 16: JSON and Struct Functions
Table 17: Array, Map, and Higher-Order Functions
Table 18: Window Functions
Table 19: Pivoting and Reshaping
Table 20: User-Defined Functions
Table 21: Partitioning and Repartitioning
Table 22: Caching, Persistence, and Checkpointing
Table 23: Writing Data
Table 24: SQL Integration
Table 25: RDD Operations
Table 26: Broadcast and Accumulators
Table 27: Sampling and Statistics
Table 28: Machine Learning (MLlib)
Table 29: Structured Streaming
Table 30: Performance Optimization
Table 31: Configuration

Table 1: SparkSession Initialization

Method	Example	Description
SparkSession.builder	`spark = SparkSession.builder.appName("MyApp").getOrCreate()`	• Entry point for creating a Spark application • Initializes or retrieves an existing SparkSession
config	`spark = SparkSession.builder.config("spark.executor.memory", "4g").getOrCreate()`	Sets Spark configuration properties like executor memory, parallelism, or shuffle partitions
master	`spark = SparkSession.builder.master("local[*]").getOrCreate()`	• Defines the cluster manager URL • `local[*]` uses all available cores on the local machine
appName	`spark = SparkSession.builder.appName("DataPipeline").getOrCreate()`	Sets application name visible in Spark UI for monitoring and debugging
getOrCreate	`spark = SparkSession.builder.getOrCreate()`	Returns existing active session or creates new one if none exists
