Pandas API on Spark is a distributed DataFrame implementation that provides a pandas-like interface on top of Apache Spark, letting data scientists scale pandas workflows beyond single-machine memory limits with minimal code changes. Originally developed as the Koalas project, it was merged into PySpark in Spark 3.2 and continues to evolve: Spark 4.0 brought improved Arrow integration, deprecated applymap() in favor of map(), and enhanced Spark Connect support. The key mental model is pandas syntax with Spark execution: evaluation is lazy and the work is distributed across a cluster, yet you call the same .groupby(), .merge(), and .fillna() methods you already know. Keep in mind that not all pandas APIs are supported, and operations that need a global ordering or that pull data onto a single partition (like sort_values() or rank()) can be significantly more expensive in a distributed context.
What This Cheat Sheet Covers
This topic spans 19 focused tables and 216 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: DataFrame Creation and Conversion
Everything starts with getting data into, and out of, a pandas-on-Spark DataFrame. These methods are the bridges between three worlds: regular pandas, native PySpark, and the pandas API on Spark. Knowing which conversion you need (and which ones quietly pull all data back to the driver, like to_pandas()) is the difference between a smooth workflow and a driver that runs out of memory.
| Example | Description |
|---|---|
| `import pyspark.pandas as ps` | Import the pandas API on Spark module; conventionally aliased as `ps` to distinguish it from regular pandas `pd`. |
| `df = ps.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})` | Create a pandas-on-Spark DataFrame from a dictionary, list, or other data structure; the syntax mirrors pandas. |
| `ps_df = ps.from_pandas(pd_df)` | Convert a regular pandas DataFrame to pandas-on-Spark; the data is distributed across Spark partitions after conversion. |
| `pd_df = ps_df.to_pandas()` | Convert a pandas-on-Spark DataFrame back to pandas; this collects all data into driver memory, so use it cautiously with large datasets. |
| `spark_df = ps_df.to_spark()` | Convert a pandas-on-Spark DataFrame to a native PySpark DataFrame to use the full Spark SQL/DataFrame API. |