Pandas API on Spark is a distributed DataFrame implementation that provides a pandas-like interface on top of Apache Spark, letting data scientists scale pandas workflows beyond single-machine memory limits with minimal code changes. Originally developed as the Koalas project, it was merged into PySpark in Spark 3.2 and continues to evolve: Spark 4.0 brought improved Arrow integration, deprecated applymap() in favor of map(), and enhanced Spark Connect support. The key mental model is pandas syntax with Spark execution: evaluation is lazy and processing is distributed across the cluster, but you use the same .groupby(), .merge(), and .fillna() methods you already know. Keep in mind that not all pandas APIs are supported, and operations that require a global ordering or single-partition computation (like rank() or sort_values()) can be significantly more expensive in a distributed context.
What This Cheat Sheet Covers
This topic spans 19 focused tables and 216 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: DataFrame Creation and Conversion
| Method | Example | Description |
|---|---|---|
| Import | `import pyspark.pandas as ps` | Import the pandas API on Spark module; conventionally aliased as `ps` to distinguish it from regular pandas (`pd`). |
| `ps.DataFrame()` | `df = ps.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})` | Create a pandas-on-Spark DataFrame from a dictionary, list, or other data structure; the syntax mirrors pandas. |
| `ps.from_pandas()` | `ps_df = ps.from_pandas(pd_df)` | Convert a regular pandas DataFrame to pandas-on-Spark; the data is distributed across Spark partitions after conversion. |
| `to_pandas()` | `pd_df = ps_df.to_pandas()` | Convert a pandas-on-Spark DataFrame back to pandas; this collects all data to driver memory, so use it cautiously with large datasets. |
| `to_spark()` | `spark_df = ps_df.to_spark()` | Convert a pandas-on-Spark DataFrame to a native PySpark DataFrame to use the full Spark SQL/DataFrame API. |