Pandas API on Spark Cheat Sheet

Updated 2026-04-21

Pandas API on Spark is a distributed DataFrame implementation that provides a pandas-like interface on top of Apache Spark, so data scientists can scale pandas workflows beyond single-machine memory limits with few code changes. Originally developed as the Koalas project, it was merged into PySpark in Spark 3.2 and continues to evolve: Spark 4.0 brought improved Arrow integration, deprecated applymap() in favor of map(), and enhanced Spark Connect support. The key mental model is pandas syntax with Spark execution. Operations are evaluated lazily and run in parallel across the cluster, but you call the same .groupby(), .merge(), and .fillna() methods you already know. Keep in mind that not all pandas APIs are supported, and operations that require a global ordering or single-partition computation (like rank() or sort_values()) can be significantly more expensive in a distributed context.

What This Cheat Sheet Covers

This topic spans 19 focused tables and 216 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: DataFrame Creation and Conversion
Table 2: Reading and Writing Data
Table 3: Data Inspection and Information
Table 4: Indexing and Selection
Table 5: Data Cleaning and Transformation
Table 6: Aggregation and Grouping
Table 7: Sorting and Ranking
Table 8: Merging and Joining
Table 9: Reshaping and Pivoting
Table 10: Apply and Custom Functions
Table 11: String Operations
Table 12: DateTime Operations
Table 13: Window Functions and Rolling Operations
Table 14: Plotting and Visualization
Table 15: Configuration and Options
Table 16: Spark Accessor and Interop
Table 17: Type Conversion and Data Types
Table 18: Performance and Best Practices
Table 19: Differences from Pandas

Table 1: DataFrame Creation and Conversion

Method	Example	Description
import pyspark.pandas	`import pyspark.pandas as ps`	• Import the pandas API on Spark module • conventionally aliased as `ps` to distinguish from regular pandas `pd`.
ps.DataFrame()	`df = ps.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})`	Create a pandas-on-Spark DataFrame from a dictionary, list, or other data structure — syntax mirrors pandas.
ps.from_pandas()	`ps_df = ps.from_pandas(pd_df)`	• Convert a regular pandas DataFrame to pandas-on-Spark • data is distributed across Spark partitions after conversion.
df.to_pandas()	`pd_df = ps_df.to_pandas()`	• Convert pandas-on-Spark DataFrame back to pandas • collects all data to driver memory — use cautiously with large datasets.
df.to_spark()	`spark_df = ps_df.to_spark()`	Convert pandas-on-Spark DataFrame to native PySpark DataFrame for using full Spark SQL/DataFrame API.
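
The conversions in Table 1 can be sketched together as below. This is a minimal illustration, not a recommended pattern: `to_pandas_if_small`, `round_trip_demo`, and the `max_rows` threshold are hypothetical names and values introduced here to show one way of being cautious with `to_pandas()`. The sketch assumes a working Spark installation for the demo function only.

```python
def to_pandas_if_small(psdf, max_rows=100_000):
    """Hypothetical guard: collect to the driver only below a row threshold.

    Accepts any object that supports len() and .to_pandas(), which includes
    pandas-on-Spark DataFrames. Note that len() on a pandas-on-Spark
    DataFrame triggers a Spark count job, so the check itself is not free.
    """
    n_rows = len(psdf)
    if n_rows > max_rows:
        raise ValueError(
            f"refusing to collect {n_rows} rows to the driver (limit {max_rows})"
        )
    return psdf.to_pandas()


def round_trip_demo():
    """Walk through the Table 1 conversions; requires pandas and Spark."""
    # Imported inside the function so the guard above has no Spark dependency.
    import pandas as pd
    import pyspark.pandas as ps

    pd_df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
    ps_df = ps.from_pandas(pd_df)        # pandas -> pandas-on-Spark
    spark_df = ps_df.to_spark()          # pandas-on-Spark -> PySpark DataFrame
    collected = to_pandas_if_small(ps_df)  # guarded collect back to pandas
    return spark_df, collected
```

Calling `round_trip_demo()` in an environment with Spark available returns the native PySpark DataFrame alongside the collected pandas copy. The row limit is arbitrary; a sensible value depends on row width and driver memory.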
