Polars is a blazingly fast DataFrame library built in Rust and designed for high-performance data processing. Its multi-threaded query engine uses Apache Arrow columnar memory, supports both eager and lazy execution, and since Polars 1.0 offers a production-stable API. The new streaming engine (2025–2026) processes datasets larger than RAM in batches, optional GPU acceleration via NVIDIA RAPIDS runs queries on CUDA hardware, and Polars Cloud brings distributed execution to any scale. This cheat sheet covers everything from basic DataFrame operations to advanced optimization, streaming sinks, GPU execution, and the latest type system additions including stable Decimal, Int128, Enum, and Array.
23 tables, 335 concepts. Select a concept node to jump to its table row.
Table 1: Fundamentals
The entry points to Polars — creating DataFrames, reading common file formats, and inspecting data. These methods cover the most frequent first steps in any Polars script or notebook.
| Method | Example | Description |
|---|---|---|
import polars as pl | • Standard import convention • pl is the universal alias | |
df = pl.DataFrame({"col1": [1, 2], "col2": ["a", "b"]}) | Create DataFrame from dict, list of dicts, or Series | |
df = pl.read_csv("data.csv") | Read CSV into an eager DataFrame with full data in memory | |
df = pl.read_parquet("data.parquet") | • Read Parquet into memory • preferred format for performance | |
lf = pl.scan_csv("data.csv") | • Create a lazy query plan without loading data • enables optimization | |
df = lf.collect() | Execute lazy plan and materialize results into a DataFrame | |
df.head(10) | Return first n rows for quick inspection | |
df.tail(10) | Return last n rows | |
df.shape | Returns (rows, columns) tuple | |
df.columns | List all column names | |
df.schema | Mapping of column names to data types | |
df.dtypes | List data types for all columns in order | |
df.describe() | Summary statistics (count, mean, std, min, max, etc.) | |
df.glimpse() | • One-line-per-column overview of schema and sample values • useful for wide DataFrames | |
df.write_csv("output.csv") | Export DataFrame to CSV | |
df.write_parquet("output.parquet") | • Export to Parquet with compression • best format for most data | |
df["city"].value_counts(sort=True) | Count occurrences of each unique value in a Series | |
df = pl.read_excel("data.xlsx", sheet_name="Sheet1") | Read Excel file using the fast calamine engine by default | |
df.write_excel("output.xlsx", worksheet="data") | Write DataFrame to an Excel file (requires xlsxwriter) | |
df = pl.read_database("SELECT * FROM tbl", conn) | Read SQL query results from any database connection | |
df = pl.read_database_uri("SELECT * FROM tbl", "postgresql://user:pw@host/db") | Read from database using a URI connection string (connectorx or ADBC) |
Table 2: Expressions and Contexts
Expressions are the building blocks of all Polars queries — reusable, composable descriptions of transformations that the engine evaluates in parallel. Understanding the four contexts (select, with_columns, filter, group_by) is the key mental model for writing idiomatic Polars.
| Method | Example | Description |
|---|---|---|
pl.col("age") | • Reference a column by name • the foundation of all expressions | |
df.select(pl.col("name"), pl.col("age")) | • Select and/or transform columns • returns new DataFrame | |
df.with_columns(pl.col("age") + 1) | Add or replace columns while keeping all others | |
df.filter(pl.col("age") > 25) | Keep rows where boolean expression is true | |
df.group_by("city").agg(pl.col("age").mean()) | Group rows and apply aggregation expressions | |
pl.col("age").alias("years") | Rename expression result to a new column name | |
pl.when(pl.col("age") > 18).then(pl.lit("adult")).otherwise(pl.lit("minor")) | Vectorized conditional — equivalent to SQL CASE WHEN | |
pl.col("code").replace({1: "a", 2: "b"}) | • Map specific values to new values • cleaner than when/then for lookups | |
pl.col("age").cast(pl.Float64) | Convert column to a different data type | |
pl.lit(10) | Create a scalar literal expression for use in arithmetic or conditions | |
df.select(pl.len()) | • Count rows in context — equivalent to COUNT(*) in SQL• replaces pl.count() | |
df.select(pl.all()) | • Select all columns • often used in aggregation ( pl.all().sum()) | |
df.select(pl.all().exclude("id")) | Select all columns except specified ones | |
pl.col("name").str.to_uppercase().str.strip_chars() | • Chain multiple transformations • the engine optimizes the full chain | |
df.select(pl.col("age").mean(), pl.col("salary").sum()) | Compute multiple aggregations in one pass | |
pl.int_range(0, pl.len(), dtype=pl.UInt32).alias("idx") | • Generate integer sequence • use with pl.len() for row indices | |
df.select(pl.col("^.*_id$")) | Select columns matching a regex pattern via pl.col() | |
df.select(pl.col(pl.Int64)) | Select all columns of a specific data type |
Table 3: Lazy vs Eager Execution
Lazy execution is the idiomatic Polars pattern for any non-trivial query. The engine builds a logical plan, applies rewrites (predicate pushdown, projection pruning, CSE), then executes optimally. The new engine parameter on collect() enables streaming or GPU execution without changing query syntax.
| Method | Example | Description |
|---|---|---|
lf = pl.scan_csv("data.csv") | • Lazy CSV scan • does not load data until .collect() | |
lf = pl.scan_parquet("data.parquet") | Lazy Parquet scan with predicate and projection pushdown | |
lf = pl.scan_ndjson("data.ndjson") | Lazy scan for newline-delimited JSON | |
lf = pl.scan_ipc("data.ipc") | Lazy scan for Apache Arrow IPC/Feather format | |
lf = df.lazy() | Convert an eager DataFrame to a LazyFrame | |
df = lf.collect() | Execute the lazy plan with all optimizations applied | |
df = lf.collect(engine="streaming") | • Execute in the new streaming engine • processes data in batches for larger-than-RAM queries | |
df = lf.collect(engine="gpu") | • Execute on an NVIDIA GPU via RAPIDS cuDF • requires pip install polars[gpu] | |
for batch in lf.collect_batches(chunk_size=50_000): process(batch) | • Streaming generator yielding sub-DataFrames • use when logic cannot be expressed as a sink | |
r1, r2 = pl.collect_all([lf1, lf2]) | Collect multiple LazyFrames with shared CSE optimization | |
df, timings = lf.profile() | Execute query and return a timing breakdown per node for performance investigation | |
print(lf.explain()) | Print the optimized query plan without executing it | |
lf.show_graph() | Visualize query plan as a graph | |
df = lf.fetch(n_rows=100) | • Execute on first n rows only • useful for schema checking during development | |
lf_cached = lf.cache() | Cache expensive intermediate results when a LazyFrame is reused in multiple branches |
Table 4: Query Optimization
Polars applies these optimizations automatically on every lazy query. Understanding them helps you write queries that cooperate with the optimizer rather than fighting it. Most optimizations are on by default; you can inspect which applied using explain().
| Technique | Example | Description |
|---|---|---|
lf.filter(pl.col("age") > 25).collect() | Pushes filter conditions down to the source so fewer rows are read | |
lf.select("name", "age").collect() | Reads only required columns, reducing I/O and memory | |
pl.collect_all([lf.agg1(), lf.agg2()]) | Detects shared subplans across multiple LazyFrames and executes them once | |
lf.with_columns(expr1, expr2) where expr1 and expr2 share sub-expressions | Caches identical sub-expressions within a single query to avoid recomputation | |
lf.head(100).collect() | Limits rows read from source when .head(), .tail(), or .slice() is used | |
lf.join(other, ...).filter(...) | Merges a join and adjacent filters into a single faster join operation | |
Multiple sequential with_columns(...) calls | Combines independent sequential with_columns calls into a single pass | |
lf_cached = lf.cache() | Manually cache an expensive result that is consumed more than once | |
pl.Config.set_engine_affinity("streaming") | Globally set default execution engine for all subsequent queries | |
df.rechunk() | Consolidate fragmented memory chunks for better cache locality after incremental builds |
Table 5: Data Selection and Filtering
Core row and column selection patterns. These are the operations used most frequently in any Polars workflow — master them before moving to joins and aggregations.
| Method | Example | Description |
|---|---|---|
df.filter(pl.col("age") > 30) | Keep rows matching a boolean expression | |
df.filter((pl.col("age") > 25) & (pl.col("city") == "NYC")) | Combine conditions with & (AND) | |
df.filter((pl.col("age") < 20) | (pl.col("age") > 60)) | • Combine conditions with |• (OR) | |
df.filter(pl.col("city").is_in(["NYC", "LA", "SF"])) | Check if values are in a list | |
df.filter(pl.col("age").is_between(20, 30)) | Check if values fall within a range (inclusive by default) | |
df.sort("age", descending=True) | Sort rows by one or more columns | |
df.sort(["city", "age"], descending=[False, True]) | Sort by multiple columns with per-column order direction | |
df.with_row_index("idx") | Add a zero-based integer row index column | |
df.sample(n=100, seed=42) | Randomly sample rows with optional reproducibility seed | |
df.slice(10, 20) | Extract rows with offset and length | |
df.unique(subset=["name", "city"]) | Remove duplicate rows based on a subset of columns | |
df.drop_nulls(subset=["age"]) | Remove rows with null values in specified columns | |
lf.gather_every(n=2, offset=0).collect() | • Take every nth row • works on both DataFrame and LazyFrame | |
df.select(pl.col("^.*_id$")) | Select columns matching a regex pattern | |
df.select(pl.col(pl.Int64)) | Select all columns of a specific data type |
Table 6: Joins
Polars joins are highly optimized in both the in-memory and streaming engines. The how="full" outer join (formerly "outer") and the new join_where() for inequality predicates cover the most common relational join patterns.
| Method | Example | Description |
|---|---|---|
df1.join(df2, on="id", how="inner") | Keep only rows with matching keys in both DataFrames | |
df1.join(df2, on="id", how="left") | Keep all rows from left DataFrame, add matching from right | |
df1.join(df2, on="id", how="right") | Keep all rows from right DataFrame, add matching from left | |
df1.join(df2, on="id", how="full", coalesce=True) | • Keep all rows from both • nulls where no match (renamed from "outer" in 1.0) | |
df1.join(df2, on="id", how="semi") | • Keep rows from left that have at least one match in right • no right columns returned | |
df1.join(df2, on="id", how="anti") | Keep rows from left that have no match in right | |
df1.join(df2, how="cross") | Cartesian product — all combinations of rows from both | |
lf1.join_where(lf2, pl.col("dur") < pl.col("time")) | • Inner join on one or more inequality predicates • supports <, >, <=, >= | |
df1.join_asof(df2, on="timestamp", strategy="backward") | • Join on nearest key match • ideal for time-series alignment | |
df1.join(df2, on=["city", "state"], how="inner") | Join on multiple columns simultaneously | |
df1.join(df2, left_on="id", right_on="user_id") | Join when key column names differ | |
df1.join(df2, on="id", validate="m:1") | Assert join cardinality: '1:1', '1:m', 'm:1', 'm:m' | |
df1.join(df2, on="id", nulls_equal=True) | Treat nulls as equal join keys (renamed from join_nulls in 1.24) | |
df1.join(df2, on="id", suffix="_right") | Suffix appended to duplicate column names from right DataFrame | |
pl.concat([df1, df2], how="vertical") | Stack DataFrames row-wise (union, same schema) | |
pl.concat([df1, df2], how="horizontal") | Concatenate DataFrames side-by-side (same row count) |
Table 7: Aggregations and Group By
Group-by and aggregation are where Polars' parallel execution shines most. Multiple aggregations in one .agg() call execute in parallel across groups — always batch your aggregations rather than chaining multiple group_by calls.
| Method | Example | Description |
|---|---|---|
df.group_by("city").agg(pl.col("age").mean()) | Group by one column and apply aggregation | |
df.group_by(["city", "state"]).agg(pl.col("age").mean()) | Group by multiple columns for hierarchical aggregation | |
df.group_by("city").agg(pl.col("age").mean(), pl.col("salary").sum()) | • Multiple aggregations in one pass • always prefer this over chaining | |
df.group_by("city", maintain_order=True).agg(...) | Preserve input row order in results (slight performance cost) | |
df.group_by("city").agg(pl.len().alias("count")) | • Count rows per group • replaces deprecated pl.count() | |
pl.col("salary").sum() | Sum all non-null values | |
pl.col("age").mean() | Arithmetic mean | |
pl.col("age").median() | Median (50th percentile) | |
pl.col("age").min(), pl.col("age").max() | Minimum and maximum values | |
pl.col("age").std(), pl.col("age").var() | Sample standard deviation and variance | |
pl.col("name").first(), pl.col("name").last() | First or last value in group | |
pl.col("city").n_unique() | Count distinct values | |
pl.col("name").implode() | Collect all values in group into a List | |
pl.col("age").quantile(0.75) | Specific quantile (e.g., 75th percentile) | |
pl.col("category").mode() | • Most frequent value • may return multiple values on tie | |
pl.col("price").filter(pl.col("status") == "sold").mean() | Apply filter inside aggregation for conditional statistics |
Table 8: Window Functions
Window functions compute a value for each row using surrounding rows defined by a partition, without collapsing the DataFrame to one row per group. The .over() expression is Polars' equivalent of SQL PARTITION BY.
| Method | Example | Description |
|---|---|---|
pl.col("salary").mean().over("department") | • Compute group aggregate and broadcast back to each row • no collapse | |
pl.col("salary").rank().over(["dept", "year"]) | Partition by multiple columns | |
pl.int_range(pl.len()).over("group").alias("row_num") | Sequential row number within partition using int_range | |
pl.col("score").rank().over("category") | • Rank within partition • handles ties via method parameter | |
pl.col("sales").cum_sum().over("month") | Running cumulative sum within each partition | |
pl.col("*").cum_count().over("group") | Running count of non-null rows within partition | |
pl.col("price").cum_min(), pl.col("price").cum_max() | Running minimum and maximum | |
pl.col("price").shift(1).over("stock") | Access previous row values within partition | |
pl.col("price").shift(-1).over("stock") | Access next row values (negative n = lead) | |
pl.col("price").diff().over("stock") | Difference from previous row value | |
pl.col("price").pct_change().over("stock") | Percentage change from previous row | |
pl.col("price").rolling_mean(window_size=3).over("stock") | Sliding window mean within partition | |
pl.col("value").forward_fill().over("group") | Propagate last valid value forward within partition | |
pl.col("value").backward_fill().over("group") | Propagate next valid value backward within partition |
Table 9: File I/O and Scanning
Polars supports all major file formats with both eager (read_*/write_*) and lazy (scan_*/sink_*) interfaces. Prefer scan_* + sink_* for large-file workflows — they enable full query optimization and streaming I/O without materializing intermediate results in RAM.
| Method | Example | Description |
|---|---|---|
pl.scan_csv("data.csv", separator=";", has_header=True) | Lazy CSV scan with delimiter and header options | |
pl.scan_parquet("data/*.parquet") | Scan multiple Parquet files matching a glob pattern | |
lf = pl.scan_ndjson("data.ndjson") | Lazy scan for newline-delimited JSON | |
lf = pl.scan_ipc("data.feather") | Lazy scan for Apache Arrow IPC/Feather format | |
lf.sink_parquet("out.parquet") | • Stream lazy query results directly to Parquet • no full collect needed | |
lf.sink_csv("out.csv") | Stream lazy query results directly to CSV | |
lf.sink_ndjson("out.ndjson") | Stream lazy query results to newline-delimited JSON | |
lf.sink_ipc("out.feather") | Stream lazy query results to Arrow IPC format | |
lf.sink_parquet(pl.PartitionBy("./out/", key="year"), mkdir=True) | Write hive-partitioned Parquet using the PartitionBy API | |
pl.read_csv("data.csv", null_values=["NA", "NULL", ""]) | Interpret specific strings as null when reading CSV | |
pl.read_csv("data.csv", skip_rows=5) | Skip header rows before parsing | |
pl.read_csv("data.csv", schema_overrides={"age": pl.Int64}) | Override inferred column types when reading | |
df.write_parquet("data.parquet", row_group_size=100_000) | Control row group size for query performance optimization | |
df = pl.read_json("data.json") | Read JSON array file into DataFrame | |
df = pl.read_ndjson("data.ndjson") | Read newline-delimited JSON (streaming-friendly format) | |
df = pl.read_ipc("data.feather") | Read Arrow IPC (Feather v2) format — fastest serialization format | |
df = pl.read_excel("data.xlsx", sheet_name="Sales") | Read Excel into DataFrame using the fast Calamine engine by default | |
df.write_excel("report.xlsx", worksheet="Summary") | Write to Excel with optional multi-sheet and formatting support | |
df = pl.read_database("SELECT * FROM orders", conn) | Read SQL query results from any DBAPI2, SQLAlchemy, or ODBC connection | |
df = pl.read_database_uri("SELECT * FROM orders", "postgresql://user:pw@host/db") | Faster than read_database for large tables via connectorx or ADBC | |
pl.scan_parquet("hf://datasets/username/dataset/**") | Scan Hugging Face datasets directly using hf:// URI |
Table 10: Schema Handling
Polars enforces a strict, known schema at all times. Getting types right at the boundary (reading, casting, or constructing) pays dividends in performance and correctness throughout your pipeline.
| Type | Example | Description |
|---|---|---|
pl.DataFrame(data, schema={"id": pl.Int64, "name": pl.String}) | Explicitly define column names and types at construction | |
pl.read_csv("data.csv", schema_overrides={"age": pl.Int64}) | Override specific inferred types when reading files | |
df.with_columns(pl.col("age").cast(pl.Int32)) | Convert column to a different data type | |
pl.col("age").cast(pl.Int64, strict=False) | Allow cast failures to produce nulls instead of errors | |
df.rename({"old_name": "new_name"}) | Rename one or more columns | |
df.drop("col1", "col2") | Remove specified columns | |
pl.col("name").cast(pl.String) | • UTF-8 string type • pl.Utf8 is a backward-compatible alias | |
pl.Int8, pl.Int16, pl.Int32, pl.Int64, pl.Int128 | • Signed integers • Int128 supports values up to ±1.7e38 | |
pl.UInt8, pl.UInt16, pl.UInt32, pl.UInt64 | • Unsigned integers • UInt8 is ideal for 0–255 encoded categoricals | |
pl.Float32, pl.Float64 | • IEEE 754 floating point • Float16 also available | |
pl.Decimal(precision=9, scale=2) | • Stable 128-bit exact decimal • use for financial data where float rounding is unacceptable | |
pl.col("category").cast(pl.Categorical) | • Dictionary-encoded strings with categories inferred at runtime • efficient for low-cardinality columns | |
pl.col("priority").cast(pl.Enum(["High", "Medium", "Low"])) | • Ordered categorical with a fixed, predefined set of values • more performant than Categorical when categories are known upfront | |
pl.col("nested").struct.field("subfield") | Nested data structure with named fields | |
pl.col("tags").list.len() | Variable-length list of homogeneous values per row | |
pl.Series("a", [[1,2],[3,4]], dtype=pl.Array(pl.Int64, 2)) | • Fixed-length array • more efficient than List when all rows have the same length | |
pl.Date, pl.Datetime, pl.Time, pl.Duration | Date, timestamp, time-of-day, and duration types | |
pl.col("age").null_count() | Count null values in a column |
Table 11: Streaming and Sinks
The new streaming engine (Polars 1.x, 2025) processes queries in morsel-driven batches, enabling larger-than-RAM workloads. The sink_* methods are the fastest way to write streaming results directly to disk without materializing a full DataFrame in memory.
| Method | Example | Description |
|---|---|---|
lf.collect(engine="streaming") | • Execute query in streaming engine • batched execution with automatic CPU fallback | |
lf.sink_parquet("out.parquet") | Stream query results to Parquet without holding full result in memory | |
lf.sink_csv("out.csv") | Stream query results to CSV | |
lf.sink_ndjson("out.ndjson") | Stream query results to newline-delimited JSON | |
lf.sink_ipc("out.feather") | Stream query results to Arrow IPC format | |
lf.sink_parquet(pl.PartitionBy("./out/", key="year"), mkdir=True) | Write hive-partitioned output in a single streaming pass | |
for batch in lf.collect_batches(chunk_size=50_000): process(batch) | • Generator of DataFrames • use only when custom per-batch Python logic is needed | |
pl.Config.set_engine_affinity("streaming") | Set streaming as default engine for all subsequent queries in the session | |
pl.scan_parquet("huge.parquet").filter(...).sink_parquet("filtered.parquet") | End-to-end streaming pipeline: scan → filter → sink without RAM overflow | |
lf.group_by("key").agg(...).collect(engine="streaming") | Group-by aggregation in streaming mode | |
pl.collect_all([lf1, lf2]) | • Collect multiple LazyFrames in one call • shared subplans execute only once |
Table 12: Performance Tuning
Polars is fast by default, but these patterns maximize throughput and minimize memory for production workloads. The most impactful wins are: use lazy API with sinks, avoid intermediate collects, use Categorical for string columns, and target the streaming or GPU engine for bottleneck queries.
| Technique | Example | Description |
|---|---|---|
lf.collect(engine="gpu") | • Execute on NVIDIA GPU via RAPIDS cuDF • install with pip install polars[gpu]• auto-fallback to CPU if unsupported | |
lf.collect(engine=pl.GPUEngine(device=0)) | • Target a specific GPU device • raise_on_fail=True disables CPU fallback | |
pl.scan_parquet("large.parquet").filter(...).sink_parquet("out.parquet") | Use lazy API with sinks for large data — avoids loading entire result into RAM | |
Build full lf.filter().select().group_by()... chain before calling .collect() | • Each .collect() breaks optimization boundaries• chain operations before materializing | |
df.with_columns(pl.col("category").cast(pl.Categorical)) | • Encodes low-cardinality string columns as integers • reduces memory and speeds joins/group_by | |
df.with_columns(pl.col("status").cast(pl.Enum(["open","closed"]))) | More performant than Categorical when the set of values is fixed and known upfront | |
pl.scan_parquet("data.parquet").filter(...).select(...).collect() | Even if the optimizer does pushdown, explicit early filtering makes intent clear | |
pl.scan_csv("data.csv").select("col1", "col2").collect() | • Select only needed columns early • optimizer prunes at source | |
pl.collect_all([lf1, lf2, lf3]) | • Process multiple LazyFrames together • CSE avoids redundant source reads | |
Automatic across all available CPU cores | • Polars uses a Rust thread pool • no configuration needed | |
df = df.rechunk() | Consolidate fragmented Arrow buffers into contiguous memory for better cache performance | |
df.estimated_size(unit="mb") | Estimate RAM footprint of a DataFrame in bytes, KB, MB, or GB | |
lf.sink_parquet(pl.PartitionBy("./data/", key="year")) | Partition output by column for faster filtered reads on subsequent queries | |
pl.enable_string_cache() | • Enable global string cache • needed when joining on Categorical columns from different DataFrames |
Table 13: Interoperability
Polars uses Apache Arrow as its internal memory format, enabling zero-copy data exchange with the broader Arrow ecosystem. Converting to/from pandas adds a copy overhead but is well-supported.
| Method | Example | Description |
|---|---|---|
df = pl.from_pandas(pandas_df) | • Convert pandas DataFrame to Polars via Arrow • copies data | |
pandas_df = df.to_pandas() | • Convert to pandas DataFrame • copies data | |
df = pl.from_arrow(arrow_table) | Create Polars DataFrame from PyArrow Table — zero-copy | |
arrow_table = df.to_arrow() | Convert to PyArrow Table — zero-copy | |
df = pl.from_numpy(np_array, schema=["col1", "col2"]) | Create DataFrame from NumPy array | |
np_array = df.to_numpy() | • Convert DataFrame to NumPy array • copies data | |
df = pl.DataFrame({"a": [1, 2], "b": [3, 4]}) | Create from Python dictionary | |
data = df.to_dict(as_series=False) | Convert to Python dictionary of lists | |
df = pl.from_records([{"a": 1, "b": 2}, {"a": 3, "b": 4}]) | Create from list of row dicts | |
df = pl.read_database("SELECT * FROM tbl", conn) | Read from DBAPI2, SQLAlchemy, ODBC, ADBC, or async drivers | |
pl.scan_parquet("hf://datasets/user/dataset/**") | Lazy scan Hugging Face datasets using the hf:// URI scheme | |
Polars uses Apache Arrow columnar format internally | Zero-copy sharing with DuckDB, Apache Arrow Flight, PyArrow, and cuDF |
Table 14: String Operations
All string operations live under the .str namespace and are vectorized over the full column. They work in all contexts (select, with_columns, filter) and compose with other expressions.
| Method | Example | Description |
|---|---|---|
pl.col("text").str.contains("pattern") | Check if string contains a substring or regex pattern | |
pl.col("text").str.starts_with("prefix"), .str.ends_with("suffix") | Check string prefix or suffix | |
pl.col("name").str.to_uppercase() | Convert string case | |
pl.col("text").str.strip_chars() | Remove leading and trailing whitespace (or specified characters) | |
pl.col("text").str.replace("old", "new") | Replace first occurrence of pattern | |
pl.col("text").str.replace_all("old", "new") | Replace all occurrences of pattern | |
pl.col("text").str.replace_many(["a","b"], ["A","B"]) | Replace multiple patterns in one pass using Aho-Corasick algorithm | |
pl.col("text").str.split(",") | Split string into a List by delimiter | |
pl.col("text").str.extract(r"(\d+)", group_index=1) | Extract first regex match capture group | |
pl.col("text").str.extract_all(r"\d+") | Extract all regex matches as a List | |
pl.col("text").str.len_chars() | • Character count (Unicode-aware) • use len_bytes() for byte length | |
pl.col("text").str.slice(0, 5) | Extract substring by start position and length | |
pl.concat_str([pl.col("first"), pl.col("last")], separator=" ") | Concatenate multiple columns into one string with separator | |
pl.col("date_str").str.strptime(pl.Date, "%Y-%m-%d") | Parse string to date/datetime using a format string | |
pl.col("num_str").str.to_integer(base=10) | Parse numeric strings to integer type |
Table 15: Datetime Operations
All datetime operations live under the .dt namespace. Polars supports Date, Datetime, Time, and Duration types with full timezone support. Calendar-aware arithmetic ("1mo", "1y") correctly handles month-boundary and DST cases.
| Method | Example | Description |
|---|---|---|
pl.col("date_str").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S") | Parse string to Date or Datetime type | |
pl.col("date").dt.strftime("%Y-%m-%d") | Format datetime as string | |
pl.col("date").dt.year(), .dt.month(), .dt.day() | Extract year, month, day components | |
pl.col("ts").dt.hour(), .dt.minute(), .dt.second() | Extract hour, minute, second components | |
pl.col("ts").dt.date(), pl.col("ts").dt.time() | Extract only the date or time part from a datetime | |
pl.col("date").dt.weekday() | Day of week (1=Monday, 7=Sunday) | |
pl.col("date").dt.ordinal_day() | Day of year (1–366) | |
pl.col("ts").dt.truncate("1h") | • Round down to nearest time unit • e.g., "1d", "1h", "15m" | |
pl.col("ts").dt.round("1d") | Round to nearest time unit | |
pl.col("date").dt.offset_by("1mo") | Calendar-aware date arithmetic: add "1mo", "1y", "7d", etc. | |
pl.date_range(pl.date(2024, 1, 1), pl.date(2024, 12, 31), interval="1d") | Generate a series of dates at a regular interval | |
pl.datetime_range(start, end, interval="1h", eager=True) | • Generate datetime series • supports timezone-aware intervals | |
pl.col("end") - pl.col("start") | Subtract two datetimes to get a Duration | |
pl.col("dur").dt.total_seconds() | Convert Duration to total seconds as integer | |
pl.col("ts").dt.timestamp("ms") | Convert datetime to Unix timestamp in specified time unit |
Table 16: List Operations
The list namespace enables vectorized operations over variable-length List columns, which is Polars' primary mechanism for nested data without exploding rows. This is significantly more memory-efficient than unnesting for most aggregation tasks.
| Method | Example | Description |
|---|---|---|
pl.col("tags").list.len() | Count elements in each list | |
pl.col("scores").list.sum(), .list.mean() | Aggregate all elements in each list | |
pl.col("tags").list.get(0) | Get element at index (negative index supported) | |
pl.col("tags").list.first(), .list.last() | Get first or last element of each list | |
pl.col("tags").list.contains("python") | Boolean check if any element equals target value | |
pl.col("scores").list.sort() | Sort each list ascending or descending | |
pl.col("tags").list.unique() | Return unique elements from each list | |
pl.col("items").list.slice(1, 3) | Extract sub-list from each row by offset and length | |
pl.col("items").list.head(2), .list.tail(2) | Take first or last N elements from each list | |
pl.col("list1").list.concat(pl.col("list2")) | Concatenate two list columns element-wise | |
df.explode("tags") | Expand list column into multiple rows (one per element) | |
pl.col("items").list.gather([0, 2]) | Select elements at specified indices from each list | |
pl.col("items").list.sample(n=2) | Randomly sample N elements from each list | |
pl.col("prices").list.eval(pl.element() * 1.1) | Apply an expression to each list as if it were a miniature Series | |
pl.col("a").list.set_union(pl.col("b")) | Set union of two list columns element-wise | |
pl.col("a").list.set_intersection(pl.col("b")) | Set intersection of two list columns element-wise |
Table 17: Null Handling
Polars uses Arrow's null bitmask — nulls are not stored as NaN or sentinel values, but as a separate validity mask. This means null != NaN; NaN is a valid Float64 value while null represents missing data of any type.
| Method | Example | Description |
|---|---|---|
pl.col("a").is_null(), pl.col("a").is_not_null() | Boolean mask for null or non-null values | |
pl.col("a").fill_null(0) | Replace nulls with a literal value | |
pl.col("a").fill_null(strategy="forward") | Strategies: "forward", "backward", "mean", "min", "max", "zero", "one" | |
pl.col("a").fill_null(pl.col("b")) | Fill nulls from another column | |
pl.col("price").forward_fill() | Propagate last valid value forward to fill nulls | |
pl.col("price").backward_fill() | Propagate next valid value backward to fill nulls | |
df.drop_nulls(subset=["col1", "col2"]) | Drop rows containing nulls (in any or specified columns) | |
df.null_count() | Count nulls per column across the full DataFrame | |
pl.col("float_col").fill_nan(0.0) | • Replace NaN values specifically — distinct from fill_null• affects float columns only | |
pl.col("val").is_nan(), pl.col("val").is_not_nan() | • Detect NaN in float columns • NaN is not null | |
pl.coalesce([pl.col("a"), pl.col("b"), pl.lit(0)]) | Return first non-null value across expressions (SQL-style COALESCE) |
Table 18: Rolling and Time-Series Aggregations
Rolling window functions process a sliding window of rows, computing a statistic over each window position. Polars supports both row-count-based windows and time-duration-based windows via the _by variants.
| Method | Example | Description |
|---|---|---|
pl.col("price").rolling_mean(window_size=7) | Moving average over a fixed number of rows | |
pl.col("sales").rolling_sum(window_size=30) | Moving sum over a fixed number of rows | |
pl.col("price").rolling_min(window_size=7), .rolling_max(window_size=7) | Moving minimum and maximum | |
pl.col("price").rolling_std(window_size=10) | Moving standard deviation and variance | |
pl.col("value").rolling_mean_by("timestamp", window_size="2h") | • Time-aware rolling mean • window defined by duration, not row count | |
pl.col("sales").rolling_sum_by("date", window_size="7d") | Time-aware rolling sum over a date/datetime column | |
pl.col("price").ewm_mean(span=20) | • Exponentially weighted moving average • span, alpha, or halflife parameter | |
pl.col("price").ewm_std(span=20) | Exponentially weighted standard deviation | |
df.group_by_dynamic("date", every="1w").agg(pl.col("sales").sum()) | Time-bucket aggregation: group by dynamic time windows | |
pl.col("price").rolling_min_by("timestamp", window_size="1d") | Time-aware rolling minimum and maximum |
Table 19: Pivoting and Reshaping
Reshaping between wide and long formats is a fundamental preprocessing step. Polars' unpivot() (replacing deprecated melt()) converts wide to long, while pivot() goes long to wide — note that pivot() is an eager operation that requires schema knowledge upfront.
| Method | Example | Description |
|---|---|---|
df.unpivot(on=["jan","feb","mar"], index="product") | • Reshape wide to long format • on is the value columns, index is the id columns | |
df.pivot(on="month", index="product", values="sales") | • Reshape long to wide • column values become column headers | |
df.pivot(on="month", index="product", values="sales", aggregate_function="sum") | Aggregate duplicate key combinations during pivot | |
df.explode("tags") | Expand List column to multiple rows — one row per element | |
df.unnest("struct_col") | Flatten Struct column into individual top-level columns | |
pl.concat([df1, df2], how="vertical") | • Stack DataFrames vertically • schemas must match | |
pl.concat([df1, df2], how="horizontal") | Join DataFrames side-by-side by row position | |
pl.concat([df1, df2], how="diagonal") | • Stack DataFrames with schema union • missing columns become null | |
df.transpose(include_header=True, header_name="field") | • Flip rows and columns • all values become the same type | |
df.melt(id_vars=["id"], value_vars=["a","b"]) | Deprecated since Polars 1.0 — use unpivot(on=[...], index=[...]) instead |
Table 20: Advanced Operations
These are powerful but less frequently used operations that enable complex data transformations. map_elements is an escape hatch for Python-level logic but should be avoided in performance-critical paths; prefer native expressions.
| Method | Example | Description |
|---|---|---|
pl.when(pl.col("age") > 18).then(pl.lit("adult")).otherwise(pl.lit("minor")) | • Vectorized conditional logic • SQL CASE WHEN equivalent | |
pl.col("status").replace({"A": "Active", "I": "Inactive"}) | • Map values via dict • cleaner than when/then chains for simple lookups | |
pl.col("code").replace_strict({1: "X", 2: "Y"}, default="other", return_dtype=pl.String) | • Like replace but enforces complete mapping• unmapped values raise error unless default is set | |
lf1.join_where(lf2, pl.col("a") < pl.col("b"), pl.col("c") == pl.col("d")) | • Inequality and mixed-condition joins • use where equi-join is insufficient | |
lf.gather([0, 5, 10]) | • Select rows by integer index array • lazy equivalent of row-position selection | |
pl.col("col").map_elements(lambda x: x * 2, return_dtype=pl.Int64) | • Apply arbitrary Python function per-element • slow, avoid in hot paths | |
pl.col("col").map_batches(lambda s: custom_fn(s)) | • Apply a function to the whole Series at once • faster than map_elements | |
pl.struct(["col1", "col2"]).alias("nested") | Combine multiple columns into a single Struct column | |
pl.col("struct_col").struct.field("fieldname") | Access a named field inside a Struct column | |
df.sample(n=100, seed=42) | Randomly sample N rows | |
df.group_by("grp").map_groups(fn) | • Apply a Python function to each group as a DataFrame • slow but flexible | |
df.unique(subset=["col1", "col2"], keep="first") | • Deduplicate rows by subset • keep can be "first", "last", or "any" |
Table 21: Statistical Functions
Statistical expressions work in any Polars context and compose with group-by, window functions, and filters. describe() is useful for quick EDA; for production pipelines, use individual aggregation expressions to avoid collecting everything into Python.
| Function | Example | Description |
|---|---|---|
df.describe() | Summary statistics (count, mean, std, min, max, percentiles) for all columns | |
pl.pearson_corr("col1", "col2") | Pearson correlation coefficient between two columns | |
pl.spearman_rank_corr("col1", "col2") | • Spearman rank correlation • more robust than Pearson for non-linear data | |
pl.col("age").quantile(0.9, interpolation="linear") | Percentile with interpolation methods: "linear", "nearest", "lower", "higher", "midpoint" | |
pl.col("price").std(ddof=1), .var(ddof=1) | • Standard deviation and variance • ddof=1 for sample, ddof=0 for population | |
pl.col("price").skew() | Measure of asymmetry in distribution | |
pl.col("price").kurtosis() | Measure of tail heaviness (excess kurtosis by default) | |
pl.col("probs").entropy(base=2) | • Shannon entropy • base=2 for bits, base=math.e for nats | |
df["category"].value_counts(sort=True) | Frequency table with optional sort by count | |
pl.col("price").arg_sort() | Return row indices that would sort the column | |
pl.col("price").arg_max(), .arg_min() | Index of the maximum or minimum value | |
pl.col("category").n_unique() | Count of distinct values | |
pl.col("user_id").approx_n_unique() | • Fast approximate distinct count using HyperLogLog • much faster for large columns | |
pl.col("category").mode() | Most frequently occurring value(s) |
Table 22: Column Selectors
The cs module (polars.selectors) provides semantic column selection patterns — far more readable than manual type checks or regex filters on column names. Selectors compose with set operators to build complex selection logic in one expression.
| Selector | Example | Description |
|---|---|---|
df.select(cs.numeric()) | Select all numeric columns (integer and float types) | |
df.select(cs.string()) | Select all String/Utf8 columns | |
df.select(cs.boolean()) | Select all Boolean columns | |
df.select(cs.temporal()) | Select all temporal columns (Date, Datetime, Duration, Time) | |
df.select(cs.by_dtype(pl.Float64, pl.Float32)) | Select columns matching specified dtype(s) | |
df.select(cs.by_name("a", "b", "c")) | Select columns by exact name | |
df.select(cs.starts_with("sales_")) | Select columns whose names start with a prefix | |
df.select(cs.ends_with("_id")) | Select columns whose names end with a suffix | |
df.select(cs.contains("2024")) | Select columns whose names contain a substring | |
df.select(cs.matches(r"^q[1-4]_")) | Select columns whose names match a regex pattern | |
df.select(cs.all()) | Select all columns (equivalent to pl.col("*")) | |
df.select(cs.first()), df.select(cs.last()) | Select only the first or last column | |
df.select(cs.numeric() | cs.boolean()) | • Combine selectors with set union ( |• ) | |
df.select(cs.numeric() & cs.starts_with("q")) | Columns matching both selectors using & | |
df.select(~cs.numeric()) | Select all columns except those matching the selector | |
df.select(cs.numeric() - cs.by_name("id")) | Numeric columns excluding the id column | |
cs.expand_selector(df, cs.numeric()) | Resolve selector to an explicit list of column names |
Table 23: Practical Patterns and Examples
These are complete, idiomatic multi-step patterns that combine Polars primitives for common real-world tasks. Each pattern follows the principle: push all work into a single lazy query, optimize it, and materialize only once.
| Technique | Example | Description |
|---|---|---|
pl.scan_parquet("huge/*.parquet").filter(pl.col("year") == 2024).select("id","sales").sink_parquet("filtered.parquet") | Full scan → filter → project → sink pipeline without loading into RAM | |
pl.scan_parquet("data.parquet").group_by("key").agg(pl.col("val").sum()).collect(engine="gpu") | • Execute aggregation on NVIDIA GPU via cuDF • auto-falls back to CPU on failure | |
pl.collect_all([lf_sales, lf_returns, lf_inventory]) | Execute multiple related queries sharing common subexpressions in one pass | |
df.with_columns(pl.when(pl.col("score") > 90).then("A").when(pl.col("score") > 80).then("B").otherwise("C").alias("grade")) | Multi-branch conditional assignment with chained when/then | |
df.with_columns(pl.col("code").replace({"A": "Active", "D": "Deleted"}).alias("status")) | • Replace lookup with a dict • cleaner and faster than chained when/then | |
(pl.scan_csv("sales.csv").cast({"amount": pl.Float64}).group_by("region").agg(pl.col("amount").sum(), pl.len().alias("n")).sort("amount", descending=True).collect()) | Cast on scan → group_by → multi-agg → sort → collect | |
df.explode("tags").group_by("tags").agg(pl.len().alias("count")).sort("count", descending=True) | Flatten list column to rows, then count frequencies | |
lf.sink_parquet(pl.PartitionBy("./output/", key="year"), mkdir=True) | Write year-partitioned Parquet for faster downstream filtered scans | |
df_result, timing = lf.profile() | Execute and return both result and per-node timing breakdown for optimization | |
arrow = df.to_arrow(); pandas_df = arrow.to_pandas(zero_copy_only=True) | Convert via Arrow to pandas without data copy where possible | |
(orders.join(customers, on="customer_id", how="left").with_columns(pl.col("revenue").fill_null(0).alias("revenue_clean"))) | Left join to enrich then clean nulls from missing matches | |
df_a.join(df_b, how="cross") | • Produce all row combinations • use .filter() immediately after to avoid explosion | |
df.with_columns(((pl.col("val") - pl.col("val").mean().over("group")) / pl.col("val").std().over("group")).alias("z_score")) | Normalize within each partition using over() | |
df = pl.read_database("SELECT a.*, b.cat FROM a JOIN b ON a.id=b.id", conn) | • Push SQL join to the database • retrieve pre-joined result into Polars |