Polars Cheat Sheet

Updated 2026-05-28

Polars is a DataFrame library written in Rust and designed for high-performance data processing. Its multi-threaded query engine uses the Apache Arrow columnar memory format, supports both eager and lazy execution, and has offered a production-stable API since Polars 1.0. The new streaming engine (2025–2026) processes larger-than-RAM datasets in batches, optional GPU acceleration via NVIDIA RAPIDS runs queries on CUDA hardware, and Polars Cloud adds distributed execution. This cheat sheet covers basic DataFrame operations, advanced optimization, streaming sinks, GPU execution, and recent type system additions including stable Decimal, Int128, Enum, and Array.

Quick Index: 335 entries · 23 tables

Table 1: Fundamentals

The entry points to Polars — creating DataFrames, reading common file formats, and inspecting data. These methods cover the most frequent first steps in any Polars script or notebook.

Method | Example | Description
Import Polars
import polars as pl
• Standard import convention
• pl is the universal alias
Create DataFrame
df = pl.DataFrame({"col1": [1, 2], "col2": ["a", "b"]})
Create DataFrame from dict, list of dicts, or Series
Read CSV
df = pl.read_csv("data.csv")
Read CSV into an eager DataFrame with full data in memory
Read Parquet
df = pl.read_parquet("data.parquet")
• Read Parquet into memory
• preferred format for performance
LazyFrame
lf = pl.scan_csv("data.csv")
• Create a lazy query plan without loading data
• enables optimization
Collect Lazy
df = lf.collect()
Execute lazy plan and materialize results into a DataFrame
Head
df.head(10)
Return first n rows for quick inspection
Tail
df.tail(10)
Return last n rows
Shape
df.shape
Returns (rows, columns) tuple
Columns
df.columns
List all column names
Schema
df.schema
Mapping of column names to data types
Dtypes
df.dtypes
List data types for all columns in order
Describe
df.describe()
Summary statistics (count, mean, std, min, max, etc.)
Glimpse
df.glimpse()
• One-line-per-column overview of schema and sample values
• useful for wide DataFrames
Write CSV
df.write_csv("output.csv")
Export DataFrame to CSV
Write Parquet
df.write_parquet("output.parquet")
• Export to Parquet with compression
• best format for most data
Value Counts
df["city"].value_counts(sort=True)
Count occurrences of each unique value in a Series
Read Excel
df = pl.read_excel("data.xlsx", sheet_name="Sheet1")
Read Excel file using the fast calamine engine by default
Write Excel
df.write_excel("output.xlsx", worksheet="data")
Write DataFrame to an Excel file (requires xlsxwriter)
Read Database
df = pl.read_database("SELECT * FROM tbl", conn)
Read SQL query results from any database connection
Read Database URI
df = pl.read_database_uri("SELECT * FROM tbl", "postgresql://user:pw@host/db")
Read from database using a URI connection string (connectorx or ADBC)

Table 2: Expressions and Contexts

Expressions are the building blocks of all Polars queries — reusable, composable descriptions of transformations that the engine evaluates in parallel. Understanding the four contexts (select, with_columns, filter, group_by) is the key mental model for writing idiomatic Polars.

Method | Example | Description
pl.col
pl.col("age")
• Reference a column by name
• the foundation of all expressions
select
df.select(pl.col("name"), pl.col("age"))
• Select and/or transform columns
• returns new DataFrame
with_columns
df.with_columns(pl.col("age") + 1)
Add or replace columns while keeping all others
filter
df.filter(pl.col("age") > 25)
Keep rows where boolean expression is true
group_by
df.group_by("city").agg(pl.col("age").mean())
Group rows and apply aggregation expressions
alias
pl.col("age").alias("years")
Rename expression result to a new column name
when / then / otherwise
pl.when(pl.col("age") > 18).then(pl.lit("adult")).otherwise(pl.lit("minor"))
Vectorized conditional — equivalent to SQL CASE WHEN
replace
pl.col("code").replace({1: "a", 2: "b"})
• Map specific values to new values
• cleaner than when/then for lookups
cast
pl.col("age").cast(pl.Float64)
Convert column to a different data type
pl.lit
pl.lit(10)
Create a scalar literal expression for use in arithmetic or conditions
pl.len
df.select(pl.len())
• Count rows in context — equivalent to COUNT(*) in SQL
• replaces pl.count()
pl.all()
df.select(pl.all())
• Select all columns
• often used in aggregation (pl.all().sum())
exclude
df.select(pl.all().exclude("id"))
Select all columns except specified ones
expression chaining
pl.col("name").str.to_uppercase().str.strip_chars()
• Chain multiple transformations
• the engine optimizes the full chain
multiple aggregations
df.select(pl.col("age").mean(), pl.col("salary").sum())
Compute multiple aggregations in one pass
pl.int_range
pl.int_range(0, pl.len(), dtype=pl.UInt32).alias("idx")
• Generate integer sequence
• use with pl.len() for row indices
col by regex
df.select(pl.col("^.*_id$"))
Select columns matching a regex pattern via pl.col()
col by dtype
df.select(pl.col(pl.Int64))
Select all columns of a specific data type

Table 3: Lazy vs Eager Execution

Lazy execution is the idiomatic Polars pattern for any non-trivial query. The engine builds a logical plan, applies rewrites (predicate pushdown, projection pruning, CSE), then executes optimally. The new engine parameter on collect() enables streaming or GPU execution without changing query syntax.

Method | Example | Description
scan_csv (lazy)
lf = pl.scan_csv("data.csv")
• Lazy CSV scan
• does not load data until .collect()
scan_parquet (lazy)
lf = pl.scan_parquet("data.parquet")
Lazy Parquet scan with predicate and projection pushdown
scan_ndjson (lazy)
lf = pl.scan_ndjson("data.ndjson")
Lazy scan for newline-delimited JSON
scan_ipc (lazy)
lf = pl.scan_ipc("data.ipc")
Lazy scan for Apache Arrow IPC/Feather format
lazy() convert
lf = df.lazy()
Convert an eager DataFrame to a LazyFrame
collect()
df = lf.collect()
Execute the lazy plan with all optimizations applied
collect engine=streaming
df = lf.collect(engine="streaming")
• Execute in the new streaming engine
• processes data in batches for larger-than-RAM queries
collect engine=gpu
df = lf.collect(engine="gpu")
• Execute on an NVIDIA GPU via RAPIDS cuDF
• requires pip install polars[gpu]
collect_batches
for batch in lf.collect_batches(chunk_size=50_000): process(batch)
• Streaming generator yielding sub-DataFrames
• use when logic cannot be expressed as a sink
collect_all
r1, r2 = pl.collect_all([lf1, lf2])
Collect multiple LazyFrames with shared CSE optimization
profile()
df, timings = lf.profile()
Execute query and return a timing breakdown per node for performance investigation
explain()
print(lf.explain())
Print the optimized query plan without executing it
show_graph()
lf.show_graph()
Visualize query plan as a graph
head() + collect() (replaces fetch())
df = lf.head(100).collect()
• Execute the query and return only the first n rows
• fetch() is deprecated in Polars 1.x; use this form for quick checks during development
cache()
lf_cached = lf.cache()
Cache expensive intermediate results when a LazyFrame is reused in multiple branches

Table 4: Query Optimization

Polars applies these optimizations automatically on every lazy query. Understanding them helps you write queries that cooperate with the optimizer rather than fighting it. Most optimizations are on by default; you can inspect which applied using explain().

Technique | Example | Description
Predicate Pushdown
lf.filter(pl.col("age") > 25).collect()
Pushes filter conditions down to the source so fewer rows are read
Projection Pushdown
lf.select("name", "age").collect()
Reads only required columns, reducing I/O and memory
Common Subplan Elimination
pl.collect_all([lf.select(pl.col("x").sum()), lf.select(pl.col("x").mean())])
Detects shared subplans across multiple LazyFrames and executes them once
Common Subexpr Elimination
lf.with_columns(expr1, expr2) where expr1 and expr2 share sub-expressions
Caches identical sub-expressions within a single query to avoid recomputation
Slice Pushdown
lf.head(100).collect()
Limits rows read from source when .head(), .tail(), or .slice() is used
Collapse Joins
lf.join(other, ...).filter(...)
Merges a join and adjacent filters into a single faster join operation
Cluster With Columns
Multiple sequential with_columns(...) calls
Combines independent sequential with_columns calls into a single pass
cache() Intermediate
lf_cached = lf.cache()
Manually cache an expensive result that is consumed more than once
engine affinity
pl.Config.set_engine_affinity("streaming")
Globally set default execution engine for all subsequent queries
Rechunk
df.rechunk()
Consolidate fragmented memory chunks for better cache locality after incremental builds

Table 5: Data Selection and Filtering

Core row and column selection patterns. These are the operations used most frequently in any Polars workflow — master them before moving to joins and aggregations.

Method | Example | Description
Filter rows
df.filter(pl.col("age") > 30)
Keep rows matching a boolean expression
AND filter
df.filter((pl.col("age") > 25) & (pl.col("city") == "NYC"))
Combine conditions with & (AND)
OR filter
df.filter((pl.col("age") < 20) | (pl.col("age") > 60))
Combine conditions with | (OR)
is_in
df.filter(pl.col("city").is_in(["NYC", "LA", "SF"]))
Check if values are in a list
is_between
df.filter(pl.col("age").is_between(20, 30))
Check if values fall within a range (inclusive by default)
Sort
df.sort("age", descending=True)
Sort rows by one or more columns
Sort multiple
df.sort(["city", "age"], descending=[False, True])
Sort by multiple columns with per-column order direction
with_row_index
df.with_row_index("idx")
Add a zero-based integer row index column
Sample
df.sample(n=100, seed=42)
Randomly sample rows with optional reproducibility seed
Slice
df.slice(10, 20)
Extract rows with offset and length
Unique
df.unique(subset=["name", "city"])
Remove duplicate rows based on a subset of columns
Drop Nulls
df.drop_nulls(subset=["age"])
Remove rows with null values in specified columns
gather_every
lf.gather_every(n=2, offset=0).collect()
• Take every nth row
• works on both DataFrame and LazyFrame
Select regex
df.select(pl.col("^.*_id$"))
Select columns matching a regex pattern
Select by dtype
df.select(pl.col(pl.Int64))
Select all columns of a specific data type

Table 6: Joins

Polars joins are highly optimized in both the in-memory and streaming engines. The how="full" outer join (formerly "outer") and the new join_where() for inequality predicates cover the most common relational join patterns.

Method | Example | Description
Inner Join
df1.join(df2, on="id", how="inner")
Keep only rows with matching keys in both DataFrames
Left Join
df1.join(df2, on="id", how="left")
Keep all rows from left DataFrame, add matching from right
Right Join
df1.join(df2, on="id", how="right")
Keep all rows from right DataFrame, add matching from left
Full Outer Join
df1.join(df2, on="id", how="full", coalesce=True)
• Keep all rows from both
• nulls where no match (renamed from "outer" in 1.0)
Semi Join
df1.join(df2, on="id", how="semi")
• Keep rows from left that have at least one match in right
• no right columns returned
Anti Join
df1.join(df2, on="id", how="anti")
Keep rows from left that have no match in right
Cross Join
df1.join(df2, how="cross")
Cartesian product — all combinations of rows from both
join_where (inequality)
lf1.join_where(lf2, pl.col("dur") < pl.col("time"))
• Inner join on one or more inequality predicates
• supports <, >, <=, >=
AsOf Join
df1.join_asof(df2, on="timestamp", strategy="backward")
• Join on nearest key match
• ideal for time-series alignment
Join multiple keys
df1.join(df2, on=["city", "state"], how="inner")
Join on multiple columns simultaneously
Join different names
df1.join(df2, left_on="id", right_on="user_id")
Join when key column names differ
Join validate
df1.join(df2, on="id", validate="m:1")
Assert join cardinality: '1:1', '1:m', 'm:1', 'm:m'
nulls_equal
df1.join(df2, on="id", nulls_equal=True)
Treat nulls as equal join keys (renamed from join_nulls in 1.24)
Join suffix
df1.join(df2, on="id", suffix="_right")
Suffix appended to duplicate column names from right DataFrame
Concat vertical
pl.concat([df1, df2], how="vertical")
Stack DataFrames row-wise (union, same schema)
Concat horizontal
pl.concat([df1, df2], how="horizontal")
Concatenate DataFrames side-by-side (same row count)

Table 7: Aggregations and Group By

Group-by and aggregation are where Polars' parallel execution shines most. Multiple aggregations in one .agg() call execute in parallel across groups — always batch your aggregations rather than chaining multiple group_by calls.

Method | Example | Description
group_by single
df.group_by("city").agg(pl.col("age").mean())
Group by one column and apply aggregation
group_by multiple
df.group_by(["city", "state"]).agg(pl.col("age").mean())
Group by multiple columns for hierarchical aggregation
Multiple agg
df.group_by("city").agg(pl.col("age").mean(), pl.col("salary").sum())
• Multiple aggregations in one pass
• always prefer this over chaining
maintain_order
df.group_by("city", maintain_order=True).agg(...)
Preserve input row order in results (slight performance cost)
pl.len (group count)
df.group_by("city").agg(pl.len().alias("count"))
• Count rows per group
• replaces deprecated pl.count()
sum
pl.col("salary").sum()
Sum all non-null values
mean
pl.col("age").mean()
Arithmetic mean
median
pl.col("age").median()
Median (50th percentile)
min / max
pl.col("age").min(), pl.col("age").max()
Minimum and maximum values
std / var
pl.col("age").std(), pl.col("age").var()
Sample standard deviation and variance
first / last
pl.col("name").first(), pl.col("name").last()
First or last value in group
n_unique
pl.col("city").n_unique()
Count distinct values
list agg
pl.col("name").implode()
Collect all values in group into a List
quantile
pl.col("age").quantile(0.75)
Specific quantile (e.g., 75th percentile)
mode
pl.col("category").mode()
• Most frequent value
• may return multiple values on tie
agg filter
pl.col("price").filter(pl.col("status") == "sold").mean()
Apply filter inside aggregation for conditional statistics

Table 8: Window Functions

Window functions compute a value for each row using surrounding rows defined by a partition, without collapsing the DataFrame to one row per group. The .over() expression is Polars' equivalent of SQL PARTITION BY.

Method | Example | Description
over() partition
pl.col("salary").mean().over("department")
• Compute group aggregate and broadcast back to each row
• no collapse
over() multiple keys
pl.col("salary").rank().over(["dept", "year"])
Partition by multiple columns
row number
pl.int_range(pl.len()).over("group").alias("row_num")
Sequential row number within partition using int_range
rank
pl.col("score").rank().over("category")
• Rank within partition
• handles ties via method parameter
cumulative sum
pl.col("sales").cum_sum().over("month")
Running cumulative sum within each partition
cumulative count
pl.col("value").cum_count().over("group")
Running count of non-null values within partition
cumulative min / max
pl.col("price").cum_min(), pl.col("price").cum_max()
Running minimum and maximum
shift (lag)
pl.col("price").shift(1).over("stock")
Access previous row values within partition
shift (lead)
pl.col("price").shift(-1).over("stock")
Access next row values (negative n = lead)
diff
pl.col("price").diff().over("stock")
Difference from previous row value
pct_change
pl.col("price").pct_change().over("stock")
Percentage change from previous row
rolling window
pl.col("price").rolling_mean(window_size=3).over("stock")
Sliding window mean within partition
forward fill
pl.col("value").forward_fill().over("group")
Propagate last valid value forward within partition
backward fill
pl.col("value").backward_fill().over("group")
Propagate next valid value backward within partition

Table 9: File I/O and Scanning

Polars supports all major file formats with both eager (read_*/write_*) and lazy (scan_*/sink_*) interfaces. Prefer scan_* + sink_* for large-file workflows — they enable full query optimization and streaming I/O without materializing intermediate results in RAM.

Method | Example | Description
scan_csv options
pl.scan_csv("data.csv", separator=";", has_header=True)
Lazy CSV scan with delimiter and header options
scan_parquet glob
pl.scan_parquet("data/*.parquet")
Scan multiple Parquet files matching a glob pattern
scan_ndjson
lf = pl.scan_ndjson("data.ndjson")
Lazy scan for newline-delimited JSON
scan_ipc
lf = pl.scan_ipc("data.feather")
Lazy scan for Apache Arrow IPC/Feather format
sink_parquet
lf.sink_parquet("out.parquet")
• Stream lazy query results directly to Parquet
• no full collect needed
sink_csv
lf.sink_csv("out.csv")
Stream lazy query results directly to CSV
sink_ndjson
lf.sink_ndjson("out.ndjson")
Stream lazy query results to newline-delimited JSON
sink_ipc
lf.sink_ipc("out.feather")
Stream lazy query results to Arrow IPC format
PartitionBy
lf.sink_parquet(pl.PartitionBy("./out/", key="year"), mkdir=True)
Write hive-partitioned Parquet using the PartitionBy API
CSV null values
pl.read_csv("data.csv", null_values=["NA", "NULL", ""])
Interpret specific strings as null when reading CSV
CSV skip rows
pl.read_csv("data.csv", skip_rows=5)
Skip header rows before parsing
schema override (read)
pl.read_csv("data.csv", schema_overrides={"age": pl.Int64})
Override inferred column types when reading
Parquet row groups
df.write_parquet("data.parquet", row_group_size=100_000)
Control row group size for query performance optimization
read JSON
df = pl.read_json("data.json")
Read JSON array file into DataFrame
read NDJSON
df = pl.read_ndjson("data.ndjson")
Read newline-delimited JSON (streaming-friendly format)
read IPC / Feather
df = pl.read_ipc("data.feather")
Read Arrow IPC (Feather v2) format — fastest serialization format
read Excel
df = pl.read_excel("data.xlsx", sheet_name="Sales")
Read Excel into DataFrame using the fast Calamine engine by default
write Excel
df.write_excel("report.xlsx", worksheet="Summary")
Write to Excel with optional multi-sheet and formatting support
read database
df = pl.read_database("SELECT * FROM orders", conn)
Read SQL query results from any DBAPI2, SQLAlchemy, or ODBC connection
read database URI
df = pl.read_database_uri("SELECT * FROM orders", "postgresql://user:pw@host/db")
Faster than read_database for large tables via connectorx or ADBC
Hugging Face datasets
pl.scan_parquet("hf://datasets/username/dataset/**")
Scan Hugging Face datasets directly using hf:// URI

Table 10: Schema Handling

Polars enforces a strict, known schema at all times. Getting types right at the boundary (reading, casting, or constructing) pays dividends in performance and correctness throughout your pipeline.

Type | Example | Description
schema dict
pl.DataFrame(data, schema={"id": pl.Int64, "name": pl.String})
Explicitly define column names and types at construction
schema_overrides
pl.read_csv("data.csv", schema_overrides={"age": pl.Int64})
Override specific inferred types when reading files
cast
df.with_columns(pl.col("age").cast(pl.Int32))
Convert column to a different data type
cast strict=False
pl.col("age").cast(pl.Int64, strict=False)
Allow cast failures to produce nulls instead of errors
rename
df.rename({"old_name": "new_name"})
Rename one or more columns
drop
df.drop("col1", "col2")
Remove specified columns
String type (pl.String)
pl.col("name").cast(pl.String)
• UTF-8 string type
• pl.Utf8 is a backward-compatible alias
Integer types
pl.Int8, pl.Int16, pl.Int32, pl.Int64, pl.Int128
• Signed integers
• Int128 supports values up to ±1.7e38
Unsigned integer types
pl.UInt8, pl.UInt16, pl.UInt32, pl.UInt64
• Unsigned integers
• UInt8 is ideal for 0–255 encoded categoricals
Float types
pl.Float32, pl.Float64
• IEEE 754 floating point
• Float16 also available
Decimal type
pl.Decimal(precision=9, scale=2)
• Stable 128-bit exact decimal
• use for financial data where float rounding is unacceptable
Categorical type
pl.col("category").cast(pl.Categorical)
• Dictionary-encoded strings with categories inferred at runtime
• efficient for low-cardinality columns
Enum type
pl.col("priority").cast(pl.Enum(["High", "Medium", "Low"]))
• Ordered categorical with a fixed, predefined set of values
• more performant than Categorical when categories are known upfront
Struct type
pl.col("nested").struct.field("subfield")
Nested data structure with named fields
List type
pl.col("tags").list.len()
Variable-length list of homogeneous values per row
Array type
pl.Series("a", [[1,2],[3,4]], dtype=pl.Array(pl.Int64, 2))
• Fixed-length array
• more efficient than List when all rows have the same length
Temporal types
pl.Date, pl.Datetime, pl.Time, pl.Duration
Date, timestamp, time-of-day, and duration types
null_count
pl.col("age").null_count()
Count null values in a column

Table 11: Streaming and Sinks

The new streaming engine (Polars 1.x, 2025) processes queries in morsel-driven batches, enabling larger-than-RAM workloads. The sink_* methods are the fastest way to write streaming results directly to disk without materializing a full DataFrame in memory.

Method | Example | Description
collect(engine="streaming")
lf.collect(engine="streaming")
• Execute query in streaming engine
• batched execution; operations the streaming engine does not support fall back to the in-memory engine
sink_parquet
lf.sink_parquet("out.parquet")
Stream query results to Parquet without holding full result in memory
sink_csv
lf.sink_csv("out.csv")
Stream query results to CSV
sink_ndjson
lf.sink_ndjson("out.ndjson")
Stream query results to newline-delimited JSON
sink_ipc
lf.sink_ipc("out.feather")
Stream query results to Arrow IPC format
PartitionBy sink
lf.sink_parquet(pl.PartitionBy("./out/", key="year"), mkdir=True)
Write hive-partitioned output in a single streaming pass
collect_batches
for batch in lf.collect_batches(chunk_size=50_000): process(batch)
• Generator of DataFrames
• use only when custom per-batch Python logic is needed
set engine affinity
pl.Config.set_engine_affinity("streaming")
Set streaming as default engine for all subsequent queries in the session
streaming large filter
pl.scan_parquet("huge.parquet").filter(...).sink_parquet("filtered.parquet")
End-to-end streaming pipeline: scan → filter → sink without RAM overflow
streaming group by
lf.group_by("key").agg(...).collect(engine="streaming")
Group-by aggregation in streaming mode
collect_all CSE
pl.collect_all([lf1, lf2])
• Collect multiple LazyFrames in one call
• shared subplans execute only once

Table 12: Performance Tuning

Polars is fast by default, but these patterns maximize throughput and minimize memory for production workloads. The most impactful wins are: use the lazy API with sinks, avoid intermediate collects, use Categorical for low-cardinality string columns, and target the streaming or GPU engine for bottleneck queries.

Technique | Example | Description
GPU engine
lf.collect(engine="gpu")
• Execute on NVIDIA GPU via RAPIDS cuDF
• install with pip install polars[gpu]
• auto-fallback to CPU if unsupported
GPU engine object
lf.collect(engine=pl.GPUEngine(device=0))
• Target a specific GPU device
• raise_on_fail=True disables CPU fallback
Lazy for large data
pl.scan_parquet("large.parquet").filter(...).sink_parquet("out.parquet")
Use lazy API with sinks for large data — avoids loading entire result into RAM
Avoid intermediate collects
Build full lf.filter().select().group_by()... chain before calling .collect()
• Each .collect() is an optimization boundary: the optimizer cannot push filters or projections across it
• chain operations before materializing
Categorical for strings
df.with_columns(pl.col("category").cast(pl.Categorical))
• Encodes low-cardinality string columns as integers
• reduces memory and speeds joins/group_by
Enum for known categories
df.with_columns(pl.col("status").cast(pl.Enum(["open","closed"])))
More performant than Categorical when the set of values is fixed and known upfront
Filter early
pl.scan_parquet("data.parquet").filter(...).select(...).collect()
Even if the optimizer does pushdown, explicit early filtering makes intent clear
Projection early
pl.scan_csv("data.csv").select("col1", "col2").collect()
• Select only needed columns early
• optimizer prunes at source
collect_all
pl.collect_all([lf1, lf2, lf3])
• Process multiple LazyFrames together
• CSE avoids redundant source reads
Parallel execution
Automatic across all available CPU cores
• Polars uses a Rust thread pool
• no configuration needed
rechunk
df = df.rechunk()
Consolidate fragmented Arrow buffers into contiguous memory for better cache performance
estimated_size
df.estimated_size(unit="mb")
Estimate RAM footprint of a DataFrame in bytes, KB, MB, or GB
Partitioned writes
lf.sink_parquet(pl.PartitionBy("./data/", key="year"))
Partition output by column for faster filtered reads on subsequent queries
String cache
pl.enable_string_cache()
• Enable global string cache
• needed when joining on Categorical columns from different DataFrames

Table 13: Interoperability

Polars uses Apache Arrow as its internal memory format, enabling zero-copy data exchange with the broader Arrow ecosystem. Converting to/from pandas adds a copy overhead but is well-supported.

Method | Example | Description
From pandas
df = pl.from_pandas(pandas_df)
• Convert pandas DataFrame to Polars via Arrow
• copies data
To pandas
pandas_df = df.to_pandas()
• Convert to pandas DataFrame
• copies data
From Arrow
df = pl.from_arrow(arrow_table)
Create Polars DataFrame from PyArrow Table — zero-copy
To Arrow
arrow_table = df.to_arrow()
Convert to PyArrow Table — zero-copy
From NumPy
df = pl.from_numpy(np_array, schema=["col1", "col2"])
Create DataFrame from NumPy array
To NumPy
np_array = df.to_numpy()
• Convert DataFrame to NumPy array
• copies data
From dict
df = pl.DataFrame({"a": [1, 2], "b": [3, 4]})
Create from Python dictionary
To dict
data = df.to_dict(as_series=False)
Convert to Python dictionary of lists
From records
df = pl.from_records([{"a": 1, "b": 2}, {"a": 3, "b": 4}])
Create from list of row dicts
Read database
df = pl.read_database("SELECT * FROM tbl", conn)
Read from DBAPI2, SQLAlchemy, ODBC, ADBC, or async drivers
Hugging Face
pl.scan_parquet("hf://datasets/user/dataset/**")
Lazy scan Hugging Face datasets using the hf:// URI scheme
Arrow memory format
Polars uses Apache Arrow columnar format internally
Zero-copy sharing with DuckDB, Apache Arrow Flight, PyArrow, and cuDF

Table 14: String Operations

All string operations live under the .str namespace and are vectorized over the full column. They work in all contexts (select, with_columns, filter) and compose with other expressions.

Method | Example | Description
contains
pl.col("text").str.contains("pattern")
Check if string contains a substring or regex pattern
starts_with / ends_with
pl.col("text").str.starts_with("prefix"), pl.col("text").str.ends_with("suffix")
Check string prefix or suffix
to_uppercase / to_lowercase
pl.col("name").str.to_uppercase()
Convert string case
strip_chars
pl.col("text").str.strip_chars()
Remove leading and trailing whitespace (or specified characters)
replace
pl.col("text").str.replace("old", "new")
Replace first occurrence of pattern
replace_all
pl.col("text").str.replace_all("old", "new")
Replace all occurrences of pattern
replace_many
pl.col("text").str.replace_many(["a","b"], ["A","B"])
Replace multiple patterns in one pass using Aho-Corasick algorithm
split
pl.col("text").str.split(",")
Split string into a List by delimiter
extract
pl.col("text").str.extract(r"(\d+)", group_index=1)
Extract first regex match capture group
extract_all
pl.col("text").str.extract_all(r"\d+")
Extract all regex matches as a List
len_chars
pl.col("text").str.len_chars()
• Character count (Unicode-aware)
• use len_bytes() for byte length
slice
pl.col("text").str.slice(0, 5)
Extract substring by start position and length
concat_str
pl.concat_str([pl.col("first"), pl.col("last")], separator=" ")
Concatenate multiple columns into one string with separator
strptime
pl.col("date_str").str.strptime(pl.Date, "%Y-%m-%d")
Parse string to date/datetime using a format string
to_integer
pl.col("num_str").str.to_integer(base=10)
Parse numeric strings to integer type

Table 15: Datetime Operations

All datetime operations live under the .dt namespace. Polars supports Date, Datetime, Time, and Duration types with full timezone support. Calendar-aware arithmetic ("1mo", "1y") correctly handles month-boundary and DST cases.

Method | Example | Description
strptime (parse)
pl.col("date_str").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S")
Parse string to Date or Datetime type
strftime (format)
pl.col("date").dt.strftime("%Y-%m-%d")
Format datetime as string
date components
pl.col("date").dt.year(), pl.col("date").dt.month(), pl.col("date").dt.day()
Extract year, month, day components
time components
pl.col("ts").dt.hour(), pl.col("ts").dt.minute(), pl.col("ts").dt.second()
Extract hour, minute, second components
date() / time()
pl.col("ts").dt.date(), pl.col("ts").dt.time()
Extract only the date or time part from a datetime
weekday
pl.col("date").dt.weekday()
Day of week (1=Monday, 7=Sunday)
ordinal_day
pl.col("date").dt.ordinal_day()
Day of year (1–366)
truncate
pl.col("ts").dt.truncate("1h")
• Round down to nearest time unit
• e.g., "1d", "1h", "15m"
round
pl.col("ts").dt.round("1d")
Round to nearest time unit
offset_by
pl.col("date").dt.offset_by("1mo")
Calendar-aware date arithmetic: add "1mo", "1y", "7d", etc.
date_range
pl.date_range(pl.date(2024, 1, 1), pl.date(2024, 12, 31), interval="1d")
Generate a series of dates at a regular interval
datetime_range
pl.datetime_range(start, end, interval="1h", eager=True)
• Generate datetime series
• supports timezone-aware intervals
duration subtraction
pl.col("end") - pl.col("start")
Subtract two datetimes to get a Duration
total_seconds
pl.col("dur").dt.total_seconds()
Convert Duration to total seconds as integer
timestamp
pl.col("ts").dt.timestamp("ms")
Convert datetime to Unix timestamp in specified time unit

Table 16: List Operations

The list namespace enables vectorized operations over variable-length List columns, which is Polars' primary mechanism for nested data without exploding rows. This is significantly more memory-efficient than unnesting for most aggregation tasks.

Method | Example | Description
list.len
pl.col("tags").list.len()
Count elements in each list
list.sum / mean / min / max
pl.col("scores").list.sum(), pl.col("scores").list.mean()
Aggregate all elements in each list
list.get
pl.col("tags").list.get(0)
Get element at index (negative index supported)
list.first / last
pl.col("tags").list.first(), pl.col("tags").list.last()
Get first or last element of each list
list.contains
pl.col("tags").list.contains("python")
Boolean check if any element equals target value
list.sort
pl.col("scores").list.sort()
Sort each list ascending or descending
list.unique
pl.col("tags").list.unique()
Return unique elements from each list
list.slice
pl.col("items").list.slice(1, 3)
Extract sub-list from each row by offset and length
list.head / tail
pl.col("items").list.head(2), pl.col("items").list.tail(2)
Take first or last N elements from each list
list.concat
pl.col("list1").list.concat(pl.col("list2"))
Concatenate two list columns element-wise
list.explode
df.explode("tags")
Expand list column into multiple rows (one per element)
list.gather
pl.col("items").list.gather([0, 2])
Select elements at specified indices from each list
list.sample
pl.col("items").list.sample(n=2)
Randomly sample N elements from each list
list.eval
pl.col("prices").list.eval(pl.element() * 1.1)
Apply an expression to each list as if it were a miniature Series
list.set_union
pl.col("a").list.set_union(pl.col("b"))
Set union of two list columns element-wise
list.set_intersection
pl.col("a").list.set_intersection(pl.col("b"))
Set intersection of two list columns element-wise

Table 17: Null Handling

Polars uses Arrow's null bitmask — nulls are not stored as NaN or sentinel values, but as a separate validity mask. This means null != NaN; NaN is a valid Float64 value while null represents missing data of any type.

Method | Example | Description
is_null / is_not_null
pl.col("a").is_null(), pl.col("a").is_not_null()
Boolean mask for null or non-null values
fill_null (literal)
pl.col("a").fill_null(0)
Replace nulls with a literal value
fill_null (strategy)
pl.col("a").fill_null(strategy="forward")
Strategies: "forward", "backward", "mean", "min", "max", "zero", "one"
fill_null (expression)
pl.col("a").fill_null(pl.col("b"))
Fill nulls from another column
forward_fill
pl.col("price").forward_fill()
Propagate last valid value forward to fill nulls
backward_fill
pl.col("price").backward_fill()
Propagate next valid value backward to fill nulls
drop_nulls
df.drop_nulls(subset=["col1", "col2"])
Drop rows containing nulls (in any or specified columns)
null_count
df.null_count()
Count nulls per column across the full DataFrame
fill_nan
pl.col("float_col").fill_nan(0.0)
• Replace NaN values specifically — distinct from fill_null
• affects float columns only
is_nan / is_not_nan
pl.col("val").is_nan(), pl.col("val").is_not_nan()
• Detect NaN in float columns
• NaN is not null
coalesce
pl.coalesce([pl.col("a"), pl.col("b"), pl.lit(0)])
Return first non-null value across expressions (SQL-style COALESCE)

Table 18: Rolling and Time-Series Aggregations

Rolling window functions process a sliding window of rows, computing a statistic over each window position. Polars supports both row-count-based windows and time-duration-based windows via the _by variants.

Method | Example | Description
rolling_mean
pl.col("price").rolling_mean(window_size=7)
Moving average over a fixed number of rows
rolling_sum
pl.col("sales").rolling_sum(window_size=30)
Moving sum over a fixed number of rows
rolling_min / max
pl.col("price").rolling_min(window_size=7), .rolling_max(window_size=7)
Moving minimum and maximum
rolling_std / var
pl.col("price").rolling_std(window_size=10)
Moving standard deviation and variance
rolling_mean_by (time)
pl.col("value").rolling_mean_by("timestamp", window_size="2h")
• Time-aware rolling mean
• window defined by duration, not row count
rolling_sum_by (time)
pl.col("sales").rolling_sum_by("date", window_size="7d")
Time-aware rolling sum over a date/datetime column
ewm_mean
pl.col("price").ewm_mean(span=20)
• Exponentially weighted moving average
• set the decay with exactly one of com, span, half_life, or alpha
ewm_std
pl.col("price").ewm_std(span=20)
Exponentially weighted standard deviation
group_by_dynamic
df.group_by_dynamic("date", every="1w").agg(pl.col("sales").sum())
Time-bucket aggregation: group by dynamic time windows
rolling_min_by / max_by
pl.col("price").rolling_min_by("timestamp", window_size="1d")
Time-aware rolling minimum and maximum

Table 19: Pivoting and Reshaping

Reshaping between wide and long formats is a fundamental preprocessing step. Polars' unpivot() (which replaces the deprecated melt()) converts wide to long, while pivot() goes long to wide. Note that pivot() is listed here as an eager operation: its output column names come from values in the data, so the result schema is not known until the data has been read.

Method | Example | Description
unpivot (wide to long)
df.unpivot(on=["jan","feb","mar"], index="product")
• Reshape wide to long format
• on is the value columns, index is the id columns
pivot (long to wide)
df.pivot(on="month", index="product", values="sales")
• Reshape long to wide
• column values become column headers
pivot aggregate
df.pivot(on="month", index="product", values="sales", aggregate_function="sum")
Aggregate duplicate key combinations during pivot
explode
df.explode("tags")
Expand List column to multiple rows — one row per element
unnest
df.unnest("struct_col")
Flatten Struct column into individual top-level columns
concat vertical
pl.concat([df1, df2], how="vertical")
• Stack DataFrames vertically
• schemas must match
concat horizontal
pl.concat([df1, df2], how="horizontal")
Join DataFrames side-by-side by row position
concat diagonal
pl.concat([df1, df2], how="diagonal")
• Stack DataFrames with schema union
• missing columns become null
transpose
df.transpose(include_header=True, header_name="field")
• Flip rows and columns
• all values become the same type
melt (legacy)
df.melt(id_vars=["id"], value_vars=["a","b"])
Deprecated since Polars 1.0 — use unpivot(on=[...], index=[...]) instead

Table 20: Advanced Operations

These are powerful but less frequently used operations that enable complex data transformations. map_elements is an escape hatch for Python-level logic but should be avoided in performance-critical paths; prefer native expressions.

Method | Example | Description
when / then / otherwise
pl.when(pl.col("age") > 18).then(pl.lit("adult")).otherwise(pl.lit("minor"))
• Vectorized conditional logic
• SQL CASE WHEN equivalent
Expr.replace
pl.col("status").replace({"A": "Active", "I": "Inactive"})
• Map values via dict
• cleaner than when/then chains for simple lookups
Expr.replace_strict
pl.col("code").replace_strict({1: "X", 2: "Y"}, default="other", return_dtype=pl.String)
• Like replace but enforces complete mapping
• unmapped values raise error unless default is set
join_where
lf1.join_where(lf2, pl.col("a") < pl.col("b"), pl.col("c") == pl.col("d"))
• Inequality and mixed-condition joins
• use where equi-join is insufficient
gather (by indices)
lf.select(pl.all().gather([0, 5, 10]))
• Select rows by integer index via Expr.gather
• works in both lazy and eager contexts
map_elements
pl.col("col").map_elements(lambda x: x * 2, return_dtype=pl.Int64)
• Apply arbitrary Python function per-element
• slow, avoid in hot paths
map_batches
pl.col("col").map_batches(lambda s: custom_fn(s))
• Apply a function to the whole Series at once
• faster than map_elements
struct creation
pl.struct(["col1", "col2"]).alias("nested")
Combine multiple columns into a single Struct column
struct field access
pl.col("struct_col").struct.field("fieldname")
Access a named field inside a Struct column
Sample rows
df.sample(n=100, seed=42)
Randomly sample N rows
map_groups (group-wise)
df.group_by("grp").map_groups(fn)
• Apply a Python function to each group as a DataFrame
• slow but flexible
unique combinations
df.unique(subset=["col1", "col2"], keep="first")
• Deduplicate rows by subset
• keep can be "first", "last", "any", or "none" (drop every row that has a duplicate)

Table 21: Statistical Functions

Statistical expressions work in any Polars context and compose with group-by, window functions, and filters. describe() is useful for quick EDA; for production pipelines, use individual aggregation expressions to avoid collecting everything into Python.

Function | Example | Description
describe
df.describe()
Summary statistics (count, null_count, mean, std, min, max, percentiles) for all columns
corr (Pearson)
pl.corr("col1", "col2")
Pearson correlation coefficient between two columns (method="pearson" is the default)
corr (Spearman)
pl.corr("col1", "col2", method="spearman")
• Spearman rank correlation
• captures monotonic relationships that are not linear
quantile
pl.col("age").quantile(0.9, interpolation="linear")
Percentile with interpolation methods: "linear", "nearest", "lower", "higher", "midpoint"
std / var
pl.col("price").std(ddof=1), .var(ddof=1)
• Standard deviation and variance
• ddof=1 for sample, ddof=0 for population
skew
pl.col("price").skew()
Measure of asymmetry in distribution
kurtosis
pl.col("price").kurtosis()
Measure of tail heaviness (excess kurtosis by default)
entropy
pl.col("probs").entropy(base=2)
• Shannon entropy
• base=2 for bits, base=math.e for nats
value_counts
df["category"].value_counts(sort=True)
Frequency table with optional sort by count
arg_sort
pl.col("price").arg_sort()
Return row indices that would sort the column
arg_max / arg_min
pl.col("price").arg_max(), .arg_min()
Index of the maximum or minimum value
n_unique
pl.col("category").n_unique()
Count of distinct values
approx_n_unique
pl.col("user_id").approx_n_unique()
• Fast approximate distinct count using HyperLogLog
• much faster for large columns
mode
pl.col("category").mode()
Most frequently occurring value(s)

Table 22: Column Selectors

The cs module (polars.selectors) provides semantic column selection patterns — far more readable than manual type checks or regex filters on column names. Selectors compose with set operators to build complex selection logic in one expression.

Selector | Example | Description
cs.numeric()
df.select(cs.numeric())
Select all numeric columns (integer and float types)
cs.string()
df.select(cs.string())
Select all String/Utf8 columns
cs.boolean()
df.select(cs.boolean())
Select all Boolean columns
cs.temporal()
df.select(cs.temporal())
Select all temporal columns (Date, Datetime, Duration, Time)
cs.by_dtype()
df.select(cs.by_dtype(pl.Float64, pl.Float32))
Select columns matching specified dtype(s)
cs.by_name()
df.select(cs.by_name("a", "b", "c"))
Select columns by exact name
cs.starts_with()
df.select(cs.starts_with("sales_"))
Select columns whose names start with a prefix
cs.ends_with()
df.select(cs.ends_with("_id"))
Select columns whose names end with a suffix
cs.contains()
df.select(cs.contains("2024"))
Select columns whose names contain a substring
cs.matches()
df.select(cs.matches(r"^q[1-4]_"))
Select columns whose names match a regex pattern
cs.all()
df.select(cs.all())
Select all columns (equivalent to pl.col("*"))
cs.first() / last()
df.select(cs.first()), df.select(cs.last())
Select only the first or last column
Selector union
df.select(cs.numeric() | cs.boolean())
Combine selectors with set union (|)
Selector intersection
df.select(cs.numeric() & cs.starts_with("q"))
Columns matching both selectors using &
Selector negation
df.select(~cs.numeric())
Select all columns except those matching the selector
Selector difference
df.select(cs.numeric() - cs.by_name("id"))
Numeric columns excluding the id column
cs.expand_selector()
cs.expand_selector(df, cs.numeric())
Resolve selector to an explicit tuple of column names

Table 23: Practical Patterns and Examples

These are complete, idiomatic multi-step patterns that combine Polars primitives for common real-world tasks. Each pattern follows the principle: push all work into a single lazy query, optimize it, and materialize only once.

Technique | Example | Description
End-to-end streaming pipeline
pl.scan_parquet("huge/*.parquet").filter(pl.col("year") == 2024).select("id","sales").sink_parquet("filtered.parquet")
Full scan → filter → project → sink pipeline without loading into RAM
GPU-accelerated pipeline
pl.scan_parquet("data.parquet").group_by("key").agg(pl.col("val").sum()).collect(engine="gpu")
• Execute aggregation on NVIDIA GPU via cuDF
• by default falls back to the CPU engine when the query is not supported on GPU
Multi-LazyFrame CSE
pl.collect_all([lf_sales, lf_returns, lf_inventory])
Execute multiple related queries sharing common subexpressions in one pass
Conditional column creation
df.with_columns(pl.when(pl.col("score") > 90).then(pl.lit("A")).when(pl.col("score") > 80).then(pl.lit("B")).otherwise(pl.lit("C")).alias("grade"))
Multi-branch conditional assignment with chained when/then
Dict-based column mapping
df.with_columns(pl.col("code").replace({"A": "Active", "D": "Deleted"}).alias("status"))
• Replace lookup with a dict
• cleaner and faster than chained when/then
Type-safe aggregation pipeline
(pl.scan_csv("sales.csv").cast({"amount": pl.Float64}).group_by("region").agg(pl.col("amount").sum(), pl.len().alias("n")).sort("amount", descending=True).collect())
Cast on scan → group_by → multi-agg → sort → collect
Explode + aggregate list column
df.explode("tags").group_by("tags").agg(pl.len().alias("count")).sort("count", descending=True)
Flatten list column to rows, then count frequencies
Hive-partitioned write
lf.sink_parquet(pl.PartitionBy("./output/", key="year"), mkdir=True)
Write year-partitioned Parquet for faster downstream filtered scans
Profile a query
df_result, timing = lf.profile()
Execute and return both result and per-node timing breakdown for optimization
Pandas interop with zero copy
arrow = df.to_arrow(); pandas_df = arrow.to_pandas(zero_copy_only=True)
Convert via Arrow to pandas; zero_copy_only=True raises an error instead of silently copying when a column cannot be shared without a copy
Join + enrich pattern
(orders.join(customers, on="customer_id", how="left").with_columns(pl.col("revenue").fill_null(0).alias("revenue_clean")))
Left join to enrich then clean nulls from missing matches
Cross join (cartesian)
df_a.join(df_b, how="cross")
• Produce all row combinations
• use .filter() immediately after to avoid explosion
Z-score per group
df.with_columns(((pl.col("val") - pl.col("val").mean().over("group")) / pl.col("val").std().over("group")).alias("z_score"))
Normalize within each partition using over()
Read multiple DB tables
df = pl.read_database("SELECT a.*, b.cat FROM a JOIN b ON a.id=b.id", conn)
• Push SQL join to the database
• retrieve pre-joined result into Polars