Apache Pinot is an open-source distributed OLAP database built for sub-second analytical queries on fresh, large-scale data. It was originally created at LinkedIn and is now used at Uber, Stripe, and hundreds of other organizations. Unlike batch-oriented data warehouses, Pinot is engineered around the constraint that user-facing queries must return within tens of milliseconds even at 100,000+ QPS, while ingesting from Kafka or other streams with seconds of latency. The key mental model: Pinot trades write flexibility for read performance. Its rich indexing layer (star-tree, inverted, range, geospatial, vector) is selected at table-design time and baked into immutable columnar segments, so query time is bounded by index lookups rather than full scans.
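To make the "index lookups rather than full scans" idea concrete, here is a minimal Python sketch of an inverted index over one immutable column. It is a toy model, not Pinot's actual implementation: the `country` column and every function name are hypothetical, and real segments use compressed bitmaps rather than Python lists.

```python
from collections import defaultdict

# Toy "segment": one immutable column of values, addressed by document id.
country = ["US", "IN", "US", "DE", "IN", "US"]

def build_inverted_index(column):
    """Map each distinct value to the ascending list of doc ids holding it."""
    index = defaultdict(list)
    for doc_id, value in enumerate(column):
        index[value].append(doc_id)
    return dict(index)

def scan_filter(column, wanted):
    """Full scan: compares every row on every query."""
    return [doc_id for doc_id, value in enumerate(column) if value == wanted]

def index_filter(index, wanted):
    """Index lookup: one dictionary probe, regardless of row count."""
    return index.get(wanted, [])

# Built once, when the segment is created; the segment does not change afterwards.
country_index = build_inverted_index(country)

print(scan_filter(country, "US"))         # [0, 2, 5]
print(index_filter(country_index, "US"))  # [0, 2, 5]
```

Both filters return the same document ids. The difference is where the work happens: the index pays its cost once at segment build time, which is the write-flexibility-for-read-performance trade described above.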
What This Cheat Sheet Covers
This cheat sheet spans 15 focused tables and 99 indexed concepts. Below is a table-by-table outline, from foundational concepts through advanced details.
Table 1: Table Types
Pinot's fundamental storage abstraction is the table, which can be offline (batch), real-time (streaming), or hybrid (both). Understanding which type to choose, and how hybrid tables stitch the two together with a time boundary, is the starting point for every Pinot deployment.
| Type | Example | Description |
|---|---|---|
| Real-time | `"tableType": "REALTIME"` | • Ingests data from a stream (Kafka, Pulsar, Kinesis) • Builds segments in memory from consumed messages, then periodically flushes them to disk as completed segments |
| Offline | `"tableType": "OFFLINE"` | • Loads pre-built segments pushed from external batch processes (Spark, Hadoop, CLI) • No streaming consumer; suited for historical data with long retention |
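As a rough illustration of how the two table types combine into a hybrid table, the Python sketch below builds a minimal config for each type under the same table name, then splits a query's time filter at a boundary. The table name `events`, the column `ts_days`, and the helper `split_hybrid_filter` are hypothetical. The boundary rule (newest offline end time minus one time unit) is a simplified assumption, and the exact comparison operators and config layout should be checked against the Pinot version in use.

```python
import json

# Hypothetical table and column names, for illustration only.
TABLE = "events"
TIME_COLUMN = "ts_days"

# An offline and a real-time table sharing one name form a hybrid table.
offline_config = {
    "tableName": TABLE,
    "tableType": "OFFLINE",
    "segmentsConfig": {"timeColumnName": TIME_COLUMN, "replication": "1"},
}

realtime_config = {
    "tableName": TABLE,
    "tableType": "REALTIME",
    "segmentsConfig": {"timeColumnName": TIME_COLUMN, "replication": "1"},
    # Stream settings are omitted: their location and keys vary by version.
}

def split_hybrid_filter(time_column, max_offline_end_time, buffer_units=1):
    """Return (offline_filter, realtime_filter) for one hybrid query.

    Assumption: the boundary sits one time unit before the newest offline
    data, so a possibly incomplete latest push is answered from real-time.
    """
    boundary = max_offline_end_time - buffer_units
    return (f"{time_column} <= {boundary}", f"{time_column} > {boundary}")

offline_filter, realtime_filter = split_hybrid_filter(TIME_COLUMN, 19_000)

print(json.dumps(offline_config, indent=2))
print(offline_filter)   # ts_days <= 18999
print(realtime_filter)  # ts_days > 18999
```

Because the two filters are disjoint and together cover the whole time range, each row is counted exactly once even when both tables hold overlapping data.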