Apache Pinot Real-Time OLAP Cheat Sheet

Apache Pinot is an open-source distributed OLAP database built for sub-second analytical queries on fresh, large-scale data — originally created at LinkedIn and now used at Uber, Stripe, and hundreds of other organizations. Unlike batch-oriented data warehouses, Pinot is engineered around the constraint that user-facing queries must return within tens of milliseconds even at 100,000+ QPS, ingesting from Kafka or other streams with seconds of latency. The key mental model: Pinot trades write flexibility for read performance — its rich indexing layer (star-tree, inverted, range, geospatial, vector) is selected at table-design time and baked into immutable columnar segments, so query time is bounded by index lookups rather than full scans.

What This Cheat Sheet Covers

This cheat sheet spans 15 focused tables and 99 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.

Table 1: Table Types
Table 2: Core Architecture — Cluster Components
Table 3: Schema Design — Field Categories and Data Types
Table 4: Segment Lifecycle — Generation, Flush, and Pushing
Table 5: Real-Time Ingestion — Kafka and Stream Sources
Table 6: Indexing — Index Types and When to Use Them
Table 7: Star-Tree Index — Concepts and Configuration
Table 8: Upsert Tables
Table 9: Hybrid Table — Time Boundary and Query Routing
Table 10: Multi-Stage Query Engine (MSE)
Table 11: Tenants and Multi-Tenancy
Table 12: Deep Store Options
Table 13: Segment Assignment and Routing Strategies
Table 14: Pinot vs. Druid vs. Trino — OLAP Comparison
Table 15: Production Operations and Performance Tuning

Table 1: Table Types

Pinot's fundamental storage abstraction is the table, which can be offline (batch), real-time (streaming), or hybrid (both). Understanding which type to choose — and how hybrid tables stitch the two together with a time boundary — is the starting point for every Pinot deployment.
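As a rough illustration of how a hybrid table stitches its two halves together, the sketch below models the time boundary in Python. It assumes the commonly documented behaviour that a query against a hybrid table is split into an offline sub-query restricted to time values at or before the boundary and a real-time sub-query restricted to values after it, so no row is counted twice. The names `SubQuery`, `split_hybrid_query`, and `serving_side`, along with the `events` table and `ts` column, are hypothetical and not part of Pinot's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SubQuery:
    physical_table: str  # e.g. "events_OFFLINE" or "events_REALTIME"
    time_filter: str     # predicate added to the user's WHERE clause


def split_hybrid_query(table: str, time_column: str, boundary: int) -> List[SubQuery]:
    """Model how one logical query fans out to both halves of a hybrid table.

    Assumption: the offline side answers for time <= boundary and the
    real-time side answers for time > boundary.
    """
    return [
        SubQuery(f"{table}_OFFLINE", f"{time_column} <= {boundary}"),
        SubQuery(f"{table}_REALTIME", f"{time_column} > {boundary}"),
    ]


def serving_side(event_time: int, boundary: int) -> str:
    """Return which half of the hybrid table serves a row with this timestamp."""
    return "OFFLINE" if event_time <= boundary else "REALTIME"


# Hypothetical boundary in epoch milliseconds.
boundary_ms = 1_700_000_000_000
for sub in split_hybrid_query("events", "ts", boundary_ms):
    print(sub.physical_table, "WHERE", sub.time_filter)
```

Because the two predicates are complementary, overlapping data in the offline and real-time tables does not produce duplicate results in this model.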

Type	Example	Description
Real-time table	`"tableType": "REALTIME"`	• Ingests data from a stream (Kafka, Pulsar, Kinesis) • Builds segments in memory from consumed messages, then periodically flushes them to disk as completed segments
Offline table	`"tableType": "OFFLINE"`	• Loads pre-built segments pushed from external batch processes (Spark, Hadoop, CLI) • No streaming consumer; suited for historical data with long retention
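To make the `tableType` setting concrete, here is a minimal Python sketch that assembles skeleton table configs for both types. It is an illustration under assumptions, not a complete config: the helper `minimal_table_config`, the `events` table, the `ts` time column, and the broker address are hypothetical. Placing `streamConfigs` under `tableIndexConfig` follows one commonly seen layout, and a working real-time table needs further stream keys (consumer factory, decoder, flush thresholds), so check the result against the documentation for your Pinot version.

```python
import json
from typing import Dict, Optional


def minimal_table_config(
    name: str,
    table_type: str,
    time_column: str,
    stream_configs: Optional[Dict[str, str]] = None,
) -> dict:
    """Build a skeleton table config (hypothetical helper, not a Pinot API)."""
    if table_type not in ("OFFLINE", "REALTIME"):
        raise ValueError(f"unsupported tableType: {table_type}")
    if table_type == "REALTIME" and not stream_configs:
        raise ValueError("a REALTIME table needs stream settings")

    config = {
        "tableName": name,
        "tableType": table_type,
        "segmentsConfig": {"timeColumnName": time_column, "replication": "1"},
        "tenants": {},
        "tableIndexConfig": {"loadMode": "MMAP"},
        "metadata": {},
    }
    if table_type == "REALTIME":
        # Assumed location; newer configs may keep this under ingestionConfig.
        config["tableIndexConfig"]["streamConfigs"] = dict(stream_configs)
    return config


offline = minimal_table_config("events", "OFFLINE", "ts")
realtime = minimal_table_config(
    "events",
    "REALTIME",
    "ts",
    stream_configs={
        "streamType": "kafka",
        "stream.kafka.topic.name": "events",
        "stream.kafka.broker.list": "localhost:9092",  # placeholder address
    },
)
print(json.dumps(realtime, indent=2))
```

Giving both configs the same `tableName` is what makes the pair a hybrid table: the offline and real-time halves are registered separately but queried under one logical name.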
