Stream Processing Fundamentals Cheat Sheet

Updated 2026-05-28

Stream processing is continuous, real-time computation over unbounded data flows. It lets organizations analyze and act on events as they occur rather than waiting for batch windows. It sits at the intersection of data engineering and real-time systems, powering everything from fraud detection to live dashboards. Understanding stream processing requires mastering the trade-offs between latency and completeness, the semantics of time in distributed systems, and the guarantees your application can make about correctness. The key insight: streaming is batch where the batch never ends. Windowing, watermarks, and stateful aggregation let you impose structure on infinite flows while handling the messiness of real-world event arrival patterns.
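To make the "batch that never ends" idea concrete, here is a minimal pure-Python sketch of event-time tumbling windows closed by a watermark. It is an illustration under stated assumptions, not any engine's API: the function name `tumbling_sums`, the `allowed_lateness` parameter, and the sample events are all hypothetical, and the watermark is simply assumed to trail the largest event time seen so far.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def tumbling_sums(
    events: Iterable[Tuple[int, int]], size: int, allowed_lateness: int
) -> Iterator[Tuple[int, int]]:
    """Sum amounts per tumbling event-time window.

    events: (event_time, amount) pairs in arrival order, possibly out of order
    in event time. A window [start, start + size) is emitted once the
    watermark reaches its end.
    """
    open_windows: Dict[int, int] = defaultdict(int)  # window start -> running sum
    watermark = float("-inf")
    for event_time, amount in events:
        start = (event_time // size) * size
        if start + size <= watermark:
            continue  # window already emitted: this sketch drops the late event
        open_windows[start] += amount
        # Assumed heuristic: watermark trails the max event time seen so far.
        watermark = max(watermark, event_time - allowed_lateness)
        closed = sorted(s for s in open_windows if s + size <= watermark)
        for window_start in closed:
            yield (window_start, open_windows.pop(window_start))
    # A real stream never ends; the flush exists only so the demo terminates.
    for window_start in sorted(open_windows):
        yield (window_start, open_windows[window_start])


# Hypothetical sample: (9, 2) arrives out of order but in time; (3, 100) is too late.
SAMPLE = [(1, 5), (4, 7), (11, 3), (9, 2), (13, 1), (3, 100), (25, 4)]
WINDOWS = list(tumbling_sums(SAMPLE, size=10, allowed_lateness=2))
print(WINDOWS)  # [(0, 14), (10, 4), (20, 4)]
```

The out-of-order event at time 9 still lands in the first window because the watermark has not yet passed that window's end, while the event at time 3 arrives after the window was emitted and is discarded. Real engines offer other choices for such late data, such as side outputs or updated results.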

What This Cheat Sheet Covers

This cheat sheet spans 15 focused tables and 107 indexed concepts. The outline below lists every table, from foundational concepts through advanced details.

Table 1: Core Processing Models
Table 2: Time Semantics and Watermarks
Table 3: Windowing Strategies
Table 4: Delivery Guarantees and Processing Semantics
Table 5: Architectural Patterns
Table 6: Stateful Operations and State Management
Table 7: Late Data Handling
Table 8: Backpressure and Flow Control
Table 9: Join Types and Patterns
Table 10: Aggregation and Transformation Patterns
Table 11: Output Modes and Triggers
Table 12: Stream Processing Frameworks and Tools
Table 13: Operational Patterns and Best Practices
Table 14: Advanced Concepts
Table 15: Complex Event Processing (CEP)

Table 1: Core Processing Models

The choice between batch, micro-batch, and true streaming determines your latency floor, throughput ceiling, and operational complexity. Modern engines like Flink unify bounded and unbounded processing under one API, making the boundary increasingly blurry in practice.

Model	Example	Description
Stream processing	`stream.keyBy(e -> e.userId)` `.window(TumblingEventTimeWindows.of(Duration.ofMinutes(5)))` `.sum("amount")`	• Processes unbounded data continuously as it arrives • optimized for low latency and real-time action over resource efficiency.
Batch processing	`spark.read.parquet("hdfs://data/")` `.groupBy("id").count()`	• Processes bounded datasets with defined start and finish • optimized for high throughput and historical accuracy over low latency.
Micro-batching	`streamingDf.writeStream` `.trigger(Trigger.ProcessingTime("10 seconds"))`	• Collects events into small time-bound batches (e.g., 10s intervals) • balances latency and throughput by processing mini-batches instead of individual records.
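
The three models in the table above differ mainly in when a result becomes visible. The following pure-Python sketch computes the same running sum three ways; it is a toy comparison rather than any engine's implementation. The function names, the `interval` parameter, and the sample `EVENTS` are hypothetical, and events are assumed to arrive in order of their arrival time.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Event = Tuple[float, int]  # (arrival time in seconds, amount)


def batch_total(events: List[Event]) -> int:
    """Batch: the input is bounded, so a single result is produced at the end."""
    return sum(amount for _, amount in events)


def streaming_totals(events: Iterable[Event]) -> Iterator[int]:
    """Streaming: update state and emit a result after every record."""
    total = 0
    for _, amount in events:
        total += amount
        yield total


def micro_batch_totals(events: Iterable[Event], interval: float) -> Iterator[int]:
    """Micro-batch: emit one result per non-empty arrival-time interval."""
    total = 0
    current_bucket: Optional[int] = None
    for arrival, amount in events:
        bucket = int(arrival // interval)
        if current_bucket is not None and bucket != current_bucket:
            yield total  # the previous interval has closed
        current_bucket = bucket
        total += amount
    if current_bucket is not None:
        yield total  # flush the final interval (demo input is finite)


# Hypothetical sample data, ordered by arrival time.
EVENTS: List[Event] = [(1.0, 5), (4.0, 7), (12.0, 3), (19.0, 1), (25.0, 4)]

print(batch_total(EVENTS))                     # 20
print(list(streaming_totals(EVENTS)))          # [5, 12, 15, 16, 20]
print(list(micro_batch_totals(EVENTS, 10.0)))  # [12, 16, 20]
```

All three converge on the same final total. Streaming exposes every intermediate value, micro-batching exposes one value per 10-second interval, and batch exposes only the last one, which is the latency versus throughput trade-off the table describes.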
