Recurrent Neural Networks (RNNs) are a class of neural networks designed specifically for processing sequential data where order and temporal dependencies matter. Unlike feedforward networks that process inputs independently, RNNs maintain an internal hidden state (memory) that gets updated at each time step, allowing the network to capture patterns across sequences of varying lengths. This architecture excels at tasks like language modeling, machine translation, time series forecasting, and speech recognition. The core challenge that motivated LSTM and GRU variants is the vanishing gradient problem—during backpropagation through time (BPTT), gradients shrink exponentially when propagated backward through many time steps, making it nearly impossible for vanilla RNNs to learn long-range dependencies beyond 5-10 steps. Modern gated architectures (LSTM and GRU) solve this by introducing learnable gates that regulate information flow and maintain stable gradient paths through hundreds of time steps.
What This Cheat Sheet Covers
This topic spans 16 focused tables and 119 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: RNN Core Architecture and Concepts
Start here to build the mental model everything else rests on—how a recurrent cell carries a hidden state forward, how BPTT unrolls that loop to compute gradients, and why those gradients vanish or explode over long sequences. The bottom rows map out the input/output shapes (many-to-one, seq2seq, and friends) that decide which RNN flavor a given task actually needs.
| Concept | Example | Description |
|---|---|---|
h_t = \tanh(W_{hh}h_{t-1} + W_{xh}x_t + b_h) | • Simplest recurrent unit: combines previous hidden state h_{t-1} with current input x_t using tanh activation• suffers from vanishing gradients for long sequences | |
x1 → RNN → h1, x2 → RNN → h2, ... | • Conceptual view of RNN as a chain of identical cells across time steps • enables gradient computation via BPTT | |
Unroll T steps → compute loss → backprop gradients | • Training algorithm that unfolds RNN across time and applies standard backpropagation • gradients flow backward through time steps | |
Gradient magnitude: (\sigma' W)^T \to 0 as T \to \infty | • Gradients shrink exponentially during BPTT when propagated through many time steps • prevents learning long-term dependencies in vanilla RNNs | |
Gradient magnitude: (\sigma' W)^T \to \infty | • Gradients grow exponentially, causing parameter updates to diverge • mitigated by gradient clipping |