Recurrent Neural Networks (RNNs, LSTMs, GRUs) Cheat Sheet

Updated 2026-05-02

Recurrent Neural Networks (RNNs) are a class of neural networks designed for sequential data, where order and temporal dependencies matter. Unlike feedforward networks, which process each input independently, RNNs maintain an internal hidden state (memory) that is updated at each time step, allowing the network to capture patterns across sequences of varying lengths. This architecture is well suited to tasks such as language modeling, machine translation, time series forecasting, and speech recognition. The core challenge that motivated the LSTM and GRU variants is the vanishing gradient problem: during backpropagation through time (BPTT), gradients tend to shrink exponentially as they are propagated backward through many time steps, which makes it very hard for vanilla RNNs to learn dependencies spanning more than roughly 5-10 steps. Gated architectures (LSTM and GRU) mitigate this by introducing learnable gates that regulate information flow and keep gradient paths far more stable, in practice over hundreds of time steps.
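The exponential shrinkage mentioned above can be made concrete with a short derivation. This is a sketch for the standard tanh cell only, and it assumes the matrix norm is the spectral norm (largest singular value); the bound is a sufficient condition for vanishing gradients, not an exact gradient size.

```latex
% Vanilla tanh cell
h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)

% One-step Jacobian (tanh'(a) = 1 - \tanh^2(a), applied elementwise)
\frac{\partial h_t}{\partial h_{t-1}} = \operatorname{diag}\!\left(1 - h_t^2\right) W_{hh}

% Chain rule across T - k steps (ordered product, latest step on the left)
\frac{\partial h_T}{\partial h_k}
  = \prod_{t=k+1}^{T} \operatorname{diag}\!\left(1 - h_t^2\right) W_{hh}

% Since 0 \le 1 - h_t^2 \le 1, each diagonal factor has norm at most 1
\left\lVert \frac{\partial h_T}{\partial h_k} \right\rVert
  \le \lVert W_{hh} \rVert^{\,T-k}
```

If the spectral norm of W_{hh} is below 1, the right-hand side decays exponentially in the gap T - k, so the loss at step T has almost no influence on what the network does at a much earlier step k. A norm above 1 removes this guarantee and allows, but does not force, exploding gradients.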

What This Cheat Sheet Covers

This topic spans 16 focused tables and 119 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: RNN Core Architecture and Concepts
Table 2: LSTM Architecture and Gates
Table 3: GRU (Gated Recurrent Unit) Architecture
Table 4: Bidirectional and Stacked RNN Architectures
Table 5: Sequence-to-Sequence and Attention Mechanisms
Table 6: Training Techniques and Regularization
Table 7: Sequence Processing and Data Handling
Table 8: Time Series Applications
Table 9: Natural Language Processing Applications
Table 10: Computer Vision and Multimodal Applications
Table 11: Speech and Audio Applications
Table 12: Advanced Architectures and Hybrid Models
Table 13: Decoding and Inference Strategies
Table 14: Evaluation Metrics
Table 15: Common Pitfalls and Best Practices
Table 16: RNNs vs Transformers Comparison

Table 1: RNN Core Architecture and Concepts

Start here to build the mental model everything else rests on—how a recurrent cell carries a hidden state forward, how BPTT unrolls that loop to compute gradients, and why those gradients vanish or explode over long sequences. The bottom rows map out the input/output shapes (many-to-one, seq2seq, and friends) that decide which RNN flavor a given task actually needs.
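As a minimal sketch of how a recurrent cell carries a hidden state forward, the following pure-Python code applies the vanilla tanh update to two sequences of different lengths. The function names (`rnn_step`, `rnn_forward`), the dimensions, and the random weights are illustrative assumptions, not part of any library.

```python
import math
import random

def rnn_step(h_prev, x_t, W_hh, W_xh, b_h):
    """One vanilla RNN update: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)."""
    h_t = []
    for i in range(len(b_h)):
        pre = b_h[i]
        pre += sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        pre += sum(W_xh[i][k] * x_t[k] for k in range(len(x_t)))
        h_t.append(math.tanh(pre))
    return h_t

def rnn_forward(xs, W_hh, W_xh, b_h):
    """Run the cell over a whole sequence, starting from a zero hidden state."""
    h = [0.0] * len(b_h)
    states = []
    for x_t in xs:  # the same weights are reused at every time step
        h = rnn_step(h, x_t, W_hh, W_xh, b_h)
        states.append(h)
    return states

# Hypothetical sizes and random weights, for illustration only.
random.seed(0)
input_dim, hidden_dim = 3, 4
W_hh = [[random.uniform(-0.5, 0.5) for _ in range(hidden_dim)] for _ in range(hidden_dim)]
W_xh = [[random.uniform(-0.5, 0.5) for _ in range(input_dim)] for _ in range(hidden_dim)]
b_h = [0.0] * hidden_dim

short_seq = [[random.gauss(0, 1) for _ in range(input_dim)] for _ in range(5)]
long_seq = [[random.gauss(0, 1) for _ in range(input_dim)] for _ in range(12)]

states_short = rnn_forward(short_seq, W_hh, W_xh, b_h)
states_long = rnn_forward(long_seq, W_hh, W_xh, b_h)
print(len(states_short), len(states_long))  # 5 12
```

Because the loop only ever looks at the previous hidden state and the current input, the same parameters handle any sequence length, and the state at step t depends only on inputs up to step t.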

| Concept | Example | Description |
|---|---|---|
| Vanilla RNN cell | $h_t = \tanh(W_{hh}h_{t-1} + W_{xh}x_t + b_h)$ | Simplest recurrent unit: combines the previous hidden state $h_{t-1}$ with the current input $x_t$ through a tanh activation. Suffers from vanishing gradients on long sequences. |
| Hidden state | $h_t = f(h_{t-1}, x_t)$ | Internal memory vector passed between time steps. Encodes information about the sequence history up to the current position. |
| Unrolling through time | x1 → RNN → h1, x2 → RNN → h2, ... | Conceptual view of an RNN as a chain of identical cells, one per time step. Enables gradient computation via BPTT. |
| Backpropagation Through Time (BPTT) | Unroll T steps → compute loss → backprop gradients | Training algorithm that unfolds the RNN across time and applies standard backpropagation. Gradients flow backward through the time steps. |
| Vanishing gradient problem | Gradient magnitude: $\lVert \sigma' W \rVert^T \to 0$ as $T \to \infty$ when $\lVert \sigma' W \rVert < 1$ | Gradients shrink exponentially during BPTT when propagated through many time steps ($T$ is the number of steps, not a transpose). Prevents vanilla RNNs from learning long-term dependencies. |
| Exploding gradient problem | Gradient magnitude: $\lVert \sigma' W \rVert^T \to \infty$ as $T \to \infty$ when $\lVert \sigma' W \rVert > 1$ | Gradients grow exponentially, causing parameter updates to diverge. Mitigated by gradient clipping. |
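The vanishing and exploding rows can be illustrated with a small numeric sketch. It assumes a scalar linear recurrence h_t = w * h_{t-1}, so the activation derivative is taken as 1 and the backpropagated factor is simply |w|^T. The helper names (`backprop_factor`, `clip_by_norm`) are hypothetical, and rescaling by L2 norm is one common form of gradient clipping rather than the only one.

```python
import math

def backprop_factor(w, T):
    """|d h_T / d h_0| for the linear recurrence h_t = w * h_{t-1}."""
    return abs(w) ** T

def clip_by_norm(grads, max_norm):
    """Rescale a gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

# A weight slightly below 1 vanishes, one slightly above 1 explodes.
for w in (0.9, 1.1):
    factors = [backprop_factor(w, T) for T in (10, 50, 100)]
    print(w, ["%.3g" % f for f in factors])

# Clipping keeps the direction of an exploded gradient but bounds its size.
exploded = [backprop_factor(1.1, 100), -backprop_factor(1.1, 100)]
clipped = clip_by_norm(exploded, max_norm=5.0)
clipped_norm = math.sqrt(sum(g * g for g in clipped))
print(clipped_norm)
```

Clipping addresses only the exploding case: it caps the update size, but it cannot restore a gradient that has already shrunk toward zero, which is why the vanishing case is handled architecturally by gated cells.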
