Attention mechanisms enable neural networks to dynamically focus on the most relevant parts of input data by computing weighted combinations based on learned importance scores. Originally introduced for neural machine translation, attention has become the foundational building block powering modern transformers, large language models, and vision systems β enabling them to capture long-range dependencies and contextual relationships that were previously intractable with recurrent or convolutional architectures alone.
What This Cheat Sheet Covers
This topic spans 15 focused tables and 76 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Core Attention Mechanisms
The foundational attention approaches that shaped modern deep learning β from the earliest additive mechanisms used in sequence-to-sequence models to the scaled dot-product attention that powers transformers and every state-of-the-art LLM deployed today.
| Mechanism | Example | Description |
|---|---|---|
Q, K, V = x @ W_q, x @ W_k, x @ W_vscores = Q @ K.T / sqrt(d_k) | Each token attends to all other tokens in the same sequence; queries, keys, and values all derived from the same input β the core mechanism enabling transformers to model context | |
Q = decoder @ W_qK, V = encoder @ W_k, encoder @ W_v | Queries come from one sequence (e.g., decoder), keys and values from another (e.g., encoder) β used in machine translation, image captioning, and multimodal tasks | |
Attention(Q,K,V) = softmax(\frac{QK^T}{\sqrt{d_k}}) V | Dot product of queries and keys divided by \sqrt{d_k} to prevent gradient saturation; standard attention formula in transformers |