Neural Network Attention Mechanisms Cheat Sheet

Updated 2026-05-18

Attention mechanisms enable neural networks to dynamically focus on the most relevant parts of input data by computing weighted combinations based on learned importance scores. Originally introduced for neural machine translation, attention has become the foundational building block of modern transformers, large language models, and vision systems. It lets them capture long-range dependencies and contextual relationships that are hard to model with recurrent or convolutional architectures alone.

What This Cheat Sheet Covers

This topic spans 15 focused tables and 76 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Core Attention Mechanisms
Table 2: Query, Key, and Value Computation
Table 3: Positional Encoding Variants
Table 4: Attention Patterns and Sparsity
Table 5: Efficient Attention Optimizations
Table 6: Attention in Vision Transformers
Table 7: Attention Masking Strategies
Table 8: Attention Weights and Normalization
Table 9: Advanced Attention Variants
Table 10: Attention Pooling and Aggregation
Table 11: Attention Interpretability and Analysis
Table 12: Attention Computational Complexity
Table 13: Attention in LLM Inference
Table 14: Attention Challenges and Phenomena
Table 15: Attention Weight Management

Table 1: Core Attention Mechanisms

The foundational attention approaches that shaped modern deep learning, from the earliest additive mechanisms used in sequence-to-sequence models to the scaled dot-product attention that powers transformers and most current LLMs.

Mechanism	Example	Description
Self-Attention	`Q, K, V = x @ W_q, x @ W_k, x @ W_v` `scores = Q @ K.T / sqrt(d_k)`	• Each token attends to every token in the same sequence, including itself • Queries, keys, and values are all derived from the same input; this is the core mechanism that lets transformers model context
Cross-Attention	`Q = decoder @ W_q` `K, V = encoder @ W_k, encoder @ W_v`	Queries come from one sequence (e.g., decoder), keys and values from another (e.g., encoder) — used in machine translation, image captioning, and multimodal tasks
Scaled Dot-Product Attention	$\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$	• Dot products of queries and keys are divided by $\sqrt{d_k}$ so that large scores do not push the softmax into saturated regions where gradients become very small • Standard attention formula in transformers
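
The three rows above can be tied together in a minimal NumPy sketch. It is an illustration under stated assumptions, not a reference implementation: the helper names `softmax` and `scaled_dot_product_attention`, the random projection matrices, and the toy shapes (5 self-attention tokens, 3 decoder tokens, 6 encoder tokens) are all hypothetical, and batching, multiple heads, and masking are left out.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the per-row maximum before exponentiating for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # weighted combination of values


# Hypothetical toy dimensions and randomly initialised projections.
rng = np.random.default_rng(0)
d_model, d_k = 8, 4
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

# Self-attention: Q, K, and V all come from the same sequence x.
x = rng.normal(size=(5, d_model))
self_out, self_w = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)

# Cross-attention: queries from the decoder, keys and values from the encoder.
decoder = rng.normal(size=(3, d_model))
encoder = rng.normal(size=(6, d_model))
cross_out, cross_w = scaled_dot_product_attention(
    decoder @ W_q, encoder @ W_k, encoder @ W_v
)

print(self_w.shape, cross_w.shape)
```

The only difference between the two calls is where the inputs come from: the self-attention weight matrix is square (one row and one column per token of `x`), while the cross-attention weight matrix has one row per decoder token and one column per encoder token.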
