The Transformer architecture, introduced in 2017 by Vaswani et al., reshaped deep learning by replacing recurrent and convolutional layers with an attention-based mechanism that processes sequences in parallel. Unlike RNNs and LSTMs, which must process tokens one at a time, transformers use self-attention to let every token attend to every other token simultaneously. This removes the sequential bottleneck during training and enables massively parallel training that scales to hundreds of billions of parameters. At its core, the transformer relies on three key concepts: multi-head attention for capturing diverse relationships, positional encodings for sequence order awareness, and residual connections with layer normalization for stable training of deep networks. Understanding transformers is essential not only because they power modern large language models like GPT, BERT, LLaMA, and DeepSeek, but also because their architectural innovations (encoder-decoder structures, efficient attention variants, and hybrid architectures) have influenced nearly every domain in AI, from vision (ViT) to multimodal and agentic systems.
What This Cheat Sheet Covers
This topic spans 20 focused tables and 115 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Core Attention Mechanisms
Self-attention is the engine of every transformer block: it lets each token build a context-aware representation by attending to all other tokens simultaneously. The many variants below trade off memory, speed, and quality differently — choosing the right one is critical for modern LLM design.
| Mechanism | Example | Description |
|---|---|---|
| Self-attention | $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$ | • Allows each position to attend to all positions in the same sequence • Query, key, and value are all derived from the same input • Forms the foundation of every transformer encoder and decoder block |
| Scaled dot-product attention | `score = (Q @ K.T) / sqrt(d_k)`; `attn_weights = softmax(score)` | • Computes attention scores via the dot product of queries and keys, scaled by $\frac{1}{\sqrt{d_k}}$ to prevent gradient saturation in softmax • Scaling keeps the score variance stable as embedding dimensions grow |
| Multi-head attention | 8 heads with $d_{model} = 512$; `head_dim = 512/8 = 64` | • Splits Q, K, V into $h$ parallel attention heads, each of dimension $d_k = d_{model}/h$ • Enables the model to jointly attend to information from different representation subspaces • Head outputs are concatenated and linearly projected |
| Cross-attention | Decoder attends to encoder: `Q=decoder, K=encoder, V=encoder` | • Queries come from one sequence (decoder) while keys and values come from another (encoder) • Allows the decoder to focus on relevant encoder positions • Essential in encoder-decoder architectures for seq2seq tasks |
| Masked (causal) attention | Upper-triangular mask: `[[0,-inf,-inf],[0,0,-inf],[0,0,0]]` | • Prevents positions from attending to subsequent positions by applying a causal mask before softmax • Ensures token $i$ only sees tokens $\leq i$ • Used in decoder-only models (GPT) and in decoder blocks |
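
The rows above can be tied together in a short NumPy sketch. It is illustrative only, not a reference implementation: the function names (`softmax`, `causal_mask`, `attention`, `multi_head_self_attention`), the random projection matrices, and the weight scaling are assumptions made for this example, and production code would normally use a framework's built-in attention layers instead.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability; -inf entries map to 0.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def causal_mask(n):
    # Additive mask: 0 on and below the diagonal, -inf strictly above it.
    allowed = np.tril(np.ones((n, n), dtype=bool))
    return np.where(allowed, 0.0, -np.inf)

def attention(q, k, v, mask=None):
    # q: (..., n_q, d_k), k: (..., n_k, d_k), v: (..., n_k, d_v)
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    if mask is not None:
        scores = scores + mask  # masked positions get zero weight after softmax
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads, mask=None):
    # x: (n, d_model); all projection matrices: (d_model, d_model)
    n, d_model = x.shape
    head_dim = d_model // num_heads

    def split(t):
        # (n, d_model) -> (num_heads, n, head_dim)
        return t.reshape(n, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    out, _ = attention(q, k, v, mask)  # an (n, n) mask broadcasts over heads
    # Concatenate heads: (num_heads, n, head_dim) -> (n, d_model)
    concat = out.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ w_o

# Hypothetical sizes matching the table: d_model = 512, 8 heads, head_dim = 64.
rng = np.random.default_rng(0)
n, d_model, num_heads = 3, 512, 8
x = rng.standard_normal((n, d_model))
w_q, w_k, w_v, w_o = (
    rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)
y = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads, mask=causal_mask(n))

# Cross-attention reuses the same function: queries from the decoder,
# keys and values from the encoder, and no causal mask.
decoder_states = rng.standard_normal((2, 64))
encoder_states = rng.standard_normal((5, 64))
cross_out, cross_weights = attention(decoder_states, encoder_states, encoder_states)
```

The mask is added to the scores rather than multiplied in, so masked positions receive exactly zero weight after softmax while each row still sums to 1.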