Transformer Architecture Cheat Sheet

Updated 2026-03-17

The transformer architecture, introduced in 2017 by Vaswani et al., reshaped deep learning by replacing recurrent and convolutional layers with an attention-based mechanism that processes sequences in parallel. Unlike RNNs and LSTMs, transformers do not step through a sequence one token at a time: self-attention lets every token attend to every other token in a single operation, so training parallelizes across positions and scales to models with billions of parameters. At its core, the transformer relies on three key concepts: multi-head attention for capturing diverse relationships, positional encodings for sequence order awareness, and residual connections with layer normalization for stable training of deep networks. Understanding transformers is essential not only because they power modern large language models like GPT, BERT, and T5, but also because their architectural innovations, from encoder-decoder structures to efficient attention variants, have influenced nearly every domain in AI, from vision (ViT) to multimodal systems.

What This Cheat Sheet Covers

This topic spans 15 focused tables and 79 indexed concepts. The outline below lists every table, from foundational concepts through advanced details.

Table 1: Core Attention Mechanisms
Table 2: Query, Key, Value Computations
Table 3: Positional Encoding Techniques
Table 4: Layer Normalization and Residual Connections
Table 5: Feed-Forward Network
Table 6: Encoder and Decoder Architectures
Table 7: Transformer Variants and Models
Table 8: Model Hyperparameters
Table 9: Training Techniques
Table 10: Optimization Algorithms
Table 11: Attention Complexity and Efficiency
Table 12: Token Embeddings and Vocabulary
Table 13: Inference and Decoding Strategies
Table 14: Training Paradigms and Objectives
Table 15: Advanced Architectural Concepts

Table 1: Core Attention Mechanisms

Mechanism	Example	Description
Self-Attention	`Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V`	• Allows each position in a sequence to attend to all positions in the same sequence • queries, keys, and values are all derived from the same input • forms the foundation of transformer encoder and decoder blocks.
Scaled Dot-Product Attention	`score = (Q @ K.T) / sqrt(d_k)` `attn_weights = softmax(score)`	• Computes attention scores as the dot product of queries and keys, scaled by $\frac{1}{\sqrt{d_k}}$ so that large scores do not push softmax into saturated regions where gradients vanish • scaling keeps the variance of the scores stable as the key dimension $d_k$ grows.
Multi-Head Attention	8 heads with $d_{model}=512$ `head_dim = 512/8 = 64`	• Splits queries, keys, and values into $h$ parallel attention heads, each operating on dimension $d_k = d_v = d_{model}/h$ • enables the model to jointly attend to information from different representation subspaces • head outputs are concatenated and linearly projected.
Cross-Attention	Decoder attends to encoder: `Q=decoder, K=encoder, V=encoder`	• Queries come from one sequence (decoder) while keys and values come from another sequence (encoder) • allows the decoder to focus on relevant parts of the encoder output • essential in encoder-decoder architectures for seq2seq tasks.
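
The four mechanisms in Table 1 can be sketched together in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names, the random weight matrices, and the sequence lengths (5 decoder positions, 7 encoder positions) are assumptions chosen for the example. Causal masking, biases, and dropout are omitted, and the same projection weights are reused for the self-attention and cross-attention calls only to keep the sketch short.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V over the last two axes."""
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights


def split_heads(x, num_heads):
    """(seq_len, d_model) -> (num_heads, seq_len, d_model // num_heads)."""
    seq_len, d_model = x.shape
    head_dim = d_model // num_heads
    return x.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)


def multi_head_attention(x_q, x_kv, W_q, W_k, W_v, W_o, num_heads):
    """Self-attention when x_q and x_kv are the same sequence,
    cross-attention when they differ."""
    Q = split_heads(x_q @ W_q, num_heads)
    K = split_heads(x_kv @ W_k, num_heads)
    V = split_heads(x_kv @ W_v, num_heads)
    heads, weights = scaled_dot_product_attention(Q, K, V)
    # Concatenate heads: (num_heads, seq_q, head_dim) -> (seq_q, d_model)
    concat = heads.transpose(1, 0, 2).reshape(x_q.shape[0], -1)
    return concat @ W_o, weights


# Hypothetical sizes matching the table's example: d_model=512, 8 heads of 64.
rng = np.random.default_rng(0)
d_model, num_heads = 512, 8
W_q, W_k, W_v, W_o = (
    rng.normal(scale=d_model ** -0.5, size=(d_model, d_model)) for _ in range(4)
)

decoder_states = rng.normal(size=(5, d_model))  # 5 target positions (assumed)
encoder_states = rng.normal(size=(7, d_model))  # 7 source positions (assumed)

# Self-attention: Q, K, V all come from the same sequence.
self_out, self_w = multi_head_attention(
    decoder_states, decoder_states, W_q, W_k, W_v, W_o, num_heads
)
# Cross-attention: Q from the decoder, K and V from the encoder.
cross_out, cross_w = multi_head_attention(
    decoder_states, encoder_states, W_q, W_k, W_v, W_o, num_heads
)
```

In this sketch the output always has one row per query position, while the attention weights have one column per key position, so `cross_w` has shape `(8, 5, 7)` and each of its rows sums to 1.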
