The Transformer architecture, introduced in 2017 by Vaswani et al., revolutionized deep learning by replacing recurrent and convolutional layers with an attention-based mechanism that processes sequences in parallel. Unlike RNNs and LSTMs, transformers avoid step-by-step recurrence: self-attention lets every token attend to every other token simultaneously, which enables massively parallel training that scales to billions of parameters. At its core, the transformer relies on three key concepts: multi-head attention for capturing diverse relationships, positional encodings for sequence-order awareness, and residual connections with layer normalization for stable training of deep networks. Understanding transformers is essential not only because they power modern large language models like GPT, BERT, and T5, but also because their architectural innovations, from encoder-decoder structures to efficient attention variants, have influenced nearly every domain in AI, from vision (ViT) to multimodal systems.
What This Cheat Sheet Covers
This topic spans 15 focused tables and 79 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Core Attention Mechanisms
| Mechanism | Example | Description |
|---|---|---|
| Self-Attention | `Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V` | • Allows each position in a sequence to attend to all positions in the same sequence • Query, key, and value are all derived from the same input • Forms the foundation of transformer encoder and decoder blocks |
| Scaled Dot-Product Attention | `score = (Q @ K.T) / sqrt(d_k)`; `attn_weights = softmax(score)` | • Computes attention scores as the dot product of queries and keys, scaled by $\frac{1}{\sqrt{d_k}}$ so that large dot products do not push the softmax into saturated regions with very small gradients • Scaling keeps the variance of the scores stable as the embedding dimension grows |
| Multi-Head Attention | 8 heads with $d_{\text{model}} = 512$; `head_dim = 512 / 8 = 64` | • Splits queries, keys, and values into $h$ parallel attention heads, each operating on dimension $d_k = d_v = d_{\text{model}} / h$ • Enables the model to jointly attend to information from different representation subspaces • Head outputs are concatenated and linearly projected |
| Cross-Attention | Decoder attends to encoder: `Q = decoder`, `K = encoder`, `V = encoder` | • Queries come from one sequence (decoder) while keys and values come from another sequence (encoder) • Allows the decoder to focus on relevant parts of the encoder output • Essential in encoder-decoder architectures for seq2seq tasks |
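
The four mechanisms in Table 1 can be sketched together in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names (`softmax`, `scaled_dot_product_attention`, `multi_head_attention`), the random weight matrices, and the sequence lengths 5 and 7 are all assumptions made for the example, and masking, biases, dropout, and batching are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V, applied over the last two axes.
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

def multi_head_attention(x_q, x_kv, W_q, W_k, W_v, W_o, num_heads):
    # x_q: (n_q, d_model) supplies queries; x_kv: (n_kv, d_model) supplies keys and values.
    n_q, d_model = x_q.shape
    head_dim = d_model // num_heads

    def split_heads(x):
        # (n, d_model) -> (num_heads, n, head_dim)
        return x.reshape(x.shape[0], num_heads, head_dim).transpose(1, 0, 2)

    Q = split_heads(x_q @ W_q)
    K = split_heads(x_kv @ W_k)
    V = split_heads(x_kv @ W_v)
    heads, weights = scaled_dot_product_attention(Q, K, V)
    # Concatenate heads back to (n_q, d_model), then apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(n_q, d_model)
    return concat @ W_o, weights

# Hypothetical sizes: d_model = 512 with 8 heads gives head_dim = 64, as in the table.
rng = np.random.default_rng(0)
d_model, num_heads = 512, 8
W_q, W_k, W_v, W_o = (
    rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)
decoder_states = rng.standard_normal((5, d_model))  # 5 decoder positions (assumed)
encoder_states = rng.standard_normal((7, d_model))  # 7 encoder positions (assumed)

# Self-attention: queries, keys, and values all come from the same sequence.
self_out, self_w = multi_head_attention(
    decoder_states, decoder_states, W_q, W_k, W_v, W_o, num_heads
)
# Cross-attention: queries from the decoder, keys and values from the encoder.
cross_out, cross_w = multi_head_attention(
    decoder_states, encoder_states, W_q, W_k, W_v, W_o, num_heads
)
```

In this sketch, self-attention and cross-attention share the same function and differ only in which sequence supplies the keys and values. The attention weights have one matrix per head, with shape `(num_heads, n_q, n_kv)`, and each row sums to 1.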