Large Language Models are transformer-based neural networks trained on massive text datasets to generate, understand, and manipulate human language at scale. At their core, LLMs use self-attention mechanisms to capture contextual relationships between tokens, enabling them to perform tasks ranging from translation and summarization to code generation and complex reasoning. The field has evolved rapidly—from foundational pre-training on trillions of tokens, through specialized fine-tuning and alignment techniques, to sophisticated reasoning models trained with reinforcement learning from verifiable rewards. A key insight: LLMs don't simply memorize text—their emergent abilities to reason in-context and solve unseen problems arise from the interplay of architecture, scale, and post-training dynamics, with 2025–2026 marking a shift toward agentic, multimodal, and long-context systems.
22 tables, 147 concepts.
Table 1: Core Transformer Architecture
Every modern LLM is built from this handful of components stacked dozens of times over. Self-attention is the heart of it — letting each token weigh every other token — and the rest exist to make that work at depth and scale: feed-forward layers add non-linear capacity, residual connections and normalization keep gradients flowing through hundreds of layers, and positional encoding restores the word-order information that attention alone throws away. Master these and you understand the skeleton shared by GPT, LLaMA, and Qwen alike.
| Component | Example | Description |
|---|---|---|
| Self-Attention | `scores = Q @ K.T / sqrt(d_k)` | Computes query-key-value relationships where each token attends to all positions; forms the foundation of transformer parallelization by replacing recurrence. |
| Multi-Head Attention | `heads = 8; Q, K, V = Linear(x, d_k)` | Splits attention into multiple parallel heads that learn different representation subspaces; each head computes scaled dot-product attention independently, and the results are concatenated. |
| Causal Masking | `mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)` | Prevents tokens from attending to future positions by adding a strictly upper-triangular mask of -inf to the scores; critical for autoregressive generation in decoder-only models like GPT. |
| Cross-Attention | `encoder_output → decoder` | Used in encoder-decoder models, where decoder queries attend to encoder keys and values; enables translation and seq2seq by bridging input and output representations. |
| Feed-Forward Network | `FFN(x) = ReLU(xW1 + b1)W2 + b2` | Two-layer MLP applied position-wise after attention; typically expands the dimension 4× and then projects back, providing non-linearity and feature transformation. |
| Residual Connection | `output = x + SubLayer(x)` | Adds the input directly to the sublayer output, enabling gradient flow through deep networks; mitigates vanishing gradients and allows training of 100+ layer transformers. |
| Layer Normalization | `LN(x) = γ(x - μ) / σ + β` | Normalizes activations across the feature dimension for each token; stabilizes training and enables higher learning rates. |
| Positional Encoding | `PE = sin(pos/10000^(2i/d))` | Injects order information into token embeddings; self-attention is permutation-equivariant, so without explicit position signals the model cannot tell one token order from another. |
| QK-Norm | `q = RMSNorm(q); k = RMSNorm(k)` | Applies RMS normalization to query and key vectors before dot-product attention; stabilizes training of very large models and is used in Qwen3 and Trinity Large. |
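
The self-attention and causal-masking rows can be combined into a few lines of pure Python. This is a minimal single-head sketch under stated assumptions: the helper names (`softmax`, `causal_attention`) and the toy inputs are illustrative, and the learned Q/K/V projections and batched tensor operations of real implementations are omitted.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.

    Q, K, V are lists of n vectors (lists of floats); Q and K share width d_k.
    """
    n, d_k = len(Q), len(Q[0])
    out, weights = [], []
    for i in range(n):
        # Score token i against every position j; future positions (j > i) get -inf.
        scores = [
            sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d_k)
            if j <= i else float("-inf")
            for j in range(n)
        ]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        out.append([sum(w[j] * V[j][c] for j in range(n)) for c in range(len(V[0]))])
    return out, weights

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy inputs used as Q, K and V
out, weights = causal_attention(X, X, X)
```

The first token can only attend to itself, so its weight row is all on position 0; later rows spread weight over the positions at or before them.
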
Table 2: Positional Encoding Variants
Since attention treats a sequence as an unordered bag of tokens, a model needs some way to know which word came first — and the choice of how to inject that order has outsized consequences for how far a model can extrapolate beyond its training length. These variants trace the field's evolution from the original fixed sinusoids and learned absolute embeddings toward relative schemes like RoPE and ALiBi that generalize to longer contexts, and even NoPE, which drops explicit position signals entirely.
| Method | Example | Description |
|---|---|---|
| RoPE | `rotate(q) @ rotate(k).T` | Applies rotation matrices to query-key pairs, encoding relative position via complex-plane rotation; used in LLaMA and GPT-NeoX; extends to longer contexts when combined with scaling methods such as position interpolation or YaRN. |
| Learned Absolute | `pos_emb = Embedding(max_len, d_model)` | Trainable position embeddings learned during training; used in BERT and GPT-2; limited to the max_len seen during training, with no extrapolation. |
| Sinusoidal | `PE(pos, 2i) = sin(pos/10000^(2i/d)); PE(pos, 2i+1) = cos(pos/10000^(2i/d))` | Original Transformer encoding using fixed sine and cosine functions at different frequencies; provides unique position signals but does not explicitly encode relative distances. |
| ALiBi | `attn + bias * (-1, -2, -3, ...)` | Adds a linear penalty to attention scores based on distance; no position embeddings needed; extrapolates well to sequences longer than the training context. |
| Relative Position Bias | `bias = learned_bias[k - q]` | Encodes the distance between positions rather than absolute indices; better length generalization but computationally expensive for long sequences. |
| NoPE | no pos embeddings in global layers | Eliminates positional embeddings in global attention layers entirely; relies on architectural bias and training for order; used in SmolLM3 global attention layers. |
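
To see why RoPE counts as a relative scheme, here is a small sketch that rotates consecutive pairs of a vector by position-dependent angles and checks that the query-key score depends only on the offset between positions. The function names and the toy vectors are assumptions for illustration; the pairing convention (adjacent even/odd dimensions) is one of several used in practice.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive (even, odd) pairs of `vec` by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))   # earlier pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (3) at different absolute positions gives the same score.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))
# A different offset (4) gives a different score.
score_c = dot(rope(q, 6), rope(k, 2))
```
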
Table 3: Model Architecture Variants
The same transformer building blocks can be wired in fundamentally different ways, and that wiring decides what a model is good at. Decoder-only models dominate today because their autoregressive design scales and generates so well, but encoder-only models still win on pure understanding tasks, and encoder-decoder models suit translation. The later rows cover the structural innovations reshaping frontier models — Mixture of Experts for scaling parameters cheaply, reasoning models that think before answering, and vision-language models that fold images into the same context.
| Type | Example | Description |
|---|---|---|
| Decoder-Only | GPT-2, GPT-3, LLaMA, Qwen3 | Causal masked attention for autoregressive generation; trained to predict the next token; dominates modern LLMs due to scalability and generation quality. |
| Encoder-Only | BERT, RoBERTa | Bidirectional attention processes the entire sequence; excels at understanding tasks like classification, NER, and question answering; cannot generate text autoregressively. |
| Encoder-Decoder | T5, BART | Separate encoder (bidirectional) and decoder (causal) joined by cross-attention; well suited to seq2seq tasks like translation and summarization. |
| Mixture of Experts (MoE) | `router(x) → top-k experts` | Sparsely activated architecture where a gating network routes each token to a subset of expert FFNs; scales parameters without a proportional compute increase (Mixtral, DeepSeek-V3). |
| Reasoning Model | `<think>...</think>\nAnswer` | Decoder-only model trained with RL to produce extended chain-of-thought in a scratchpad before the final answer; shows emergent self-reflection and verification (o1, DeepSeek-R1). |
| Vision-Language Model | CLIP, LLaVA, Gemini | Integrates a vision encoder (ViT) with a language model via a projection layer or cross-attention; enables joint multimodal understanding of images and text. |
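
The Mixture of Experts row describes routing each token to a few experts. A toy sketch of top-k routing follows, assuming a softmax over only the selected experts' router logits (one common choice) and using scalar lambdas as hypothetical stand-ins for expert FFNs.

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda e: router_logits[e], reverse=True)[:k]
    m = max(router_logits[e] for e in top)
    exps = {e: math.exp(router_logits[e] - m) for e in top}
    total = sum(exps.values())
    return {e: exps[e] / total for e in top}

def moe_layer(x, router_logits, experts, k=2):
    gates = top_k_route(router_logits, k)
    # Only the selected experts run; the others cost no compute for this token.
    return sum(g * experts[e](x) for e, g in gates.items())

# Hypothetical scalar "experts" standing in for feed-forward networks.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
logits = [0.1, 2.0, -1.0, 2.0]
gates = top_k_route(logits, k=2)
y = moe_layer(3.0, logits, experts, k=2)
```

With four experts and k=2, half of the expert parameters are untouched for this token, which is the sense in which parameter count grows faster than per-token compute.
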
Table 4: Tokenization Algorithms
Before a model ever sees text it has to be chopped into tokens, and the algorithm that does the chopping shapes vocabulary size, how gracefully the model handles rare or unseen words, and even how well it works across languages. These methods mostly differ in how they decide which character or subword pieces to merge — by frequency in BPE, by likelihood in WordPiece, by probability in Unigram — with byte-level variants guaranteeing that any string at all can be encoded without unknown tokens.
| Algorithm | Example | Description |
|---|---|---|
| BPE (Byte-Pair Encoding) | `"playing" → ["play", "ing"]` | Iteratively merges the most frequent adjacent symbol pairs in the training corpus; balances vocabulary size with coverage; used in GPT-2 and GPT-3; stores merge rules. |
| SentencePiece | treats spaces as token "▁" | Language-agnostic tokenizer operating on raw text without pre-tokenization; encodes whitespace as the meta symbol ▁ (U+2581); supports BPE and unigram models; used in T5 and LLaMA. |
| WordPiece | `"unaffable" → ["un", "##aff", "##able"]` | Similar to BPE but selects merges by likelihood maximization rather than raw frequency; used in BERT; saves only the final vocabulary, not the merge operations. |
| Unigram | probabilistic token selection | Maintains a vocabulary with token probabilities; removes tokens iteratively to minimize loss; allows multiple segmentations, unlike greedy BPE and WordPiece. |
| Byte-Level BPE | any byte sequence tokenizable | Operates on raw UTF-8 bytes, so any string is tokenizable without unknown tokens; used in GPT-2 and GPT-4 (tiktoken); fully language-agnostic. |
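
The BPE merge loop is short enough to sketch directly. The version below is a simplified illustration, not a production tokenizer: it assumes whitespace-split words, no end-of-word marker, and ties broken by first occurrence, and the function names are made up for this example.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # ties resolve to the first pair seen
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

merges, words = train_bpe("low low low lower lowest", num_merges=2)
```

After two merges the frequent word "low" has become a single symbol, while "lower" and "lowest" still split into "low" plus leftover characters.
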
Table 5: Pre-training Objectives
The pre-training objective is the self-supervised game a model plays over trillions of tokens, and it determines what kind of model you end up with. Predicting the next token (CLM) gives you a generator like GPT; masking and predicting tokens from both sides (MLM) gives you a bidirectional understander like BERT; corruption-and-reconstruction objectives power text-to-text models like T5. The remaining rows cover newer twists like multi-token prediction for speed and contrastive learning for aligning text with images.
| Objective | Example | Description |
|---|---|---|
| Causal Language Modeling (CLM) | `P(token_i \| token_<i)` | Predicts the next token given left context only; standard decoder-only objective maximizing the likelihood of training sequences; used in the GPT family. |
| Masked Language Modeling (MLM) | `P([MASK] \| context)` | Randomly masks ~15% of tokens and predicts them from bidirectional context; BERT's core pre-training objective, enabling deep bidirectional representations. |
| Multi-Token Prediction | predict tokens t+1, t+2, t+3 | Trains multiple prediction heads to predict several future tokens simultaneously; improves sample efficiency and enables speculative decoding at inference (DeepSeek-V3: 1.8× speedup). |
| Denoising / Span Corruption | corrupt → reconstruct | Masks or corrupts spans of text, then trains the model to reconstruct the original; T5 frames all tasks as text-to-text generation with varying corruption strategies. |
| Prefix LM | bidirectional prefix → causal | Applies bidirectional attention to the prefix and causal attention to the continuation; bridges encoder and decoder benefits (UniLM, GLM). |
| Contrastive Learning | `align(text, image)` | Trains the model to match positive (text, image) pairs while separating negative pairs; CLIP uses dual encoders with a contrastive loss for vision-language alignment. |
Table 6: Fine-Tuning Techniques
Once a model is pre-trained, fine-tuning adapts it to a specific task or behavior — and the central tension here is cost versus completeness. Full supervised fine-tuning updates every parameter, but the parameter-efficient methods that follow (LoRA, QLoRA, DoRA, adapters, prefix and prompt tuning) freeze the base model and train a tiny fraction of new weights instead, slashing memory enough to fine-tune huge models on a single GPU while keeping quality close. Instruction tuning sits apart as the step that teaches a model to follow commands rather than just continue text.
| Technique | Example | Description |
|---|---|---|
| Full Fine-Tuning (SFT) | train on (input, output) pairs | Standard gradient-based training on task-specific labeled data; updates all parameters to adapt the pre-trained model to a downstream task. |
| Instruction Tuning | `"Translate: [text]" → output` | Fine-tunes on diverse instruction-following datasets formatted as explicit commands; dramatically improves zero-shot task generalization (FLAN, InstructGPT). |
| LoRA | `ΔW = BA (rank r << d)` | Freezes the base model and trains low-rank decomposition matrices injected into attention layers; reduces trainable parameters by up to 10,000× (the figure reported for GPT-3 175B) while preserving quality. |
| QLoRA | 4-bit base + LoRA adapters | Quantizes the base model to 4-bit precision and applies LoRA on top; enables fine-tuning 65B models on a single 48GB GPU with minimal degradation. |
| DoRA | `W = m · (W₀ + BA) / ‖W₀ + BA‖` | Decomposes pre-trained weights into magnitude and direction components; trains direction via LoRA and magnitude separately; reported to outperform LoRA with no extra inference cost. |
| Adapters | insert bottleneck layers | Adds small trainable feed-forward modules between frozen transformer layers; modular approach allowing task-specific adapters without full model copies. |
| Prefix Tuning | prepend trainable vectors | Optimizes continuous task-specific vectors prepended at each layer; keeps the model frozen; competitive with full fine-tuning on some tasks with about 0.1% of the parameters. |
| Prompt Tuning | optimize soft prompts | Learns continuous prompt embeddings at the input rather than discrete text; more parameter-efficient than prefix tuning; effectiveness increases with model scale. |
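
The LoRA row's `ΔW = BA` can be made concrete with a tiny numeric example and a parameter count. This is a sketch only: the 3×3 matrices and the 4096×4096 layer size are illustrative assumptions, and scaling factors such as `alpha / r` are left out.

```python
def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_param_counts(d_out, d_in, r):
    """Trainable parameters: full update vs. low-rank update ΔW = B @ A."""
    full = d_out * d_in
    lora = r * (d_out + d_in)   # B is d_out x r, A is r x d_in
    return full, lora

# Tiny numeric example: a rank-1 update to a frozen 3x3 weight.
W0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
B = [[1.0], [0.0], [2.0]]    # d_out x r
A = [[0.5, 0.0, -0.5]]       # r x d_in
delta = matmul(B, A)
W = [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(W0, delta)]

# Hypothetical 4096 x 4096 projection with rank 8.
full, lora = lora_param_counts(4096, 4096, r=8)
```

For this single matrix the low-rank update trains 256× fewer parameters than a full update; the much larger whole-model ratios come from adapting only a few matrices per layer and freezing everything else.
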
Table 7: Alignment and RLHF
Alignment is what turns a raw next-token predictor into a model that's helpful, harmless, and actually does what you ask, and this is one of the fastest-moving areas in the field. The lineage runs from classic three-stage RLHF (reward model plus PPO) toward simpler, cheaper successors — DPO and its variants skip the separate reward model, GRPO drops the critic, and RLVR replaces human labels with programmatic verifiers, which is precisely what unlocked the self-reflecting reasoning models like DeepSeek-R1.
| Method | Example | Description |
|---|---|---|
| RLHF | reward model → PPO training | Three-stage process: SFT, train a reward model on human preferences, then optimize the policy via PPO; used in ChatGPT and Claude; computationally expensive. |
| GRPO | sample G outputs → normalize rewards | Samples a group of responses per prompt and estimates advantages by normalizing rewards within the group; eliminates the critic model, roughly halving memory vs. PPO; used in DeepSeek-R1. |
| DPO | directly optimize preferences | Bypasses the reward model by directly optimizing the policy on preference pairs under the Bradley-Terry model; simpler and more stable than PPO-based RLHF, with comparable reported results. |
| SimPO | avg log-prob as implicit reward | Removes the reference model by using the average log-probability of a response as the implicit reward; reported to outperform DPO by up to 6.4 points on AlpacaEval 2 with no extra memory cost. |
| ORPO | combine SFT + preference in one loss | Merges SFT and preference alignment into a single training objective using odds ratios; eliminates the reference model and the separate SFT stage, so training takes one pass instead of two. |
| KTO | thumbs-up / thumbs-down labels | Works with binary feedback instead of pairwise preferences; cheaper data collection for production systems with like/dislike signals; derived from prospect theory. |
| RLVR | unit test pass / math checker → reward | Uses programmatic verifiers (unit tests, math checkers) as the reward signal instead of human labels; enables emergent self-reflection and verification on math and code tasks. |
| DAPO | clip-higher + token-level loss | Extends GRPO for long chain-of-thought reasoning with entropy-collapse prevention, dynamic sampling, and a token-level policy gradient; reported to need 50% fewer steps than DeepSeek-R1-Zero. |
| Constitutional AI | self-critique + revision | The model critiques its own outputs against constitutional principles and then revises them; reduces reliance on human feedback for harmlessness alignment. |
| RLAIF | AI-generated preferences | Replaces human labelers with AI-generated feedback for reward model training; scales preference data collection; effective for capability-focused alignment. |
| Reward Modeling | score(response) from pairs | Trains a classifier to predict human preference between response pairs; converts subjective preferences into a scalar reward signal for RL optimization. |
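
The GRPO row's "normalize rewards within the group" step is only a few lines. The sketch below assumes population standard deviation and a small `eps` for numerical safety; the function name and the 0/1 verifier rewards (in the style of the RLVR row) are illustrative.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one math prompt, scored by a verifier (1 = correct).
adv = group_advantages([1.0, 0.0, 0.0, 1.0])

# If every sample gets the same reward, the group carries no learning signal.
flat = group_advantages([1.0, 1.0, 1.0])
```

Because the baseline is the group mean, no separate value network is needed, which is where the memory saving over PPO comes from.
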
Table 8: Inference Optimization Techniques
Serving an LLM cheaply and fast is its own engineering discipline, and these techniques attack the bottlenecks that make autoregressive generation slow and memory-hungry. KV and prefix caching avoid recomputing work across tokens and requests, Flash Attention and PagedAttention squeeze far more out of GPU memory, and speculative methods like draft models and Medusa generate several tokens at once to break the one-token-at-a-time barrier. Together they're what make production inference economically viable.
| Technique | Example | Description |
|---|---|---|
| KV Cache | cache computed K, V | Stores previously computed key-value pairs during autoregressive generation; avoids redundant computation; essential for production inference efficiency. |
| Prefix Caching | cache system prompt KV once | Reuses the KV cache for identical prompt prefixes across requests; skips recomputing static content like system prompts; up to 90% cost and 85% latency reduction for long prompts. |
| Flash Attention | tiling + recomputation | IO-aware attention algorithm using block-wise computation and kernel fusion; reduces memory bandwidth by 5–20×, enabling 2–4× speedup on long sequences. |
| PagedAttention | block-level memory management | Manages the KV cache in non-contiguous blocks, like OS paging; reduces memory fragmentation so vLLM can fit larger batches, with reported throughput gains of 2–24× over earlier serving stacks. |
| Speculative Decoding | draft model → verify in parallel | A small draft model generates candidate tokens that the large model verifies in parallel; 2–3× speedup without changing the output distribution. |
| Medusa | extra decoding heads predict t+1, t+2 | Adds multiple prediction heads to an LLM and verifies candidates in parallel via tree-based attention; 2.2–3.6× speedup without a separate draft model. |
| Continuous Batching | dynamic request batching | Evicts finished sequences and immediately admits new ones (in-flight batching); maximizes GPU utilization compared with static batching. |
| Quantization | convert FP16 → INT8 | Reduces the precision of weights and activations to 8-bit or lower; 2–4× memory reduction and speedup, with <1% accuracy loss when calibrated. |
| Flash Attention 2/3 | improved kernel fusion | Enhanced IO-aware kernels with lower SRAM usage and better parallelization; Flash-3 is up to 2× faster than Flash-2 on H100 thanks to asynchrony optimizations. |
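
A quick way to see why KV-cache management dominates serving cost is to compute its size. The formula below counts one key and one value vector per layer, per KV head, per position; the model dimensions are hypothetical and chosen only to give round numbers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and position."""
    # The leading factor 2 covers the separate K and V tensors.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value

# Hypothetical 32-layer model, 32 KV heads of width 128, FP16, 4096-token context.
size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
gib = size / 2**30

# The same model serving a batch of 16 such requests.
batch_gib = kv_cache_bytes(32, 32, 128, 4096, batch=16) / 2**30
```

The cache grows linearly with both context length and batch size, which is why paging, prefix reuse, and fewer KV heads all translate directly into larger batches.
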
Table 9: Sampling and Decoding Strategies
A model outputs a probability distribution over the next token; the decoding strategy decides which token actually gets picked, and it's the single biggest lever on whether output feels robotic or creative. Greedy decoding always grabs the top choice and tends toward repetition, while temperature, top-p, top-k, and min-p sampling inject controlled randomness to keep generation lively without going off the rails. Beam and contrastive search take different routes for tasks like translation that reward coherence over surprise.
| Strategy | Example | Description |
|---|---|---|
| Greedy Decoding | `argmax(logits)` | Selects the highest-probability token at each step; deterministic and fast but prone to repetitive, low-quality output on creative tasks. |
| Temperature | `logits / T` before softmax | Divides logits by T; T < 1 (e.g., 0.1–0.7) sharpens the distribution, T > 1 (up to about 2.0) flattens it and increases randomness, and T = 1 leaves it unchanged. |
| Top-p (Nucleus) | cumulative prob >= p = 0.9 | Dynamically selects the smallest set of tokens whose cumulative probability reaches the threshold p; adapts to the shape of the distribution; preferred for open-ended generation. |
| Top-k | sample from top k = 40 tokens | Restricts sampling to the k highest-probability tokens; prevents sampling rare tokens, but a fixed k can be too restrictive or too permissive depending on the distribution. |
| Min-p | `filter tokens < min_p * max_prob` | Removes tokens whose probability is below a min_p fraction of the top token's probability; often reported to hold quality better than top-p at higher temperatures. |
| Beam Search | keep top-k sequences | Maintains k parallel hypotheses and expands the most probable; improves sequence-level likelihood at higher compute cost; common for translation. |
| Contrastive Search | `argmax(model_score - α·cos_sim)` | Selects tokens that are probable but distinct from the previous context via a cosine-similarity penalty; reduces repetition while maintaining coherence. |
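
The sampling filters in this table all operate on the same next-token distribution, so they are easy to compare side by side. The sketch below is illustrative: function names are invented for this example, each filter returns a renormalized distribution rather than drawing a sample, and the toy probabilities are made up.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    m = max(logits)
    exps = [math.exp((v - m) / T) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def renormalize(probs, keep):
    """Zero out everything outside `keep` and rescale the rest to sum to 1."""
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return renormalize(probs, keep)

def top_p_filter(probs, p):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= p:          # smallest prefix whose mass reaches p
            break
    return renormalize(probs, keep)

def min_p_filter(probs, min_p):
    threshold = min_p * max(probs)
    keep = [i for i, q in enumerate(probs) if q >= threshold]
    return renormalize(probs, keep)

probs = [0.5, 0.3, 0.15, 0.05]
```

On this distribution, `top_k_filter(probs, 2)` and `min_p_filter(probs, 0.4)` keep the same two tokens, while `top_p_filter(probs, 0.9)` also keeps the third because the first two only cover 0.8 of the mass.
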
Table 10: Training Optimizations
Training a large model bumps into the hard limits of GPU memory and time, and these techniques are the everyday tricks for staying within them. Mixed precision and gradient checkpointing trade bits and recomputation for room to fit bigger models, gradient accumulation fakes a larger batch size, and the optimizer-and-schedule choices — AdamW with warmup and cosine decay — are the near-universal recipe that keeps large-model training from diverging.
| Technique | Example | Description |
|---|---|---|
| Mixed Precision | compute in FP16/BF16, keep FP32 master weights | Runs forward and backward passes in 16-bit floats while the optimizer updates FP32 master weights; cuts activation memory roughly in half and accelerates training on Tensor Cores; BF16 is preferred for stability. |
| Gradient Checkpointing | recompute activations | Trades compute for memory by recomputing forward activations during the backward pass instead of storing them; enables 2–10× larger models at a 20–30% slowdown. |
| Gradient Accumulation | accumulate over N steps | Simulates a larger effective batch size by accumulating gradients across microbatches before each update; crucial for training large models on limited memory. |
| AdamW | decouple weight decay | Fixes weight decay in Adam by applying it directly to the weights rather than through the gradient; improves generalization; standard optimizer for transformer training. |
| Learning Rate Warmup | linear increase 0 → max_lr | Gradually increases the learning rate from zero over the initial steps (typically 2–10% of training); stabilizes training of large models and prevents divergence. |
| Cosine Decay | `lr = min + 0.5(max - min)(1 + cos(π·t/T))` | Decreases the learning rate along a cosine curve, where t is the current step and T the total number of decay steps; the smooth decay helps the model converge to flatter minima; often combined with warmup. |
| GaLore | project grad to low-rank subspace | Projects gradients into a low-rank subspace via periodic SVD; enables full-rank training with an optimizer memory footprint comparable to LoRA; no inference overhead. |
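
The warmup and cosine-decay rows are usually combined into one schedule. A minimal sketch, assuming linear warmup from zero and a decay phase that spans the remaining steps; the function name and the example hyperparameters are placeholders rather than recommended values.

```python
import math

def lr_at(step, total_steps, warmup_steps, max_lr, min_lr=0.0):
    """Linear warmup from 0 to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Example: 1000 steps, 10% warmup, peak learning rate 3e-4.
schedule = [lr_at(s, total_steps=1000, warmup_steps=100, max_lr=3e-4) for s in range(1001)]
```

The schedule peaks exactly at the end of warmup and reaches `min_lr` at the final step; halfway through the decay phase the rate is the midpoint between the two.
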
Table 11: Distributed Training Strategies
No single GPU can hold a frontier model, so training is spread across many, and these strategies differ in what they split. Data parallelism replicates the model and splits the batch; tensor and pipeline parallelism slice the model itself, within layers and across layers respectively; and ZeRO/FSDP shard the optimizer states, gradients, and parameters to wring out memory. Real large-scale runs combine several of these at once, so understanding what each one partitions is the key to reading any training setup.
| Strategy | Example | Description |
|---|---|---|
| Data Parallelism | replicate model across GPUs | Each GPU holds a full model copy and processes different data batches; gradients are averaged across devices; simplest form of parallelism but limited by single-GPU memory. |
| Tensor Parallelism | split each layer's weights across GPUs | Partitions individual layers (e.g., attention heads) across devices; requires frequent all-reduce communication; used in Megatron-LM for very large models. |
| ZeRO | shard optimizer states | Partitions optimizer states, gradients, and parameters across devices with on-demand gathering; ZeRO-3 achieves the memory efficiency of model parallelism with the simplicity of data parallelism. |
| FSDP | PyTorch native ZeRO-3 | PyTorch implementation of ZeRO-3-style sharding; automatically manages parameter gathering and scattering; simpler API than DeepSpeed for distributed training. |
| Pipeline Parallelism | layer stages on different GPUs | Splits the model by layers into pipeline stages; microbatching reduces bubble overhead; implemented in GPipe and PipeDream. |
| Sequence Parallelism | partition along sequence dim | Splits the sequence length across GPUs for memory-constrained layers like LayerNorm; extends tensor parallelism and reduces activation memory for long sequences. |
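
The ZeRO stages are easiest to compare with a little memory arithmetic. The sketch below assumes mixed-precision Adam with 16 bytes of model state per parameter (2 for FP16 weights, 2 for FP16 gradients, 12 for FP32 master weights plus two Adam moments) and ignores activations and buffers; the 7.5B-parameter, 64-GPU setting is an illustrative choice.

```python
def zero_bytes_per_gpu(n_params, n_gpus, stage):
    """Per-GPU model-state memory for mixed-precision Adam under ZeRO.

    Per parameter: 2 bytes FP16 weights + 2 bytes FP16 gradients
    + 12 bytes optimizer state (FP32 master weights, momentum, variance).
    """
    weights, grads, optim = 2 * n_params, 2 * n_params, 12 * n_params
    if stage >= 1:
        optim /= n_gpus      # ZeRO-1 shards optimizer states
    if stage >= 2:
        grads /= n_gpus      # ZeRO-2 also shards gradients
    if stage >= 3:
        weights /= n_gpus    # ZeRO-3 / FSDP also shards parameters
    return weights + grads + optim

n_params, n_gpus = 7_500_000_000, 64
gb = {stage: zero_bytes_per_gpu(n_params, n_gpus, stage) / 1e9 for stage in range(4)}
```

Under these assumptions plain data parallelism (stage 0) needs 120 GB of model state per GPU, while full sharding brings it under 2 GB, at the price of gathering parameters on demand.
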
Table 12: Model Compression Techniques
Compression shrinks a trained model so it runs faster and cheaper, ideally without losing much accuracy. Quantization drops weights to lower precision, distillation trains a small student to imitate a large teacher, and pruning cuts out weights or whole components that contribute little — each making a different trade between how much you save and how much retraining or calibration it costs.
| Technique | Example | Description |
|---|---|---|
| Post-Training Quantization | GPTQ, AWQ | Converts trained model weights to lower precision (INT4/INT8) without retraining; calibrated on a small dataset; 3–4× compression with minimal accuracy loss. |
| Knowledge Distillation | student learns from teacher | Trains a small student model to match the outputs or intermediate representations of a large teacher; compresses the model while retaining capabilities (e.g., DistilBERT). |
| Pruning | remove low-magnitude weights | Removes unimportant weights or entire structured components (heads, layers); can reduce parameters by 40–60% but requires careful calibration or retraining. |
| Quantization-Aware Training | simulate quantization during training | Inserts fake quantization operations in the forward pass to model precision effects; typically achieves better accuracy than post-training methods. |
| Low-Rank Factorization | `W ≈ UV (rank r)` | Approximates weight matrices as a product of low-rank matrices; reduces parameters in linear layers; basis of LoRA and similar PEFT methods. |
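
The simplest form of weight quantization is symmetric absmax rounding, sketched below. This is a baseline illustration only: GPTQ and AWQ use calibration data and per-group scales to do considerably better, and the function names and sample weights here are invented.

```python
def quantize_int8(weights):
    """Symmetric per-tensor absmax quantization to signed 8-bit integers."""
    scale = max(abs(w) for w in weights) / 127 or 1.0   # avoid a zero scale
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.02, -1.27, 0.5, 0.0, 0.64]
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
```

The round-trip error per weight is bounded by half the scale, so a single large outlier inflates the error for every other weight in the tensor; that sensitivity is the motivation for per-channel and per-group scales.
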
Table 13: Context Window and Long-Context Techniques
A model can only attend over so many tokens at once, and attention's quadratic cost makes simply growing that window expensive, so these techniques extend a model's effective reach in complementary ways. RAG sidesteps the limit by retrieving relevant text on demand; RoPE scaling, position interpolation, and YaRN stretch the position encoding to handle far longer sequences than those seen in training; and sliding-window and sparse attention cut the quadratic cost so the longer context is actually affordable to run.
| Technique | Example | Description |
|---|---|---|
| RAG | retrieve docs → augment prompt | Retrieves relevant documents from an external knowledge base and injects them into the context; extends effective knowledge beyond the model's limits; requires a good retrieval system. |
| RoPE Scaling | adjust rotation frequencies | Modifies the RoPE base frequency to extrapolate to longer sequences; simple method enabling 2–4× context extension with minimal fine-tuning. |
| Position Interpolation | compress position indices | Interpolates positions within the training range rather than extrapolating; more stable than direct extrapolation for extended context. |
| YaRN | scale + adjust NTK base | Combines NTK-aware scaling with attention temperature adjustment per head; efficient context extension from 4K to 128K+ tokens. |
| Sliding Window Attention | attend to local window | Each token attends only to a fixed-size window around its position; linear memory but limited long-range modeling; used in Longformer. |
| Sparse Attention | attend to subset of positions | Computes attention only for selected position pairs using patterns (local, strided, global); reduces the O(n²) cost, enabling 10×+ longer sequences. |
| Compressive Memory | compress past into memory | Summarizes earlier context into a compressed memory state; enables unbounded context in theory but loses fine-grained information from the distant past. |
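
The saving from sliding-window attention can be counted directly from the mask. The sketch below uses a causal window (each token sees itself and the previous `window - 1` tokens), which is an assumption for this example; Longformer's window is symmetric around each position, but the linear-versus-quadratic scaling is the same.

```python
def sliding_window_mask(n, window):
    """mask[i][j] is True when token i may attend to token j (causal, local)."""
    return [[0 <= i - j < window for j in range(n)] for i in range(n)]

def attended_pairs(mask):
    return sum(sum(row) for row in mask)

small = sliding_window_mask(4, window=2)

# A window as long as the sequence is plain causal attention.
full = attended_pairs(sliding_window_mask(1024, window=1024))
local = attended_pairs(sliding_window_mask(1024, window=128))
```

With a fixed window the number of attended pairs grows as roughly `n * window` instead of `n * n / 2`, so doubling the sequence length doubles the cost rather than quadrupling it.
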
Table 14: Prompt Engineering Techniques
Prompting is how you steer a frozen model at inference time, no training required — and a surprising amount of capability is unlocked just by how you ask. Few-shot and zero-shot prompting set the baseline, while the reasoning-oriented techniques are the real workhorses: chain-of-thought coaxes the model to show its steps, self-consistency votes across multiple attempts, and ReAct and tree-of-thoughts add tool use and search for genuinely hard problems.
| Technique | Example | Description |
|---|---|---|
| Few-Shot | Example1, Example2, ... Query | Demonstrates the task through 2–10 input-output examples in the prompt; exploits in-context learning; effectiveness grows with model scale; examples should be diverse. |
| Zero-Shot | `"Translate to French: [text]"` | Provides a task instruction only, without examples; relies on pre-training and instruction tuning; quality depends heavily on model capabilities and prompt clarity. |
| Chain-of-Thought | `"Let's think step by step"` | Prompts the model to generate intermediate reasoning steps before the final answer; dramatically improves performance on math, logic, and commonsense reasoning. |
| Self-Consistency | sample multiple paths → vote | Generates multiple reasoning paths, then selects the most consistent answer via majority voting; improves reliability over single-path CoT. |
| ReAct | Thought → Action → Observation | Interleaves reasoning and tool use; the model generates thoughts, selects actions (API calls, searches), and observes results iteratively until a solution is found. |
| Tree of Thoughts | explore reasoning tree | Explores multiple reasoning branches with backtracking and evaluation; enables deliberate problem-solving for complex tasks that require search. |
| Skeleton-of-Thought | outline → parallel expand | Generates a skeleton outline first, then expands each point in parallel; reduces end-to-end latency by up to 2× on modern hardware. |
Table 15: Emergent Capabilities and Scaling
This is the theory behind why making models bigger keeps working. Scaling laws predict how loss falls as a power law with compute, data, and size, and the Chinchilla result refined that into how to balance the budget — revealing that many models were badly undertrained. The more surprising entries are in-context learning and emergent abilities: capabilities like few-shot learning and reasoning that appear seemingly out of nowhere once a model crosses a certain scale.
| Concept | Example | Description |
|---|---|---|
| Scaling Laws | `L(N) ∝ N^(-α)` | Loss scales as a power law with compute, model size, and dataset size; used to plan training compute allocation; vocabulary size also affects optimal scaling. |
| Compute-Optimal Training | Chinchilla scaling | For a fixed compute budget, scaling model size and training tokens together is optimal, often summarized as roughly 20 training tokens per parameter; suggests many models are undertrained relative to their size. |
| In-Context Learning | few-shot without gradients | Ability to learn new tasks from examples in the prompt without parameter updates; improves with scale; the mechanism may involve induction heads in attention layers. |
| Emergent Abilities | reasoning, arithmetic | Capabilities that appear suddenly at scale and are absent in smaller models; include in-context learning and chain-of-thought reasoning; debated whether they are truly emergent or metric artifacts. |
| Transfer Learning | pre-train → fine-tune | Pre-trained models encode general language understanding that transfers to downstream tasks; foundation of modern NLP; larger models transfer better. |
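
The compute-optimal row can be turned into back-of-the-envelope arithmetic. The sketch below rests on two common approximations that are assumptions here rather than exact laws: training cost of about 6 FLOPs per parameter per token (C ≈ 6·N·D), and the rule of thumb of about 20 tokens per parameter. The FLOP budget is an arbitrary example.

```python
import math

def compute_optimal(flops, tokens_per_param=20.0):
    """Split a FLOP budget C between parameters N and tokens D.

    Assumes C ≈ 6·N·D and D ≈ tokens_per_param·N, so N = sqrt(C / (6·ratio)).
    """
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example budget of 5.76e23 FLOPs.
n, d = compute_optimal(5.76e23)
```

Under these assumptions the budget works out to a model of roughly 70 billion parameters trained on roughly 1.4 trillion tokens, far more data per parameter than earlier large models used.
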
Table 16: Activation Functions in Transformers
The activation function inside each feed-forward layer is a small choice with measurable effect on quality. The field has drifted from the simple ReLU of the original Transformer toward smoother and gated variants: GELU in BERT and GPT-2, then the gated SwiGLU (LLaMA, PaLM) and GeGLU (T5 variants), which tend to squeeze out a bit more performance at the cost of an extra projection matrix in each feed-forward layer.
| Function | Example | Description |
|---|---|---|
| SwiGLU | `SwiGLU(x) = Swish(xW) ⊙ (xV)` | Gated variant using the Swish activation (x·sigmoid(x)) with element-wise gating; used in LLaMA and PaLM; empirically outperforms GELU; requires ~50% more FFN parameters for the same hidden size. |
| GELU | `GELU(x) = x·Φ(x)` | Smooth activation that weights the input by the Gaussian CDF; used in BERT and GPT-2; better gradient properties than ReLU; has a probabilistic interpretation as neuron dropout. |
| GeGLU | `GeGLU(x) = GELU(xW) ⊙ (xV)` | Similar to SwiGLU but uses GELU for gating; strong performance on language tasks; used in T5 variants. |
| ReLU | `ReLU(x) = max(0, x)` | Simple piecewise linear function used in the original Transformer; computationally efficient but can suffer from dead neurons; largely replaced in modern LLMs. |
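
The formulas in this table translate directly into code. The sketch below uses the exact erf-based GELU (many libraries also offer a tanh approximation) and plain lists for the SwiGLU matrices; biases and the output projection are omitted, and the identity matrices in the usage line are toy values.

```python
import math

def relu(x):
    return max(0.0, x)

def gelu(x):
    """Exact GELU: x times the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def swish(x):
    return x / (1.0 + math.exp(-x))   # same as x * sigmoid(x)

def swiglu(x, W, V):
    """SwiGLU(x) = Swish(xW) ⊙ (xV) for a vector x and d x h matrices W, V."""
    xw = [sum(xi * W[i][j] for i, xi in enumerate(x)) for j in range(len(W[0]))]
    xv = [sum(xi * V[i][j] for i, xi in enumerate(x)) for j in range(len(V[0]))]
    return [swish(a) * b for a, b in zip(xw, xv)]

identity = [[1.0, 0.0], [0.0, 1.0]]
gated = swiglu([1.0, 2.0], identity, identity)
```

The gate needs two input projections (W and V) where a plain FFN needs one, which is the source of the extra parameters noted in the SwiGLU row.
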
Table 17: Normalization Techniques
Normalization keeps activations in a stable range so deep transformers can train without blowing up, and two questions dominate the design: which normalizer, and where to put it. RMSNorm has largely displaced classic LayerNorm in modern LLMs for being faster at comparable quality, while the Pre-LN versus Post-LN placement decides how stable training is and whether you need a warmup. Dropout, once standard, fades in the largest models, which see so much data that they tend to underfit rather than overfit.
| Method | Example | Description |
|---|---|---|
| RMSNorm | `RMSNorm(x) = x / RMS(x) · γ` | Simplified LN that drops mean centering and normalizes by the RMS only; reported to be 10–20% faster than LN with comparable performance; used in LLaMA, Grok, and Qwen3. |
| LayerNorm | `LN(x) = (x - μ) / σ · γ + β` | Normalizes across the feature dimension for each token independently; standard in transformers; mean and variance are computed per sample, allowing any batch size. |
| Pre-LN vs Post-LN | `Pre: x + Sublayer(LN(x))`; `Post: LN(x + Sublayer(x))` | Pre-LN applies normalization before the sublayer (modern default, more stable); Post-LN applies it after the residual addition (original Transformer, requires warmup); Pre-LN makes convergence easier. |
| Dropout | randomly zero with p = 0.1 | Randomly drops activations during training as regularization; less common in very large LLMs, which are underparameterized relative to their data; typical rates are 0.1–0.2. |
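
The difference between the two normalizers is just whether the mean is subtracted. A minimal per-token sketch, assuming a scalar `gamma` and `beta` for brevity (real layers learn one value per feature) and a small `eps` inside the square root:

```python
import math

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Center by the mean, scale by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in x]

def rms_norm(x, gamma=1.0, eps=1e-5):
    """Scale by the root mean square only; no mean subtraction, no beta."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [gamma * v / rms for v in x]

x = [1.0, 2.0, 3.0, 4.0]
ln, rn = layer_norm(x), rms_norm(x)
```

LayerNorm's output is centered on zero, while RMSNorm's output keeps the sign and relative offset of the input and only fixes its scale, which saves one reduction per token.
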
Table 18: Evaluation Metrics and Benchmarks
Knowing how good a model actually is means picking the right yardstick, and these fall into two camps. Benchmarks like MMLU, GPQA, HumanEval, and SWE-bench probe knowledge, science, and coding ability — though many are nearing saturation or risk contamination, which is why contamination-resistant and human-preference evaluations (LiveBench, Chatbot Arena) have gained ground. The classic automatic metrics — perplexity, BLEU, ROUGE, BERTScore — measure narrower properties like fluency, translation overlap, and summarization quality.
| Metric | Example | Description |
|---|---|---|
| MMLU | 57 subjects, 4-way multiple choice | • Tests knowledge and reasoning across STEM, humanities, social sciences • standard benchmark for general capabilities • 0–100% accuracy. |
| GPQA | 448 PhD-level science questions | • Tests doctoral-level knowledge in biology, physics, chemistry • designed to be hard even with internet access • nearing saturation for frontier models (~94% as of 2026). |
| HumanEval | code synthesis benchmark | • Evaluates code generation with 164 hand-written programming problems • pass@k metric measures functional correctness • standard for coding models. |
| SWE-bench | GitHub issue → code fix | • Tests real-world software engineering by resolving GitHub issues in Python repos • measures fraction of issues resolved • key agentic coding benchmark. |
| LiveBench | monthly refreshed questions | • Contamination-resistant benchmark with questions refreshed monthly using recent data sources • covers math, coding, reasoning, language, data analysis. |
| Chatbot Arena | human preference pairwise ranking | • Crowdsourced pairwise preference evaluation • Elo-rated from millions of blind votes • strong signal for real-world conversational quality. |
| Perplexity | PPL = exp(avg_loss) | • Measures how surprised the model is by test data • lower is better • exponential of average cross-entropy loss • standard language modeling metric. |
| BLEU | n-gram precision with brevity penalty | • Compares n-gram overlap between generated and reference translations • 0–100 scale • standard for machine translation evaluation. |
| ROUGE | ROUGE-L, ROUGE-N | • Measures recall-oriented n-gram overlap • primarily for summarization • ROUGE-L uses longest common subsequence. |
| BERTScore | contextual embedding similarity | • Computes token similarity using BERT embeddings rather than exact matches • captures semantic similarity better than n-gram metrics. |
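
Two of the metrics above reduce to short formulas. The sketch below assumes per-token natural-log probabilities are already available for perplexity, and for pass@k it uses the commonly cited unbiased estimator 1 − C(n−c, k) / C(n, k), where `n` samples were drawn per problem and `c` of them passed the tests. The table itself only names the metric, so the choice of estimator is an assumption here.

```python
import math

def perplexity(token_logprobs):
    # PPL = exp(average negative log-likelihood per token).
    avg_loss = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_loss)

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k samples, drawn without replacement
    # from n generated samples of which c are correct, passes the tests.
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# A model that assigns probability 0.25 to every token of an 8-token text.
uniform_ppl = perplexity([math.log(0.25)] * 8)

# 10 samples per problem, 5 of them correct.
p1 = pass_at_k(n=10, c=5, k=1)
p5 = pass_at_k(n=10, c=5, k=5)
```

A benchmark score is then the mean of `pass_at_k` over all problems.
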
Table 19: Attention Mechanism Optimizations
Standard multi-head attention is expensive at inference, mostly because of the memory the KV cache consumes, so these optimizations rework attention to be cheaper. GQA and MQA shrink the cache by sharing keys and values across heads, MLA compresses them into a low-rank latent (the trick behind DeepSeek-V3), and sliding-window, linear, and Flash Attention attack the quadratic cost from different angles — most of them trading a sliver of quality for major savings in speed and memory.
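
To make the cache argument concrete, here is a back-of-the-envelope calculation. It assumes a hypothetical 32-layer model with 32 query heads, head dimension 128, 16-bit cache values, and a single 8,192-token sequence; none of these numbers describe a specific released model. MLA is left out because its cache size depends on the latent dimension, which the text does not give.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    # The leading 2 counts one K tensor and one V tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical model shape (assumption for illustration).
cfg = dict(n_layers=32, head_dim=128, seq_len=8192)

mha = kv_cache_bytes(n_kv_heads=32, **cfg)  # every query head has its own K, V
gqa = kv_cache_bytes(n_kv_heads=8, **cfg)   # 4 query heads share each K, V pair
mqa = kv_cache_bytes(n_kv_heads=1, **cfg)   # all query heads share one K, V pair

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**20:.0f} MiB per sequence")
```

The cache scales linearly with the number of key-value heads, so the savings from sharing are exactly the ratio of query heads to key-value heads.
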
| Optimization | Example | Description |
|---|---|---|
| GQA (Grouped-Query Attention) | heads share K, V in groups | • Compromise between MQA and MHA: groups of heads share K, V • balances quality-efficiency tradeoff • used in LLaMA-2/3, Mistral, Qwen3. |
| MLA (Multi-head Latent Attention) | compress KV to latent → cache | • Compresses keys and values into a low-rank latent vector (joint KV compression) before caching • reduces KV cache by up to 93% vs. MHA with better modeling quality • used in DeepSeek-V3, Kimi K2. |
| MQA (Multi-Query Attention) | single K, V across heads | • Shares the same key-value projections across all heads, keeping only the queries separate • reduces KV cache memory and speeds inference, at slightly lower quality than GQA. |
| Flash Attention 2/3 | improved kernel fusion | • Enhanced IO-aware kernels that tile through on-chip SRAM to cut HBM traffic, with better parallelization • Flash-3 up to 2× faster than Flash-2 on H100 with asynchrony optimizations. |
| Sliding Window Attention | attend to k nearest tokens | • Restricts attention to a fixed-size local window • reduces complexity to O(n·k) from O(n²) • enables longer sequences but limits global context. |
| Linear Attention | kernel trick approximation | • Approximates softmax attention using kernel methods, reducing complexity to O(n) • enables efficient very long sequences, but quality gaps remain vs. full attention. |
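
The O(n·k) claim for sliding-window attention can be seen by counting the allowed query-key pairs. The sketch below builds a boolean mask, assuming the causal variant used in decoder-only models (each token sees itself and the k−1 tokens before it); the table only says "k nearest tokens", so causality is an assumption.

```python
def sliding_window_mask(n: int, k: int):
    # mask[i][j] is True when query i may attend to key j:
    # causal (j <= i) and inside the window (i - j < k).
    return [[(j <= i) and (i - j < k) for j in range(n)] for i in range(n)]

def attended_pairs(mask) -> int:
    return sum(sum(row) for row in mask)

n, k = 6, 3
window = sliding_window_mask(n, k)
full_causal = sliding_window_mask(n, n)  # a window of size n is plain causal attention

for row in window:
    print("".join("x" if allowed else "." for allowed in row))
```

With a fixed window, the number of attended pairs grows by at most k per extra token, whereas full causal attention adds a whole new row each time.
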
Table 20: Advanced Training Techniques
Beyond the standard recipe, these techniques shape what and in what order a model learns. Multi-task and curriculum learning structure the training signal — sharing parameters across tasks, or sequencing easy examples before hard ones — while continual learning tackles the problem of updating a model on new data without erasing what it already knew. Contrastive learning and data augmentation round out the toolkit for representation quality and squeezing more from limited data.
| Technique | Example | Description |
|---|---|---|
| Multi-Task Learning | train on multiple tasks jointly | • Shares parameters across tasks expecting positive transfer • T5 frames everything as text-to-text • requires balanced sampling and task weighting. |
| Curriculum Learning | easy → hard examples | • Orders training data from simple to complex • can improve convergence and final performance • domain-specific curriculum design needed. |
| Continual Learning | incremental data updates | • Updates model on new data without forgetting previous knowledge • addresses catastrophic forgetting through rehearsal, regularization, or architectural solutions. |
| Contrastive Learning | SimCLR, CLIP | • Learns representations by contrasting positive pairs against negatives • CLIP aligns text-image pairs • effective for self-supervised and multimodal learning. |
| Data Augmentation | back-translation, paraphrasing | • Generates synthetic training variations from existing data • back-translation, EDA, GPT-generated examples • particularly useful for low-resource tasks. |
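
"Contrasting positive pairs against negatives" is usually implemented as a softmax over similarity scores. The sketch below uses the InfoNCE form of that loss with cosine similarity and a temperature of 0.1. Both the specific loss and the temperature are assumptions chosen for illustration, and the 2-D embeddings are toy values.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def info_nce(anchor, positive, negatives, temperature=0.1):
    # Cosine similarity of the anchor to the positive (index 0) and each negative.
    a = normalize(anchor)
    logits = [dot(a, normalize(c)) / temperature for c in [positive] + negatives]
    # Numerically stable log-sum-exp, then -log softmax of the positive.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

anchor = [1.0, 0.0]
# Positive aligned with the anchor: low loss.
aligned = info_nce(anchor, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
# Positive pointing away while a negative is aligned: high loss.
misaligned = info_nce(anchor, [-1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing this loss pulls the positive toward the anchor and pushes negatives away, which is the behavior the table describes for SimCLR and CLIP.
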
Table 21: Model Merging Techniques
Merging combines two or more fine-tuned models into one without any retraining, blending their abilities just by manipulating weights — a cheap way to fuse, say, a coding model and a chat model. The methods differ in how they reconcile conflicting weights: SLERP interpolates two models geometrically, task arithmetic adds and subtracts capability "vectors," and TIES and DARE resolve interference when merging many models at once, with evolutionary search automating the recipe hunt.
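
Geometric interpolation can be made concrete with a small example. The sketch below applies SLERP to two flat weight vectors in plain Python. Treating a whole tensor as one flattened vector and falling back to linear interpolation when the vectors are nearly parallel are both assumptions of this sketch, not a description of any particular merging tool.

```python
import math

def slerp(a, b, t, eps=1e-8):
    # Angle between the two weight vectors.
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        # Nearly parallel vectors: the spherical formula is ill-conditioned,
        # so use plain linear interpolation instead.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    wa = math.sin((1 - t) * omega) / sin_omega
    wb = math.sin(t * omega) / sin_omega
    return [wa * x + wb * y for x, y in zip(a, b)]

model_a = [1.0, 0.0]  # toy "weights"
model_b = [0.0, 1.0]
midpoint = slerp(model_a, model_b, t=0.5)
linear_mid = [0.5 * x + 0.5 * y for x, y in zip(model_a, model_b)]
```

For these unit vectors the SLERP midpoint keeps norm 1, while the linear midpoint shrinks to about 0.71, which is the geometric property SLERP is meant to preserve.
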
| Technique | Example | Description |
|---|---|---|
| SLERP | t=0.5 between model A and B | • Smoothly interpolates between two models' weights in spherical space, preserving geometric properties • best for high-quality pairwise merges • limited to two models at a time. |
| TIES | trim → elect sign → disjoint merge | • Three-step process: trim redundant parameters, elect dominant sign direction, merge aligned parameters • handles multi-model merging by resolving parameter conflicts. |
| DARE | drop delta weights p=0.9, rescale | • Randomly drops task-vector delta weights, then rescales the remaining ones by 1/(1−p) • effective even when dropping 90–99% of deltas • used as a preprocessing step before TIES or Task Arithmetic. |
| Task Arithmetic | task_vector = fine_tuned - pretrained | • Computes task vectors (delta weights) and combines them via arithmetic • add vectors to merge capabilities, negate to remove behaviors • simple and composable. |
| Passthrough (Frankenmerge) | layers 0-32 of A + 24-32 of B | • Stacks layer ranges from different models to create a frankenmerge with exotic parameter counts (e.g., 9B from two 7B models) • experimental but can produce capable models. |
| Evolutionary Merging | evolutionary search over merge configs | • Uses evolutionary algorithms to automatically discover optimal merging recipes and hyperparameters • 50× cost reduction via MERGE³ on single GPU • produces SOTA merged models. |
| Model Soups | average weights of N fine-tuned models | • Averages weights of multiple fine-tuned versions of the same base model • improves accuracy without increasing inference cost • greedy variant evaluates each addition. |
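
Task arithmetic and DARE both operate on delta weights, so they fit in one short sketch. The code below treats each model as a flat list of floats (an assumption for brevity) and uses toy values throughout; `merge` and its `weight` argument are hypothetical helper names.

```python
import random

def task_vector(fine_tuned, pretrained):
    # task_vector = fine_tuned - pretrained
    return [f - p for f, p in zip(fine_tuned, pretrained)]

def dare(delta, p, rng):
    # Drop each delta with probability p and rescale survivors by 1 / (1 - p),
    # which keeps the expected value of every delta unchanged.
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else d * scale for d in delta]

def merge(pretrained, deltas, weight=1.0):
    # Add the (optionally scaled) sum of task vectors back onto the base model.
    return [w + weight * sum(ds) for w, ds in zip(pretrained, zip(*deltas))]

base = [0.0, 1.0, 2.0, 3.0]
model_a = [0.5, 1.0, 2.0, 3.0]  # toy fine-tune that changed weight 0
model_b = [0.0, 1.0, 1.0, 3.0]  # toy fine-tune that changed weight 2

tv_a = task_vector(model_a, base)
tv_b = task_vector(model_b, base)

merged = merge(base, [tv_a, tv_b])                   # combine both capabilities
forgotten = merge(model_a, [[-d for d in tv_a]])     # negate to remove a behavior

rng = random.Random(0)
sparse = dare([1.0] * 10_000, p=0.9, rng=rng)        # about 90% of entries become 0.0
```

Because the two toy task vectors touch different weights, the merge is conflict-free. When they overlap with opposite signs, a method such as TIES is needed to decide which direction wins.
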
Table 22: LLM Agent Concepts
An agent is what you get when an LLM can act in the world rather than just talk — and tool use, the ability to call external functions through structured JSON, is the capability that crosses that line. The rest describe how agents string those actions together: the ReAct loop interleaves reasoning with acting, memory spans a single session or persists across many, planning decomposes big tasks, and multi-agent setups and RLVR-trained agents push toward systems that solve genuinely multi-step problems on their own.
| Concept | Example | Description |
|---|---|---|
| Tool Use (Function Calling) | {"function": "search", "args": {...}} | • LLM selects and invokes external functions (APIs, search, code execution) via structured JSON • the defining capability separating conversational models from agents. |
| ReAct | Thought → Action → Observation → … | • Agent iterates Thought → Action → Observation cycles until task is complete • interleaves reasoning and acting for grounded multi-step problem solving. |
| Memory | context window + vector store | • Short-term: in-context window, cleared after session • Long-term: external vector store or database enabling retrieval across sessions. |
| Planning | task → subtask1, subtask2, ... | • Agent breaks large tasks into manageable subtasks via chain-of-thought or tree-of-thought • can reflect and revise plans based on intermediate results. |
| Multi-Agent Systems | orchestrator → specialist agents | • Multiple LLM agents collaborate: one orchestrates, others specialize • improves performance on tasks requiring diverse expertise or parallel execution. |
| Agentic RAG | agent decides when/what to retrieve | • Agent dynamically decides when to retrieve, what to search, and how to use results • contrasts with static one-shot RAG by iterating retrieval based on partial answers. |
| RLVR for Agents | env reward → policy update | • Trains agents using verifiable environment rewards (tool execution outcomes, test pass/fail) • enables agents to discover optimal multi-step strategies without human demonstrations. |
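
The tool-use and ReAct rows above combine into a simple control loop. The sketch below is a toy version: `scripted_model` is a hypothetical stand-in for an LLM that returns hard-coded JSON, `calculator` is a made-up tool, and the `thought` and `final_answer` keys are assumptions layered on top of the `function` / `args` shape shown in the table.

```python
import json

def calculator(expression: str) -> str:
    # Toy tool: only understands "a + b" with integer operands.
    a, b = expression.split("+")
    return str(int(a) + int(b))

TOOLS = {"calculator": calculator}

def scripted_model(history):
    # Hypothetical stand-in for an LLM call. It requests the tool once,
    # then answers using the most recent observation.
    observations = [s for s in history if s.startswith("Observation:")]
    if not observations:
        return json.dumps({"thought": "I need to add the numbers.",
                           "function": "calculator",
                           "args": {"expression": "2 + 3"}})
    return json.dumps({"thought": "I have the result.",
                       "final_answer": observations[-1].split(": ", 1)[1]})

def run_agent(task, model, tools, max_steps=5):
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        step = json.loads(model(history))              # model emits structured JSON
        history.append(f"Thought: {step['thought']}")
        if "final_answer" in step:
            return step["final_answer"], history
        result = tools[step["function"]](**step["args"])  # Action
        history.append(f"Action: {step['function']}({step['args']})")
        history.append(f"Observation: {result}")          # fed back next turn
    return None, history  # step budget exhausted

answer, trace = run_agent("What is 2 + 3?", scripted_model, TOOLS)
```

The `max_steps` cap is a design choice of this sketch: without it, a model that never emits a final answer would loop indefinitely. The `history` list plays the role of short-term memory from the table.
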