Large Language Models are transformer-based neural networks trained on massive text datasets to generate, understand, and manipulate human language at scale. At their core, LLMs use self-attention mechanisms to capture contextual relationships between tokens, enabling them to perform tasks ranging from translation and summarization to code generation and complex reasoning. The field has evolved rapidly—from foundational pre-training on trillions of tokens, through specialized fine-tuning and alignment techniques, to sophisticated reasoning models trained with reinforcement learning from verifiable rewards. A key insight: LLMs don't simply memorize text—their emergent abilities to reason in-context and solve unseen problems arise from the interplay of architecture, scale, and post-training dynamics, with 2025–2026 marking a shift toward agentic, multimodal, and long-context systems.
22 tables, 147 concepts.
Table 1: Core Transformer Architecture
| Component | Example | Description |
|---|---|---|
| Self-attention (scaled dot-product) | `scores = Q @ K.T / sqrt(d_k)` | • Computes query-key-value relationships where each token attends to all positions • forms the foundation of transformer parallelization by replacing recurrence. |
| Multi-head attention | `heads = 8; Q, K, V = Linear(x, d_k)` | • Splits attention into multiple parallel heads that learn different representation subspaces • each head computes scaled dot-product attention independently, then the results are concatenated and projected. |
| Causal masking | `mask = torch.triu(torch.full((n, n), -inf), diagonal=1)` | • Prevents tokens from attending to future positions by adding a strictly upper-triangular mask of -inf to the scores • critical for autoregressive generation in decoder-only models like GPT. |
| Cross-attention | `encoder_output → decoder` | • Used in encoder-decoder models where decoder queries attend to encoder keys/values • enables translation and seq2seq by bridging input-output representations. |
| Feed-forward network | `FFN(x) = ReLU(xW1 + b1)W2 + b2` | • Two-layer MLP applied position-wise after attention • typically expands the dimension 4× then projects back, providing non-linearity and feature transformation. |
| Residual connection | `output = x + SubLayer(x)` | • Adds the input directly to the sublayer output so gradients can flow through deep networks • mitigates vanishing gradients and allows training of 100+ layer transformers. |
| Layer normalization | `LN(x) = γ(x - μ) / σ + β` | • Normalizes activations across the feature dimension for each token • stabilizes training and enables higher learning rates. |
| Positional encoding | `PE = sin(pos/10000^(2i/d))` | • Injects order information into token embeddings • without explicit position signals, self-attention is permutation-equivariant and cannot distinguish token order. |
| QK-Norm | `q = RMSNorm(q); k = RMSNorm(k)` | • Applies RMS normalization to query and key vectors before dot-product attention • stabilizes training of very large models and is used in Qwen3, Trinity Large. |
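
As an illustration of how the attention score formula and the causal mask in Table 1 fit together, here is a minimal pure-Python sketch of single-head attention. The function names are hypothetical, and the mask is applied by simply not scoring future positions, which is equivalent to adding -inf before the softmax.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.

    Q, K, V are lists of row vectors, one per token.
    Returns (outputs, attention_weights).
    """
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        # Token i only scores positions 0..i; future positions are masked out.
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k)
                  for j in range(i + 1)]
        w = softmax(scores)
        weights.append(w + [0.0] * (len(K) - i - 1))
        outputs.append([sum(w[j] * V[j][c] for j in range(i + 1))
                        for c in range(len(V[0]))])
    return outputs, weights

Q = K = V = [[1.0, 0.0], [0.0, 1.0]]
outputs, weights = causal_attention(Q, K, V)
```

The first token can only attend to itself, so its output equals its own value vector; the second token mixes both positions.
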
Table 2: Positional Encoding Variants
| Method | Example | Description |
|---|---|---|
| RoPE (rotary position embedding) | `rotate(q) @ rotate(k).T` | • Applies position-dependent rotations to query and key vectors so that their dot product depends on relative position • used in LLaMA, GPT-NeoX • extending beyond the training length usually relies on frequency scaling rather than raw extrapolation. |
| Learned positional embeddings | `pos_emb = Embedding(max_len, d_model)` | • Trainable position embeddings learned during training • used in BERT and GPT-2 • limited to the max_len seen during training, with no extrapolation. |
| Sinusoidal positional encoding | `PE(pos, 2i) = sin(pos/10000^(2i/d)); PE(pos, 2i+1) = cos(pos/10000^(2i/d))` | • Original Transformer encoding using fixed sine/cosine functions at different frequencies • provides unique position signals but doesn't explicitly encode relative distances. |
| ALiBi | `attn + bias * (-1, -2, -3, ...)` | • Adds a linear penalty to attention scores based on distance • no position embeddings needed • designed to extrapolate to sequences longer than the training context. |
| Relative position bias | `bias = learned_bias[k - q]` | • Encodes the distance between positions rather than absolute indices • better length generalization but computationally expensive for long sequences. |
| NoPE | `no pos embeddings in global layers` | • Eliminates positional embeddings in global attention layers entirely • relies on architectural bias and training for order • used in SmolLM3 global attention layers. |
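
The rotary row in Table 2 can be made concrete with a small sketch. It assumes the common formulation in which consecutive pairs of dimensions are rotated by an angle `pos * base^(-2i/d)`; the helper names are hypothetical. The point it demonstrates is that the rotated query-key dot product depends only on the offset between the two positions.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive dimension pairs of `vec` by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        # i is the element index of the pair, so i / d equals 2 * pair_index / d.
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

# Same relative offset (2) at two different absolute positions.
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 12), rope(k, 10))
```

Both scores agree up to floating-point error, which is the relative-position property the table row describes.
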
Table 3: Model Architecture Variants
| Type | Example | Description |
|---|---|---|
| Decoder-only | GPT-2, GPT-3, LLaMA, Qwen3 | • Causal masked attention for autoregressive generation • trained to predict the next token • dominates modern LLMs due to scalability and generation quality. |
| Encoder-only | BERT, RoBERTa | • Bidirectional attention processes the entire sequence • excels at understanding tasks like classification, NER, question answering • cannot generate text autoregressively. |
| Encoder-decoder | T5, BART | • Separate encoder (bidirectional) and decoder (causal) connected by cross-attention • well suited to seq2seq tasks like translation and summarization. |
| Mixture of Experts (MoE) | `router(x) → top-k experts` | • Sparsely activated architecture where a gating network routes tokens to a subset of expert FFNs • scales parameters without a proportional compute increase (Mixtral, DeepSeek-V3). |
| Reasoning model | `<think>...</think>\nAnswer` | • Decoder-only model trained with RL to produce extended chain-of-thought in a scratchpad before the final answer • emergent self-reflection and verification (o1, DeepSeek-R1). |
| Vision-language model | CLIP, LLaVA, Gemini | • Integrates a vision encoder (ViT) with a language model via a projection layer or cross-attention • enables multimodal understanding from images and text jointly. |
Table 4: Tokenization Algorithms
| Algorithm | Example | Description |
|---|---|---|
| Byte-Pair Encoding (BPE) | `"playing" → ["play", "ing"]` | • Iteratively merges the most frequent adjacent symbol pairs in the training corpus • balances vocabulary size with coverage • used in GPT-2, GPT-3 • stores merge rules. |
| SentencePiece | treats spaces as token "▁" | • Language-agnostic tokenizer operating on raw text without pre-tokenization • encodes whitespace as a special character • supports BPE and unigram models • used in T5, LLaMA. |
| WordPiece | `"unaffable" → ["un", "##aff", "##able"]` | • Similar to BPE but selects merges based on likelihood maximization rather than frequency • used in BERT • saves the final vocabulary only, not merge operations. |
| Unigram language model | probabilistic token selection | • Maintains a vocabulary with token probabilities • removes tokens iteratively to minimize loss • allows multiple segmentations, unlike greedy BPE/WordPiece. |
| Byte-level BPE | any byte sequence tokenizable | • Operates on raw UTF-8 bytes so any string is tokenizable without unknown tokens • used in GPT-2 and GPT-4 (tiktoken) • fully language-agnostic. |
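
The merge procedure behind BPE in Table 4 can be sketched in a few lines. This is a toy version on a hand-made word-frequency table (the corpus and function names are illustrative assumptions, not taken from any real tokenizer); production tokenizers add pre-tokenization, byte fallback, and faster data structures.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(words, num_merges):
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

# Toy corpus: words split into characters, with counts.
corpus = {
    ("l", "o", "w"): 5,
    ("l", "o", "w", "e", "r"): 2,
    ("n", "e", "w", "e", "s", "t"): 6,
    ("w", "i", "d", "e", "s", "t"): 3,
}
merges, vocab_words = learn_bpe(corpus, 2)
```

The learned `merges` list is what a BPE tokenizer stores and replays, in order, when encoding new text.
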
Table 5: Pre-training Objectives
| Objective | Example | Description |
|---|---|---|
| Causal language modeling | `P(token_i \| token_<i)` | • Predicts the next token given left context only • standard decoder-only objective maximizing the likelihood of training sequences • used in the GPT family. |
| Masked language modeling | `P([MASK] \| context)` | • Randomly masks ~15% of tokens and predicts them from bidirectional context • BERT's core pre-training objective enabling deep bidirectional representations. |
| Multi-token prediction | predict tokens t+1, t+2, t+3 | • Trains multiple prediction heads to predict several future tokens simultaneously • improves sample efficiency and enables speculative decoding at inference (DeepSeek-V3: 1.8× speedup). |
| Span corruption (denoising) | corrupt → reconstruct | • Masks or corrupts spans of text, then trains the model to reconstruct the original • T5 frames all tasks as text-to-text generation with varying corruption strategies. |
| Prefix language modeling | bidirectional prefix → causal | • Applies bidirectional attention to the prefix, then causal attention for the continuation • bridges encoder and decoder benefits (UniLM, GLM). |
| Contrastive learning | `align(text, image)` | • Trains the model to match positive pairs (text-image) while separating negative pairs • CLIP uses dual encoders with a contrastive loss for vision-language alignment. |
Table 6: Fine-Tuning Techniques
| Technique | Example | Description |
|---|---|---|
| Supervised fine-tuning (full) | train on (input, output) pairs | • Standard gradient-based training on task-specific labeled data • updates all parameters to adapt the pre-trained model to a downstream task. |
| Instruction tuning | `"Translate: [text]" → output` | • Fine-tunes on diverse instruction-following datasets formatted as explicit commands • substantially improves zero-shot task generalization (FLAN, InstructGPT). |
| LoRA | `ΔW = BA (rank r << d)` | • Freezes the base model and trains low-rank decomposition matrices injected into attention layers • reduces trainable parameters by up to 10,000× while preserving quality. |
| QLoRA | 4-bit base + LoRA adapters | • Quantizes the base model to 4-bit precision and applies LoRA • enables fine-tuning 65B models on a single 48GB GPU with minimal degradation. |
| DoRA | `W = m · (W₀ + BA) / ‖W₀+BA‖` | • Decomposes pre-trained weights into magnitude and direction components • trains direction via LoRA and magnitude separately • reported to outperform LoRA with no extra inference cost. |
| Adapter layers | insert bottleneck layers | • Adds small trainable feed-forward modules between frozen transformer layers • modular approach allowing task-specific adapters without full model copies. |
| Prefix tuning | prepend trainable vectors | • Optimizes continuous task-specific vectors prepended to each layer • keeps the model frozen • competitive with full fine-tuning on some tasks with 0.1% of the parameters. |
| Prompt tuning | optimize soft prompts | • Learns continuous prompt embeddings rather than discrete text • more parameter-efficient than prefix tuning • effectiveness increases with model scale. |
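
To make the LoRA row in Table 6 concrete, the sketch below computes a forward pass with a frozen weight `W0` plus a low-rank update `B A`, and compares trainable parameter counts. It assumes the usual conventions (column-vector input, `alpha / r` scaling, `B` initialized to zero); the helper names are hypothetical.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(x, W0, A, B, alpha, r):
    """h = W0 x + (alpha / r) * B (A x), with W0 frozen.

    W0: d_out x d_in, A: r x d_in, B: d_out x r.
    """
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))
    return [h + (alpha / r) * u for h, u in zip(base, update)]

def param_counts(d_out, d_in, r):
    """Return (full fine-tuning params, LoRA params) for one weight matrix."""
    return d_out * d_in, r * (d_in + d_out)

W0 = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, 1.0]]                      # rank r = 1
x = [1.0, 2.0]

# With B = 0 (the usual initialization) the adapted layer equals the base layer.
h_init = lora_forward(x, W0, A, [[0.0], [0.0]], alpha=2.0, r=1)
# After training, B is non-zero and shifts the output.
h_tuned = lora_forward(x, W0, A, [[1.0], [2.0]], alpha=2.0, r=1)

full, lora = param_counts(4096, 4096, 8)
```

For a single 4096×4096 projection at rank 8, the adapter trains 256× fewer parameters than the full matrix; the larger savings quoted in the table refer to whole-model totals.
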
Table 7: Alignment and RLHF
| Method | Example | Description |
|---|---|---|
| RLHF (PPO) | reward model → PPO training | • Three-stage process: SFT, train a reward model on human preferences, optimize the policy via PPO • used in ChatGPT and Claude • computationally expensive. |
| GRPO | sample G outputs → normalize rewards | • Samples a group of responses per prompt and estimates advantages by normalizing rewards within the group • eliminates the critic model, substantially reducing memory vs. PPO • used in DeepSeek-R1. |
| DPO | directly optimize preferences | • Bypasses the reward model by directly optimizing the policy on preference pairs using the Bradley-Terry model • simpler and more stable than PPO-based RLHF, often with comparable results. |
| SimPO | avg log-prob as implicit reward | • Removes the reference model by using the average log-probability of a response as the implicit reward • reported to outperform DPO by up to about 6 points on AlpacaEval 2 with no extra memory cost. |
| ORPO | combine SFT + preference in one loss | • Merges SFT and preference alignment into a single training objective using odds ratios • eliminates the reference model and the separate SFT stage: one pass instead of two. |
| KTO | thumbs-up / thumbs-down labels | • Works with binary feedback instead of pairwise preferences • cheaper data collection for production systems with like/dislike signals • derived from prospect theory. |
| RLVR (verifiable rewards) | unit test pass / math checker → reward | • Uses programmatic verifiers (unit tests, math checkers) as the reward signal instead of human labels • enables emergent self-reflection and verification for math and code tasks. |
| DAPO | clip-higher + token-level loss | • Extends GRPO for long chain-of-thought reasoning with entropy collapse prevention, dynamic sampling, and a token-level policy gradient • 50% fewer steps than DeepSeek-R1-Zero. |
| Constitutional AI | self-critique + revision | • Model critiques its own outputs against constitutional principles, then revises • reduces reliance on human feedback for harmlessness alignment. |
| RLAIF | AI-generated preferences | • Replaces human labelers with AI-generated feedback for reward model training • scales preference data collection • effective for capability-focused alignment. |
| Reward modeling | `score(response)` from pairs | • Trains a classifier to predict human preference between response pairs • converts subjective preferences into a scalar reward signal for RL optimization. |
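
Two of the objectives in Table 7 reduce to short formulas. The sketch below shows group-normalized advantages as used by GRPO and the per-pair DPO loss; it assumes sequence-level log-probabilities are already available, and the function names and the `beta` default are illustrative choices, not taken from a specific library.

```python
import math

def grpo_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * (policy log-ratio margin over the reference)).

    logp_w / logp_l: policy log-probs of the chosen / rejected response.
    ref_logp_*: the same quantities under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    if margin < -30.0:
        return -margin  # softplus(-margin) ~ -margin here; avoids overflow
    return math.log1p(math.exp(-margin))

# Four sampled answers to one prompt, scored 1 (correct) or 0 (incorrect).
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])

# Policy identical to the reference: loss starts at log(2).
initial_loss = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy now prefers the chosen response more than the reference does.
improved_loss = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

Because the advantages are centered within the group, no separate value (critic) network is needed to provide a baseline.
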
Table 8: Inference Optimization Techniques
| Technique | Example | Description |
|---|---|---|
| KV cache | cache computed K, V | • Stores previously computed key-value pairs during autoregressive generation • avoids redundant computation • essential for production inference efficiency. |
| Prompt (prefix) caching | cache system prompt KV once | • Reuses the KV cache for identical prompt prefixes across requests • skips re-computing static content like system prompts • up to 90% cost and 85% latency reduction for long prompts. |
| FlashAttention | tiling + recomputation | • IO-aware exact attention algorithm using block-wise computation and kernel fusion • reduces memory bandwidth by 5–20×, enabling 2–4× speedup on long sequences. |
| PagedAttention | block-level memory management | • Manages the KV cache in non-contiguous blocks, like OS paging • reduces memory fragmentation so more requests fit in a batch • the vLLM authors report 2–4× higher serving throughput than prior systems. |
| Speculative decoding | draft model → verify in parallel | • A small draft model generates candidate tokens that the large model verifies in parallel • 2–3× speedup without changing the output distribution. |
| Medusa | extra decoding heads predict t+1, t+2 | • Adds multiple prediction heads to an LLM and verifies candidates via tree-based attention in parallel • 2.2–3.6× speedup without a separate draft model. |
| Continuous batching | dynamic request batching | • Evicts finished sequences and immediately admits new ones (in-flight batching) • maximizes GPU utilization vs. static batching. |
| Quantization | convert FP16 → INT8 | • Reduces the precision of weights/activations to 8-bit or lower • 2–4× memory reduction and speedup with <1% accuracy loss using calibration. |
| FlashAttention-2/3 | improved kernel fusion | • Enhanced IO-aware kernels with lower SRAM usage and better parallelization • FlashAttention-3 is up to 2× faster than FlashAttention-2 on H100 using asynchrony optimizations. |
Table 9: Sampling and Decoding Strategies
| Strategy | Example | Description |
|---|---|---|
| Greedy decoding | `argmax(logits)` | • Selects the highest-probability token at each step • deterministic and fast but prone to repetitive, low-quality outputs for creative tasks. |
| Temperature | `logits / T` before softmax | • Scales logits by 1/T • lower T (0.1–0.7) sharpens the distribution • higher T (1.0–2.0) increases randomness/creativity. |
| Top-p (nucleus) sampling | cumulative prob >= p=0.9 | • Dynamically selects the smallest set of tokens whose cumulative probability reaches threshold p • adapts to the distribution shape • preferred for open-ended generation. |
| Top-k sampling | sample from top k=40 tokens | • Restricts sampling to the k highest-probability tokens • prevents sampling rare tokens, but a fixed k can be too restrictive or too permissive depending on the distribution. |
| Min-p sampling | filter tokens < min_p * max_prob | • Removes tokens whose probability is below a min_p fraction of the maximum token probability • often better than top-p at maintaining quality while allowing creativity. |
| Beam search | keep top-k sequences | • Maintains k parallel hypotheses and expands the most probable • improves sequence likelihood over greedy decoding but is computationally expensive • common for translation. |
| Contrastive search | `argmax(model_score - α·cos_sim)` | • Selects tokens that are probable but distinct from the previous context via a cosine similarity penalty • reduces repetition while maintaining coherence. |
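
The truncation strategies in Table 9 differ only in which tokens survive before renormalization. The sketch below applies temperature, top-k, top-p, and min-p to a small distribution; the function names are hypothetical and the example probabilities are made up for illustration.

```python
import math
import random

def softmax_with_temperature(logits, T=1.0):
    scaled = [x / T for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def _renormalize(probs, keep):
    """Zero out tokens not in `keep` and rescale the rest to sum to 1."""
    z = sum(probs[i] for i in keep)
    return [probs[i] / z if i in keep else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, set(order[:k]))

def top_p_filter(probs, p):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:  # smallest prefix whose mass reaches p
            break
    return _renormalize(probs, keep)

def min_p_filter(probs, min_p):
    threshold = min_p * max(probs)
    return _renormalize(probs, {i for i, q in enumerate(probs) if q >= threshold})

def sample(probs, rng=random):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = [0.5, 0.3, 0.15, 0.05]
after_top_k = top_k_filter(probs, 2)
after_top_p = top_p_filter(probs, 0.9)
after_min_p = min_p_filter(probs, 0.2)
token = sample(after_top_p, random.Random(0))
```

In practice these filters are usually chained after temperature scaling, and the final token is drawn from whatever distribution remains.
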
Table 10: Training Optimizations
| Technique | Example | Description |
|---|---|---|
| Mixed-precision training | compute in FP16/BF16, master weights in FP32 | • Uses 16-bit floats for activations and matrix multiplies while keeping FP32 master weights • can roughly halve activation memory and accelerates training on Tensor Cores • BF16 is usually preferred for stability. |
| Gradient checkpointing | recompute activations | • Trades compute for memory by recomputing forward activations during the backward pass instead of storing them • enables 2–10× larger models at a 20–30% slowdown. |
| Gradient accumulation | accumulate over N steps | • Simulates a larger effective batch size by accumulating gradients across microbatches before each update • crucial for training large models on limited memory. |
| AdamW | decouple weight decay | • Fixes weight decay in Adam by applying it directly to the weights, not the gradient • improves generalization • standard optimizer for transformer training. |
| Learning-rate warmup | linear increase 0 → max_lr | • Gradually increases the learning rate from zero over the initial steps (typically 2–10% of training) • stabilizes training of large models and prevents divergence. |
| Cosine annealing | `lr = min + 0.5(max-min)(1+cos(π·t/T))` | • Decreases the learning rate following a cosine curve over T steps • smooth decay helps the model converge to flatter minima • often combined with warmup. |
| GaLore | project grad to low-rank subspace | • Projects gradients into a low-rank subspace via periodic SVD • enables full-rank training with an optimizer memory footprint comparable to LoRA • no inference overhead. |
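
The warmup and cosine annealing rows of Table 10 are usually combined into a single schedule. A minimal sketch, assuming linear warmup followed by cosine decay to `min_lr` (the function name and the example hyperparameters are illustrative):

```python
import math

def lr_at(step, total_steps, warmup_steps, max_lr, min_lr=0.0):
    """Linear warmup from 0 to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# 1,000 training steps with a 10% warmup.
schedule = [lr_at(s, 1000, 100, max_lr=1e-3, min_lr=1e-5) for s in range(1001)]
```

The schedule peaks exactly at the end of warmup and reaches `min_lr` on the final step.
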
Table 11: Distributed Training Strategies
| Strategy | Example | Description |
|---|---|---|
| Data parallelism | replicate model across GPUs | • Each GPU holds a full model copy and processes different data batches • gradients are averaged across devices • simplest parallelism but memory-limited by a single GPU. |
| Tensor parallelism | split each layer across GPUs | • Partitions individual layers (e.g., attention heads) within the model across devices • requires frequent all-reduce communication • used in Megatron-LM for very large models. |
| ZeRO | shard optimizer states | • Partitions optimizer states, gradients, and parameters across devices with on-demand gathering • ZeRO-3 achieves the memory efficiency of model parallelism with the simplicity of data parallelism. |
| FSDP | PyTorch native ZeRO-3 | • PyTorch implementation of ZeRO-3 sharding • automatically manages parameter gathering/scattering • simpler API than DeepSpeed for distributed training. |
| Pipeline parallelism | layer stages on different GPUs | • Splits the model by layers into pipeline stages • microbatching reduces bubble overhead • GPipe, PipeDream frameworks. |
| Sequence parallelism | partition along sequence dim | • Splits the sequence length across GPUs for memory-constrained layers like LayerNorm • extends tensor parallelism, reducing activation memory for long sequences. |
Table 12: Model Compression Techniques
| Technique | Example | Description |
|---|---|---|
| Post-training quantization (PTQ) | GPTQ, AWQ | • Converts trained model weights to lower precision (INT4/8) without retraining • calibrated on a small dataset • 3–4× compression with minimal accuracy loss. |
| Knowledge distillation | student learns from teacher | • Trains a small student model to match the outputs or intermediate representations of a large teacher • compresses the model while retaining capabilities (e.g., DistilBERT). |
| Pruning | remove low-magnitude weights | • Removes unimportant weights or entire structured components (heads, layers) • can reduce parameters 40–60% but requires careful calibration or retraining. |
| Quantization-aware training (QAT) | simulate quantization during training | • Inserts fake quantization operations in the forward pass to model precision effects • typically achieves better accuracy than post-training methods. |
| Low-rank factorization | `W ≈ UV (rank r)` | • Approximates weight matrices as a product of low-rank matrices • reduces parameters in linear layers • basis of LoRA and similar PEFT methods. |
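
The simplest form of the weight quantization described in Table 12 is symmetric absmax rounding. The sketch below is an illustration of that baseline only (per-tensor scale, INT8 range); methods such as GPTQ and AWQ use calibration data and more careful error compensation. Function names are hypothetical.

```python
def quantize_absmax_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]
```

The rounding error per weight is bounded by half the scale, which is why a single large outlier (which inflates the scale) hurts precision for every other weight in the tensor.
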
Table 13: Context Window and Long-Context Techniques
| Technique | Example | Description |
|---|---|---|
| Retrieval-augmented generation (RAG) | retrieve docs → augment prompt | • Retrieves relevant documents from an external knowledge base and injects them into the context • extends effective knowledge beyond model limits • requires a good retrieval system. |
| RoPE base scaling (NTK-aware) | adjust rotation frequencies | • Modifies the RoPE base frequency to extrapolate to longer sequences • simple method enabling 2–4× context extension with minimal fine-tuning. |
| Position interpolation | compress position indices | • Interpolates positions within the training range rather than extrapolating • better stability than direct extrapolation for extended context. |
| YaRN | scale + adjust NTK base | • Combines frequency-dependent (NTK-by-parts) RoPE interpolation with an attention temperature adjustment • efficient context extension from 4K to 128K+ tokens. |
| Sliding window attention | attend to local window | • Each token attends only to a fixed-size window around its position • linear memory but limited long-range modeling • used in Longformer. |
| Sparse attention | attend to subset of positions | • Computes attention only for selected position pairs using patterns (local, strided, global) • reduces O(n²) complexity, enabling 10×+ longer sequences. |
| Compressive memory | compress past into memory | • Summarizes earlier context into a compressed memory state • enables unbounded context in theory but loses fine-grained information from the distant past. |
Table 14: Prompt Engineering Techniques
| Technique | Example | Description |
|---|---|---|
| Few-shot prompting | Example1, Example2, ... Query | • Demonstrates the task through 2–10 input-output examples in the prompt • exploits in-context learning • effectiveness grows with model scale • examples should be diverse. |
| Zero-shot prompting | `"Translate to French: [text]"` | • Provides the task instruction only, without examples • relies on pre-training and instruction tuning • quality is highly dependent on model capabilities and prompt clarity. |
| Chain-of-thought (CoT) | `"Let's think step by step"` | • Prompts the model to generate intermediate reasoning steps before the final answer • substantially improves performance on math, logic, and commonsense reasoning. |
| Self-consistency | sample multiple paths → vote | • Generates multiple reasoning paths, then selects the most consistent answer via majority voting • improves reliability over single-path CoT. |
| ReAct | Thought → Action → Observation | • Interleaves reasoning and tool use • the model generates thoughts, selects actions (API calls, searches), and observes results iteratively until a solution is found. |
| Tree of Thoughts (ToT) | explore reasoning tree | • Explores multiple reasoning branches with backtracking and evaluation • enables deliberate problem-solving for complex tasks requiring search. |
| Skeleton-of-Thought | outline → parallel expand | • Generates a skeleton outline first, then expands each point in parallel • reduces end-to-end latency by up to 2×. |
Table 15: Emergent Capabilities and Scaling
| Concept | Example | Description |
|---|---|---|
| Scaling laws | `L(N) ∝ N^(-α)` | • Loss scales as a power law with compute, model size, and dataset size • guides how to allocate training compute • vocabulary size also affects optimal scaling. |
| Compute-optimal training | Chinchilla scaling | • For a fixed compute budget, balanced scaling of model size and training tokens is optimal • suggests many models are undertrained relative to their size. |
| In-context learning | few-shot without gradients | • Ability to learn new tasks from examples in the prompt without parameter updates • improves with scale • the mechanism may involve induction heads in attention layers. |
| Emergent abilities | reasoning, arithmetic | • Capabilities that appear suddenly at scale and are not present in smaller models • includes in-context learning and chain-of-thought reasoning • debated whether they are truly emergent or metric artifacts. |
| Transfer learning | pre-train → fine-tune | • Pre-trained models encode general language understanding transferable to downstream tasks • foundation of modern NLP • larger models transfer better. |
Table 16: Activation Functions in Transformers
| Function | Example | Description |
|---|---|---|
| SwiGLU | `SwiGLU(x) = Swish(xW) ⊙ (xV)` | • Gated variant using the Swish activation (x·sigmoid(x)) with element-wise gating • used in LLaMA, PaLM • empirically outperforms GELU • requires ~50% more FFN parameters for the same hidden size. |
| GELU | `GELU(x) = x·Φ(x)` | • Smooth activation that weights the input by the Gaussian CDF • used in BERT, GPT-2 • better gradient properties than ReLU • probabilistic interpretation as neuron dropout. |
| GeGLU | `GeGLU(x) = GELU(xW) ⊙ (xV)` | • Similar to SwiGLU but uses GELU for gating • strong performance on language tasks • used in T5 variants. |
| ReLU | `ReLU(x) = max(0, x)` | • Simple piecewise linear function • the original Transformer used ReLU • computationally efficient but can suffer from dead neurons • largely replaced in modern LLMs. |
Table 17: Normalization Techniques
| Method | Example | Description |
|---|---|---|
| RMSNorm | `RMSNorm(x) = x / RMS(x) · γ` | • Simplified LN that removes mean centering and only normalizes by the RMS • 10–20% faster than LN with comparable performance • used in LLaMA, Grok, Qwen3. |
| LayerNorm | `LN(x) = (x - μ) / σ · γ + β` | • Normalizes across the feature dimension for each token independently • standard in transformers • mean/variance are computed per sample, allowing any batch size. |
| Pre-LN vs. Post-LN | `Pre: LN(x) → Sublayer` | • Pre-LN applies normalization before the sublayer (modern default, more stable) • Post-LN applies it after (original Transformer, requires warmup) • Pre-LN enables easier convergence. |
| Dropout | randomly zero with p=0.1 | • Randomly drops activations during training as regularization • less common in very large LLMs, which are underparameterized relative to their data • typical rates 0.1–0.2. |
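
The difference between the LayerNorm and RMSNorm formulas in Table 17 is easiest to see side by side. A minimal sketch on a single token vector, assuming unit gain and zero bias by default (the `eps` value is a typical choice, not one given in the source):

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Center by the mean, scale by the standard deviation, then apply gain/bias."""
    d = len(x)
    gamma = gamma or [1.0] * d
    beta = beta or [0.0] * d
    mu = sum(x) / d
    var = sum((v - mu) ** 2 for v in x) / d
    return [g * (v - mu) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma=None, eps=1e-5):
    """Scale by the root mean square only; no mean subtraction, no bias."""
    d = len(x)
    gamma = gamma or [1.0] * d
    rms = math.sqrt(sum(v * v for v in x) / d + eps)
    return [g * v / rms for v, g in zip(x, gamma)]

x = [1.0, 2.0, 3.0, 4.0]
ln_out = layer_norm(x)
rms_out = rms_norm(x)
```

LayerNorm output has zero mean, while RMSNorm output keeps the sign pattern and relative offsets of the input and only fixes its overall magnitude, which is where the saved computation comes from.
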
Table 18: Evaluation Metrics and Benchmarks
| Metric | Example | Description |
|---|---|---|
| MMLU | 57 subjects, 4-way multiple choice | • Tests knowledge and reasoning across STEM, humanities, social sciences • standard benchmark for general capabilities • 0–100% accuracy. |
| GPQA | 448 PhD-level science questions | • Tests doctoral-level knowledge in biology, physics, chemistry • designed to be hard even with internet access • nearing saturation for frontier models (~94% as of 2026). |
| HumanEval | code synthesis benchmark | • Evaluates code generation with 164 hand-written programming problems • the pass@k metric measures functional correctness • standard for coding models. |
| SWE-bench | GitHub issue → code fix | • Tests real-world software engineering by resolving GitHub issues in Python repos • measures the fraction of issues resolved • key agentic coding benchmark. |
| LiveBench | monthly refreshed questions | • Contamination-resistant benchmark with questions refreshed monthly using recent data sources • covers math, coding, reasoning, language, data analysis. |
| Chatbot Arena | human preference pairwise ranking | • Crowdsourced pairwise preference evaluation • Elo-rated from millions of blind votes • strong signal for real-world conversational quality. |
| Perplexity | `PPL = exp(avg_loss)` | • Measures how surprised the model is by test data • lower is better • exponential of the average cross-entropy loss • standard language modeling metric. |
| BLEU | n-gram precision with brevity penalty | • Compares n-gram overlap between generated and reference translations • 0–100 scale • standard for machine translation evaluation. |
| ROUGE | ROUGE-L, ROUGE-N | • Measures recall-oriented n-gram overlap • primarily for summarization • ROUGE-L uses the longest common subsequence. |
| BERTScore | contextual embedding similarity | • Computes token similarity using BERT embeddings rather than exact matches • captures semantic similarity better than n-gram metrics. |
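
Two metrics from Table 18 have closed-form definitions that fit in a few lines. The sketch below computes perplexity from per-token log-probabilities and pass@k using the standard unbiased estimator `1 - C(n-c, k) / C(n, k)` for `n` samples of which `c` pass; the function names and example numbers are illustrative.

```python
import math

def perplexity(token_log_probs):
    """exp of the average negative log-likelihood (natural log) per token."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

def pass_at_k(n, c, k):
    """Probability that at least one of k samples (drawn from n, c correct) passes."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws without a pass
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# A model that assigns probability 1/4 to every token has perplexity 4.
ppl = perplexity([math.log(0.25)] * 5)

# 10 generated solutions, 3 pass the unit tests.
p1 = pass_at_k(10, 3, 1)
p5 = pass_at_k(10, 3, 5)
```

Perplexity can be read as the effective number of equally likely choices per token, and pass@k rises with k because more attempts are allowed.
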
Table 19: Attention Mechanism Optimizations
| Optimization | Example | Description |
|---|---|---|
| Grouped-query attention (GQA) | heads share K, V in groups | • Compromise between MQA and MHA: groups of heads share K, V • balances the quality-efficiency tradeoff • used in LLaMA-2/3, Mistral, Qwen3. |
| Multi-head latent attention (MLA) | compress KV to latent → cache | • Compresses keys and values into a low-rank latent vector (joint KV compression) before caching • reduces the KV cache by up to 93% vs. MHA with better modeling quality • used in DeepSeek-V3, Kimi K2. |
| Multi-query attention (MQA) | single K, V across heads | • Shares the same key-value projections across all heads, using only multiple queries • reduces KV cache memory and speeds up inference, with slightly lower quality than GQA. |
| FlashAttention-2/3 | improved kernel fusion | • Enhanced IO-aware kernels with lower SRAM usage and better parallelization • FlashAttention-3 is up to 2× faster than FlashAttention-2 on H100 using asynchrony optimizations. |
| Local (sliding window) attention | attend to k nearest tokens | • Restricts attention to a fixed-size local window • reduces complexity to O(n·k) from O(n²) • enables longer sequences but limits global context. |
| Linear attention | kernel trick approximation | • Approximates softmax attention using kernel methods, reducing complexity to O(n) • enables efficient very long sequences, but quality gaps remain vs. full attention. |
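
The memory trade-off between MHA, GQA, and MQA in Table 19 follows directly from the number of key-value heads that must be cached. A back-of-the-envelope sketch, using a hypothetical 32-layer model with 32 query heads of dimension 128 and FP16 cache entries:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=2):
    """Size of the KV cache: one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical model shape, chosen only for illustration.
LAYERS, HEAD_DIM, SEQ = 32, 128, 4096

mha = kv_cache_bytes(LAYERS, 32, HEAD_DIM, SEQ)  # every head has its own K, V
gqa = kv_cache_bytes(LAYERS, 8, HEAD_DIM, SEQ)   # 8 KV groups shared by 32 heads
mqa = kv_cache_bytes(LAYERS, 1, HEAD_DIM, SEQ)   # one K, V shared by all heads

sizes_mib = {name: size / 2**20
             for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]}
```

For this shape the cache per 4,096-token sequence is 2 GiB with MHA, 512 MiB with GQA, and 64 MiB with MQA, which is why KV sharing directly raises the batch size a server can hold.
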
Table 20: Advanced Training Techniques
| Technique | Example | Description |
|---|---|---|
| Multi-task learning | train on multiple tasks jointly | • Shares parameters across tasks, expecting positive transfer • T5 frames everything as text-to-text • requires balanced sampling and task weighting. |
| Curriculum learning | easy → hard examples | • Orders training data from simple to complex • can improve convergence and final performance • domain-specific curriculum design is needed. |
| Continual learning | incremental data updates | • Updates the model on new data without forgetting previous knowledge • addresses catastrophic forgetting through rehearsal, regularization, or architectural solutions. |
| Contrastive learning | SimCLR, CLIP | • Learns representations by contrasting positive pairs against negatives • CLIP aligns text-image pairs • effective for self-supervised and multimodal learning. |
| Data augmentation | backtranslation, paraphrasing | • Generates synthetic training variations from existing data • back-translation, EDA, GPT-generated examples • particularly useful for low-resource tasks. |
Table 21: Model Merging Techniques
| Technique | Example | Description |
|---|---|---|
| SLERP | t=0.5 between model A and B | • Smoothly interpolates between two models' weights along a spherical path, preserving geometric properties • best for high-quality pairwise merges • limited to two models at a time. |
| TIES-Merging | trim → elect sign → disjoint merge | • Three-step process: trim redundant parameters, elect the dominant sign direction, merge aligned parameters • handles multi-model merging by resolving parameter conflicts. |
| DARE | drop delta weights p=0.9, rescale | • Randomly drops task-vector delta weights, then rescales the remainder by 1/(1−p) • effective even when dropping 90–99% of deltas • used to augment TIES or Task Arithmetic. |
| Task arithmetic | `task_vector = fine_tuned - pretrained` | • Computes task vectors (delta weights) and combines them via arithmetic • add vectors to merge capabilities, negate to remove behaviors • simple and composable. |
| Passthrough (frankenmerge) | layers 0-32 of A + 24-32 of B | • Concatenates layers from different models to create a frankenmerge with exotic parameter counts (e.g., 9B from two 7B models) • experimental but can produce capable models. |
| Evolutionary merging | evolutionary search over merge configs | • Uses evolutionary algorithms to automatically discover merging recipes and hyperparameters • 50× cost reduction via MERGE³ on a single GPU • produces SOTA merged models. |
| Model soups | average weights of N fine-tuned models | • Averages the weights of multiple fine-tuned versions of the same base model • improves accuracy without increasing inference cost • the greedy variant evaluates each addition. |
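
Two of the merging recipes in Table 21 can be written directly from their definitions. The sketch below treats each model as a flat list of weights, which is a simplification: real merging tools apply these operations per tensor. The function names and the toy weights are illustrative assumptions.

```python
import math

def task_arithmetic(base, finetuned_models, scale=1.0):
    """merged = base + scale * sum(fine_tuned - base) over all models."""
    merged = list(base)
    for ft in finetuned_models:
        for i, (w_ft, w_base) in enumerate(zip(ft, base)):
            merged[i] += scale * (w_ft - w_base)
    return merged

def slerp(a, b, t, eps=1e-8):
    """Spherical linear interpolation between weight vectors a and b."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))
    if omega < eps:
        # Nearly parallel vectors: fall back to plain linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    s = math.sin(omega)
    wa = math.sin((1 - t) * omega) / s
    wb = math.sin(t * omega) / s
    return [wa * x + wb * y for x, y in zip(a, b)]

base = [1.0, 1.0]
merged = task_arithmetic(base, [[2.0, 1.0], [1.0, 3.0]])

halfway = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

Unlike a straight average of the two unit vectors, the SLERP midpoint keeps unit norm, which is the geometric property the table row refers to.
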
Table 22: LLM Agent Concepts
| Concept | Example | Description |
|---|---|---|
| Tool use (function calling) | `{"function": "search", "args": {...}}` | • LLM selects and invokes external functions (APIs, search, code execution) via structured JSON • the defining capability separating conversational models from agents. |
| ReAct loop | Thought → Action → Observation → … | • Agent iterates Thought → Action → Observation cycles until the task is complete • interleaves reasoning and acting for grounded multi-step problem solving. |
| Memory | context window + vector store | • Short-term: in-context window, cleared after the session • Long-term: external vector store or database enabling retrieval across sessions. |
| Planning (task decomposition) | task → subtask1, subtask2, ... | • Agent breaks large tasks into manageable subtasks via chain-of-thought or tree-of-thought • can reflect and revise plans based on intermediate results. |
| Multi-agent systems | orchestrator → specialist agents | • Multiple LLM agents collaborate: one orchestrates, others specialize • improves performance on tasks requiring diverse expertise or parallel execution. |
| Agentic RAG | agent decides when/what to retrieve | • Agent dynamically decides when to retrieve, what to search, and how to use results • contrasts with static one-shot RAG by iterating retrieval based on partial answers. |
| Agentic RL | env reward → policy update | • Trains agents using verifiable environment rewards (tool execution outcomes, test pass/fail) • enables agents to discover effective multi-step strategies without human demonstrations. |
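
The structured tool call shown in Table 22 needs a small piece of host code that parses the model's JSON and runs the requested function. The sketch below follows the `{"function": ..., "args": ...}` shape from the table; the tool registry, the two example tools, and the error format are hypothetical, and a real agent framework would add schema validation and sandboxing.

```python
import json

def add(a, b):
    return a + b

def word_count(text):
    return len(text.split())

# Hypothetical registry mapping tool names to Python callables.
TOOLS = {"add": add, "word_count": word_count}

def dispatch(model_output):
    """Parse a JSON tool call emitted by the model and execute it.

    Returns an observation dict that can be fed back into the next
    Thought → Action → Observation step.
    """
    try:
        call = json.loads(model_output)
        fn = TOOLS[call["function"]]
        return {"ok": True, "result": fn(**call["args"])}
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        # Malformed JSON, unknown tool, or wrong arguments: report, don't crash.
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}

good = dispatch('{"function": "add", "args": {"a": 2, "b": 3}}')
unknown = dispatch('{"function": "delete_everything", "args": {}}')
malformed = dispatch("not json at all")
```

Returning errors as observations rather than raising lets the agent see what went wrong and retry with a corrected call.
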
- Byte Pair Encoding Implementation Guide - Hugging Face - https://huggingface.co/learn/nlp-course/chapter6/5
Training and Optimization
- Decoupled Weight Decay Regularization (AdamW) - https://arxiv.org/abs/1711.05101
- Mixed Precision Training - https://arxiv.org/abs/1710.03740
- Training with Gradient Checkpointing - https://arxiv.org/abs/1604.06174
- Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour (Warmup) - https://arxiv.org/abs/1706.02677
- SGDR: Stochastic Gradient Descent with Warm Restarts - https://arxiv.org/abs/1608.03983
- Scaling Laws for Neural Language Models - https://arxiv.org/abs/2001.08361
- Training Compute-Optimal Large Language Models (Chinchilla) - https://arxiv.org/abs/2203.15556
- Scaling Laws with Vocabulary Size - https://arxiv.org/abs/2407.13623
- Layer Normalization - https://arxiv.org/abs/1607.06450
- Root Mean Square Layer Normalization (RMSNorm) - https://arxiv.org/abs/1910.07467
- On Layer Normalization in the Transformer Architecture (Pre-LN) - https://arxiv.org/abs/2002.04745
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection - https://arxiv.org/abs/2403.03507
- Multi-Token Prediction (Better & Faster LLMs) - https://arxiv.org/abs/2404.19737
Fine-Tuning and Adaptation
- LoRA: Low-Rank Adaptation of Large Language Models - https://arxiv.org/abs/2106.09685
- QLoRA: Efficient Finetuning of Quantized LLMs - https://arxiv.org/abs/2305.14314
- DoRA: Weight-Decomposed Low-Rank Adaptation - https://arxiv.org/abs/2402.09353
- Parameter-Efficient Transfer Learning (Adapter Modules) - https://arxiv.org/abs/1902.00751
- Prefix-Tuning: Optimizing Continuous Prompts - https://arxiv.org/abs/2101.00190
- The Power of Scale for Parameter-Efficient Prompt Tuning - https://arxiv.org/abs/2104.08691
- P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning - https://arxiv.org/abs/2110.07602
- Finetuned Language Models Are Zero-Shot Learners (FLAN) - https://arxiv.org/abs/2109.01652
- Scaling Instruction-Finetuned Language Models - https://arxiv.org/abs/2210.11416
- Hugging Face PEFT Library - https://github.com/huggingface/peft
Alignment and RLHF
- Training Language Models to Follow Instructions (InstructGPT/RLHF) - https://arxiv.org/abs/2203.02155
- Direct Preference Optimization (DPO) - https://arxiv.org/abs/2305.18290
- SimPO: Simple Preference Optimization with a Reference-Free Reward - https://arxiv.org/abs/2405.14734
- ORPO: Monolithic Preference Optimization without Reference Model - https://arxiv.org/abs/2403.07691
- KTO: Model Alignment as Prospect Theoretic Optimization - https://arxiv.org/abs/2402.01306
- Constitutional AI: Harmlessness from AI Feedback - https://arxiv.org/abs/2212.08073
- RLAIF: Scaling Reinforcement Learning from Human Feedback - https://arxiv.org/abs/2309.00267
- Learning to Summarize from Human Feedback - https://arxiv.org/abs/2009.01325
- Training a Helpful and Harmless Assistant with RLHF - https://arxiv.org/abs/2204.05862
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning (GRPO) - https://arxiv.org/abs/2402.03300
- DAPO: An Open-Source LLM Reinforcement Learning System - https://arxiv.org/abs/2503.14476
- Post-Training in 2026: GRPO, DAPO, RLVR & Beyond - https://llm-stats.com/blog/research/post-training-techniques-2026
- DPO Variants: IPO, KTO, ORPO - https://mbrenndoerfer.com/writing/dpo-variants-ipo-kto-orpo-cdpo-llm-alignment
- Kimi k1.5: Scaling Reinforcement Learning with LLMs - https://arxiv.org/abs/2501.12599
- OLMo 2: Fully Open Language Models - https://arxiv.org/abs/2501.00656
- Group Relative Policy Optimization (GRPO) - Illustrated Breakdown - https://epichka.com/blog/2025/grpo/
- GRPO Deep Dive - Cameron Wolfe - https://cameronrwolfe.substack.com/p/grpo
Inference and Optimization
- Fast Inference from Transformers via Speculative Decoding - https://arxiv.org/abs/2211.17192
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads - https://arxiv.org/abs/2401.10774
- GPTQ: Accurate Post-Training Quantization - https://arxiv.org/abs/2210.17323
- AWQ: Activation-aware Weight Quantization - https://arxiv.org/abs/2306.00978
- SmoothQuant: Accurate and Efficient Post-Training Quantization - https://arxiv.org/abs/2211.10438
- LLM.int8(): 8-bit Matrix Multiplication for Transformers - https://arxiv.org/abs/2208.07339
- Continuous Batching for LLM Inference - https://www.anyscale.com/blog/continuous-batching-llm-inference
- Prefix Caching - BentoML LLM Inference Handbook - https://bentoml.com/llm/inference-optimization/prefix-caching
- LLM Inference Optimization Techniques - Redwerk - https://redwerk.com/blog/llm-inference-optimization-techniques/
- Prompt Caching: Up to 90% Cost Reduction - https://introl.com/blog/prompt-caching-infrastructure-llm-cost-latency-reduction-guide-2025
- KV Cache Optimization Guide - https://blog.dailydoseofds.com/p/a-practical-deep-dive-on-llm-inference
- LLM Inference Optimization Guide - Morphllm - https://www.morphllm.com/llm-inference-optimization
Distributed Training
- Megatron-LM: Training Multi-Billion Parameter Language Models - https://arxiv.org/abs/1909.08053
- GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism - https://arxiv.org/abs/1811.06965
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models - https://arxiv.org/abs/1910.02054
- PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel - https://arxiv.org/abs/2304.11277
- Reducing Activation Recomputation in Large Transformer Models (Sequence Parallelism) - https://arxiv.org/abs/2205.05198
Model Compression
- Distilling the Knowledge in a Neural Network - https://arxiv.org/abs/1503.02531
- SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot - https://arxiv.org/abs/2301.00774
- Wanda: A Simple and Effective Pruning Approach for LLMs - https://arxiv.org/abs/2306.11695
Long Context and RAG
- Retrieval-Augmented Generation for Knowledge-Intensive NLP - https://arxiv.org/abs/2005.11401
- Longformer: The Long-Document Transformer - https://arxiv.org/abs/2004.05150
- Extending Context Window via Position Interpolation - https://arxiv.org/abs/2306.15595
- YaRN: Efficient Context Window Extension - https://arxiv.org/abs/2309.00071
- Lost in the Middle: How Language Models Use Long Contexts - https://arxiv.org/abs/2307.03172
Prompt Engineering
- Chain-of-Thought Prompting Elicits Reasoning - https://arxiv.org/abs/2201.11903
- Self-Consistency Improves Chain of Thought Reasoning - https://arxiv.org/abs/2203.11171
- Tree of Thoughts: Deliberate Problem Solving - https://arxiv.org/abs/2305.10601
- ReAct: Synergizing Reasoning and Acting in Language Models - https://arxiv.org/abs/2210.03629
- Skeleton-of-Thought: LLMs Can Do Parallel Decoding - https://arxiv.org/abs/2307.15337
- The Prompt Report: A Systematic Survey - https://arxiv.org/abs/2406.06608
- Prompt Engineering Guide - https://www.promptingguide.ai/
Sampling and Decoding
- The Curious Case of Neural Text Degeneration (Nucleus Sampling) - https://arxiv.org/abs/1904.09751
- Hierarchical Neural Story Generation (Top-k) - https://arxiv.org/abs/1805.04833
- Contrastive Search for Better Language Generation - https://arxiv.org/abs/2210.14140
- Min-p Sampling: Balancing Quality and Diversity - https://arxiv.org/abs/2407.01082
Emergent Capabilities and Scaling
- Emergent Abilities of Large Language Models - https://arxiv.org/abs/2206.07682
- Are Emergent Abilities a Mirage? - https://arxiv.org/abs/2304.15004
- In-context Learning and Induction Heads - https://arxiv.org/abs/2209.11895
- A Survey of Large Language Models - https://arxiv.org/abs/2303.18223
Multimodal and Vision-Language
- Learning Transferable Visual Models From Natural Language (CLIP) - https://arxiv.org/abs/2103.00020
- Flamingo: a Visual Language Model for Few-Shot Learning - https://arxiv.org/abs/2204.14198
- Visual Instruction Tuning (LLaVA) - https://arxiv.org/abs/2304.08485
- Gemini: A Family of Highly Capable Multimodal Models - https://arxiv.org/abs/2312.11805
Activation Functions
- Gaussian Error Linear Units (GELUs) - https://arxiv.org/abs/1606.08415
- GLU Variants Improve Transformer (SwiGLU/GeGLU) - https://arxiv.org/abs/2002.05202
- Swish: A Self-Gated Activation Function - https://arxiv.org/abs/1710.05941
Architecture Variants
- Outrageously Large Neural Networks: The Sparsely-Gated MoE Layer - https://arxiv.org/abs/1701.06538
- Switch Transformers: Scaling to Trillion Parameter Models - https://arxiv.org/abs/2101.03961
- Mixtral of Experts - https://arxiv.org/abs/2401.04088
- GLaM: Efficient Scaling of Language Models with Mixture-of-Experts - https://arxiv.org/abs/2112.06905
Evaluation and Benchmarks
- Measuring Massive Multitask Language Understanding (MMLU) - https://arxiv.org/abs/2009.03300
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark - https://arxiv.org/abs/2311.12022
- GPQA Benchmark Scores 2026 - BenchLM.ai - https://benchlm.ai/benchmarks/gpqa
- Evaluating Large Language Models Trained on Code (HumanEval) - https://arxiv.org/abs/2107.03374
- SWE-bench: Can Language Models Resolve Real GitHub Issues? - https://arxiv.org/abs/2310.06770
- LiveBench: A Challenging, Contamination-Free LLM Benchmark - https://arxiv.org/abs/2406.19314
- LiveBench Leaderboard - https://livebench.ai/
- Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference - https://chat.lmsys.org/
- BERTScore: Evaluating Text Generation with BERT - https://arxiv.org/abs/1904.09675
- BLEU: a Method for Automatic Evaluation of Machine Translation - https://aclanthology.org/P02-1040/
- ROUGE: A Package for Automatic Evaluation of Summaries - https://aclanthology.org/W04-1013/
Model Merging
- Merge Large Language Models with mergekit - Hugging Face Blog - https://huggingface.co/blog/mlabonne/merge-models
- TIES-Merging: Resolving Interference When Merging Models - https://arxiv.org/abs/2306.01708
- Language Models are Super Mario: Absorbing Abilities with DARE - https://arxiv.org/abs/2311.03099
- Editing Models with Task Arithmetic - https://arxiv.org/abs/2212.04089
- Model Soups: Averaging Weights of Multiple Fine-Tuned Models - https://arxiv.org/abs/2203.05482
- Evolutionary Optimization of Model Merging Recipes - Nature Machine Intelligence - https://www.nature.com/articles/s42256-024-00975-8
- An Introduction to Model Merging for LLMs - NVIDIA Technical Blog - https://developer.nvidia.com/blog/an-introduction-to-model-merging-for-llms/
- mergekit - Arcee AI - https://github.com/arcee-ai/mergekit
- Model Merging for LLMs 2026 - Zylos Research - https://zylos.ai/research/2026-01-24-model-merging-llm
Agentic AI and Tool Use
- ReAct: Synergizing Reasoning and Acting - https://arxiv.org/abs/2210.03629
- LLM Agents: The Ultimate Guide 2026 - SuperAnnotate - https://www.superannotate.com/blog/llm-agents
- Agentic Artificial Intelligence: Architectures, Taxonomies - https://arxiv.org/html/2601.12560v1
- Tool Use and Function Calling in AI Agents 2026 - Zylos Research - https://zylos.ai/research/2026-04-07-tool-use-function-calling-standards-benchmarks
Technical Blogs and Tutorials
- The Illustrated Transformer - Jay Alammar - https://jalammar.github.io/illustrated-transformer/
- Understanding and Coding Self-Attention - Sebastian Raschka - https://magazine.sebastianraschka.com/p/understanding-and-coding-self-attention
- LLM Training Guide - Hugging Face (StackLLaMA) - https://huggingface.co/blog/stackllama
- DeepSpeed Documentation - Microsoft - https://www.deepspeed.ai/
- Megatron-LM Training Guide - NVIDIA - https://docs.nvidia.com/megatron-core/
- vLLM Inference Server - UC Berkeley - https://docs.vllm.ai/
- Understanding Encoder and Decoder LLMs - Sebastian Raschka - https://magazine.sebastianraschka.com/p/understanding-encoder-and-decoder
- Flash Attention Explained - DataCamp - https://www.datacamp.com/blog/flash-attention
- LLMs in 2026: What's Real, What's Hype - Infotech - https://www.infotech.com/digital-disruption/llms-in-2026-what-s-real-what-s-hype-and-what-s-coming-next
- Large Language Models and AI Engineering in 2026 - The AI Cowboys - https://theaicowboys.com/blog/large-language-models-llms-ai-engineering-2026
Advanced Topics
- Contrastive Learning with SimCLR - https://arxiv.org/abs/2002.05709
- Knowledge Distillation Survey - https://arxiv.org/abs/2006.05525
- Curriculum Learning for NLP - https://arxiv.org/abs/2101.10382
- Continual Learning for LLMs - https://arxiv.org/abs/2302.00487
- A Survey on In-context Learning - https://arxiv.org/abs/2301.00234
- State Space Models (Mamba) - https://arxiv.org/abs/2312.00752
- Transformer Quality in Linear Time - https://arxiv.org/abs/2202.10447
- Reasoning Models Generate Societies of Thought (DeepSeek-R1) - https://arxiv.org/html/2601.10825v1
- DeepSeek-R1 incentivizes reasoning through pure RL - Nature - https://www.nature.com/articles/s41586-025-09422-z
Industry Resources
- OpenAI API Documentation - https://platform.openai.com/docs/
- Anthropic Claude Documentation - https://docs.anthropic.com/
- Google Gemini Technical Report - https://deepmind.google/technologies/gemini/
- Meta LLaMA Model Card - https://github.com/facebookresearch/llama
- Mistral AI Documentation - https://docs.mistral.ai/
- Cohere LLM Documentation - https://docs.cohere.com/
- Together AI Platform - https://docs.together.ai/
- Weights & Biases LLM Training - https://wandb.ai/site/solutions/llmops
Video Resources
- Andrej Karpathy's Neural Networks: Zero to Hero - https://karpathy.ai/zero-to-hero.html
- Stanford CS324 - Large Language Models - https://stanford-cs324.github.io/winter2022/
- Stanford CS336 Language Modeling from Scratch Spring 2026 - https://www.youtube.com/watch?v=lVynu4bo1rY
- DeepLearning.AI LLM Courses - https://www.deeplearning.ai/courses/
- How to Train LLMs to Think (o1 & DeepSeek-R1) - YouTube - https://www.youtube.com/watch?v=RveLjcNl0ds
GitHub Repositories
- transformers - Hugging Face - https://github.com/huggingface/transformers
- llama - Meta AI - https://github.com/facebookresearch/llama
- flash-attention - Dao-AILab - https://github.com/Dao-AILab/flash-attention
- vllm - UC Berkeley - https://github.com/vllm-project/vllm
- DeepSpeed - Microsoft - https://github.com/microsoft/DeepSpeed
- Megatron-LM - NVIDIA - https://github.com/NVIDIA/Megatron-LM
- peft - Hugging Face - https://github.com/huggingface/peft
- axolotl - OpenAccess AI Collective - https://github.com/OpenAccess-AI-Collective/axolotl
- llama.cpp - ggerganov - https://github.com/ggerganov/llama.cpp
- Medusa - FasterDecoding - https://github.com/FasterDecoding/Medusa
- mergekit - Arcee AI - https://github.com/arcee-ai/mergekit
Research Conferences and Archives
- NeurIPS 2025 Papers - https://neurips.cc/
- ICLR 2026 Papers - https://iclr.cc/
- ACL 2026 Findings - https://aclanthology.org/
- ICML 2025 Proceedings - https://icml.cc/
- arXiv cs.CL Recent Papers - https://arxiv.org/list/cs.CL/recent