Large Language Models (LLMs) Cheat Sheet


Updated 2026-04-28


Large Language Models are transformer-based neural networks trained on massive text datasets to generate, understand, and manipulate human language at scale. At their core, LLMs use self-attention mechanisms to capture contextual relationships between tokens, enabling them to perform tasks ranging from translation and summarization to code generation and complex reasoning. The field has evolved rapidly—from foundational pre-training on trillions of tokens, through specialized fine-tuning and alignment techniques, to sophisticated reasoning models trained with reinforcement learning from verifiable rewards. A key insight: LLMs don't simply memorize text—their emergent abilities to reason in-context and solve unseen problems arise from the interplay of architecture, scale, and post-training dynamics, with 2025–2026 marking a shift toward agentic, multimodal, and long-context systems.

Quick Index: 147 entries · 22 tables


Table 1: Core Transformer Architecture

Every modern LLM is built from this handful of components stacked dozens of times over. Self-attention is the heart of it — letting each token weigh every other token — and the rest exist to make that work at depth and scale: feed-forward layers add non-linear capacity, residual connections and normalization keep gradients flowing through hundreds of layers, and positional encoding restores the word-order information that attention alone throws away. Master these and you understand the skeleton shared by GPT, LLaMA, and Qwen alike.

Component	Example	Description
Self-attention	`scores = Q @ K.T / sqrt(d_k)`	• Computes query-key-value relationships where each token attends to all positions • forms the foundation of transformer parallelization by replacing recurrence.
Multi-head attention	`heads = 8` `Q, K, V = Linear(x, d_k)`	• Splits attention into multiple parallel heads learning different representation subspaces • each head computes scaled dot-product attention independently, then concatenates results.
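Causal (masked) attention	`mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)`	• Prevents tokens from attending to future positions by adding a strictly upper-triangular mask of -inf to the attention scores (the diagonal stays unmasked so each token can attend to itself) • critical for autoregressive generation in decoder-only models like GPT.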
Cross-attention	`encoder_output → decoder`	• Used in encoder-decoder models where decoder queries attend to encoder keys/values • enables translation and seq2seq by bridging input-output representations.
Feed-forward network (FFN)	`FFN(x) = ReLU(xW1 + b1)W2 + b2`	• Two-layer MLP applied position-wise after attention • typically expands dimension 4× then projects back, providing non-linearity and feature transformation.
Residual connections	`output = x + SubLayer(x)`	• Adds input directly to sublayer output enabling gradient flow through deep networks • prevents vanishing gradients and allows training of 100+ layer transformers.
Layer normalization	`LN(x) = γ(x - μ) / σ + β`	• Normalizes activations across feature dimension for each token • stabilizes training and enables higher learning rates.
Positional encoding	`PE = sin(pos/10000^(2i/d))`	• Injects order information into token embeddings • without explicit position signals, self-attention is permutation-equivariant and cannot tell which token came first.
QK-Norm	`q = RMSNorm(q); k = RMSNorm(k)`	• Applies RMS normalization to query and key vectors before dot-product attention • stabilizes training of very large models and is used in Qwen3, Trinity Large.
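
As a rough illustration of how the self-attention and causal-mask rows fit together, here is a minimal single-head sketch in plain Python. It assumes the queries, keys, and values are already projected (the toy input reuses `x` for all three), and the function names are illustrative rather than taken from any library.

```python
import math

def softmax(row):
    # Subtract the max for numerical stability; exp(-inf) evaluates to 0.0.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(Q, K, V):
    """Single-head attention over lists of vectors; position i sees only j <= i."""
    d_k = len(K[0])
    out, weights = [], []
    for i, q in enumerate(Q):
        scores = []
        for j, k in enumerate(K):
            if j > i:
                scores.append(float("-inf"))  # mask future positions
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k))
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of value vectors, one output feature at a time.
        out.append([sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))])
    return out, weights

# Toy input: three tokens with two features each (no learned projections).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = causal_self_attention(x, x, x)
```

The first token can only attend to itself, so its attention row is `[1.0, 0.0, 0.0]` and its output equals its own value vector. Real implementations batch this as matrix multiplications and run several heads in parallel.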

Table 2: Positional Encoding Variants

Since attention treats a sequence as an unordered bag of tokens, a model needs some way to know which word came first — and the choice of how to inject that order has outsized consequences for how far a model can extrapolate beyond its training length. These variants trace the field's evolution from the original fixed sinusoids and learned absolute embeddings toward relative schemes like RoPE and ALiBi that generalize to longer contexts, and even NoPE, which drops explicit position signals entirely.

Method	Example	Description
RoPE (Rotary Position Embedding)	`rotate(q) @ rotate(k).T`	• Applies rotation matrices to query-key pairs so attention scores depend on relative position (a rotation in the complex plane) • used in LLaMA, GPT-NeoX • extends to longer contexts when paired with scaling methods such as position interpolation or YaRN.
Learned absolute	`pos_emb = Embedding(max_len, d_model)`	• Trainable position embeddings learned during training • used in BERT and GPT-2 • limited to max_len seen during training without extrapolation.
Absolute sinusoidal	`PE(pos, 2i) = sin(pos/10000^(2i/d))` `PE(pos, 2i+1) = cos(pos/10000^(2i/d))`	• Original Transformer encoding using fixed sine/cosine functions at different frequencies • provides unique position signals but doesn't explicitly encode relative distances.
ALiBi (Attention with Linear Biases)	`attn + bias * (-1, -2, -3, ...)`	• Adds linear penalty to attention scores based on distance • no position embeddings needed • excellent extrapolation to longer sequences than training context.
Relative positional encoding	`bias = learned_bias[k - q]`	• Encodes distance between positions rather than absolute indices • better length generalization but computationally expensive for long sequences.
NoPE (No Positional Encoding)	`no pos embeddings in global layers`	• Eliminates positional embeddings in global attention layers entirely • relies on architectural bias and training for order • used in SmolLM3 global attention layers.
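
To make the "relative position" property of RoPE concrete, the sketch below rotates each consecutive feature pair by an angle proportional to the token position and checks that the query-key dot product depends only on the offset between the two positions. The helper names and the toy vectors are assumptions for illustration; the base of 10000 follows the sinusoidal row above.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by angle pos * theta."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)      # per-pair frequency, i.e. base^(-2k/d) with i = 2k
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        x0, x1 = vec[i], vec[i + 1]
        out += [x0 * c - x1 * s, x0 * s + x1 * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
score_a = dot(rope(q, 5), rope(k, 2))        # positions 5 and 2, offset 3
score_b = dot(rope(q, 105), rope(k, 102))    # same offset, shifted by 100
```

Both scores agree up to floating-point error, which is the sense in which RoPE encodes relative rather than absolute position.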

Table 3: Model Architecture Variants

The same transformer building blocks can be wired in fundamentally different ways, and that wiring decides what a model is good at. Decoder-only models dominate today because their autoregressive design scales and generates so well, but encoder-only models still win on pure understanding tasks, and encoder-decoder models suit translation. The later rows cover the structural innovations reshaping frontier models — Mixture of Experts for scaling parameters cheaply, reasoning models that think before answering, and vision-language models that fold images into the same context.

Type	Example	Description
Decoder-only	`GPT-2, GPT-3, LLaMA, Qwen3`	• Causal masked attention for autoregressive generation • trained to predict next token • dominates modern LLMs due to scalability and generation quality.
Encoder-only	`BERT, RoBERTa`	• Bidirectional attention processes entire sequence • excels at understanding tasks like classification, NER, question answering • cannot generate text autoregressively.
Encoder-decoder	`T5, BART`	• Separate encoder (bidirectional) and decoder (causal) with cross-attention • optimal for seq2seq tasks like translation and summarization.
Mixture of Experts (MoE)	`router(x) → top-k experts`	• Sparsely activated architecture where gating network routes tokens to a subset of expert FFNs • scales parameters without proportional compute increase (Mixtral, DeepSeek-V3).
Reasoning model	`<think>...</think>\nAnswer`	• Decoder-only model trained with RL to produce extended chain-of-thought in a scratchpad before the final answer • emergent self-reflection and verification (o1, DeepSeek-R1).
Vision-language	`CLIP, LLaVA, Gemini`	• Integrates vision encoder (ViT) with language model via projection layer or cross-attention • enables multimodal understanding from images and text jointly.
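
The `router(x) → top-k experts` idea can be sketched in a few lines. This version assumes the gate weights are a softmax over only the selected experts' logits, which is one common choice and not the only one; the toy experts and all names are hypothetical.

```python
import math

def top_k_route(router_logits, k=2):
    """Return [(expert_index, gate_weight)] for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(x, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs; the other experts never run."""
    y = [0.0] * len(x)
    for idx, gate in top_k_route(router_logits, k):
        out = experts[idx](x)
        y = [a + gate * b for a, b in zip(y, out)]
    return y

# Four toy "experts": each just scales its input by a different constant.
experts = [lambda v, s=s: [s * t for t in v] for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]
routes = top_k_route(logits, k=2)
y = moe_layer([1.0, -1.0], logits, experts, k=2)
```

Only two of the four experts execute for this token, which is why total parameter count can grow without a proportional increase in per-token compute.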

Table 4: Tokenization Algorithms

Before a model ever sees text it has to be chopped into tokens, and the algorithm that does the chopping shapes vocabulary size, how gracefully the model handles rare or unseen words, and even how well it works across languages. These methods mostly differ in how they decide which character or subword pieces to merge — by frequency in BPE, by likelihood in WordPiece, by probability in Unigram — with byte-level variants guaranteeing that any string at all can be encoded without unknown tokens.

Algorithm	Example	Description
Byte-Pair Encoding (BPE)	`"playing" → ["play", "ing"]`	• Iteratively merges the most frequent adjacent symbol pair in the training corpus • balances vocabulary size with coverage • used in GPT-2, GPT-3 • stores an ordered list of merge rules.
SentencePiece	`treats spaces as token "▁"`	• Language-agnostic tokenizer operating on raw text without pre-tokenization • encodes whitespace as the meta symbol ▁ (U+2581), not an ASCII underscore • supports BPE and unigram models • used in T5, LLaMA.
WordPiece	`"unaffable" → ["un", "##aff", "##able"]`	• Similar to BPE but selects merges based on likelihood maximization rather than frequency • used in BERT • saves final vocabulary only, not merge operations.
Unigram language model	`probabilistic token selection`	• Maintains vocabulary with token probabilities • removes tokens iteratively to minimize loss • allows multiple segmentations unlike greedy BPE/WordPiece.
Byte-level BPE	`any byte sequence tokenizable`	• Operates on raw UTF-8 bytes so any string is tokenizable without unknown tokens • used in GPT-2 and GPT-4 (tiktoken) • fully language-agnostic.
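
The BPE row describes a simple greedy loop, and a toy trainer makes it tangible. This is a minimal sketch over a hypothetical three-word corpus; it starts from characters, ignores word-boundary markers and byte-level handling, and breaks ties by first occurrence.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn merge rules from a {word: count} dict, starting from characters."""
    corpus = {tuple(w): c for w, c in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        merged = {}
        for symbols, count in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + count
        corpus = merged
    return merges, corpus

merges, corpus = learn_bpe({"playing": 5, "played": 3, "singing": 4}, num_merges=5)
```

After five merges on this corpus, "playing" is segmented as `("play", "ing")`, mirroring the example in the table. The ordered `merges` list is what a BPE tokenizer stores and replays at encoding time.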

Table 5: Pre-training Objectives

The pre-training objective is the self-supervised game a model plays over trillions of tokens, and it determines what kind of model you end up with. Predicting the next token (CLM) gives you a generator like GPT; masking and predicting tokens from both sides (MLM) gives you a bidirectional understander like BERT; corruption-and-reconstruction objectives power text-to-text models like T5. The remaining rows cover newer twists like multi-token prediction for speed and contrastive learning for aligning text with images.

Objective	Example	Description
Causal language modeling (CLM)	`P(token_i | token_<i)`	• Predict next token given left context only • standard decoder-only objective maximizing likelihood of training sequences • used in GPT family.
Masked language modeling (MLM)	`P([MASK] | context)`	• Randomly masks ~15% of tokens and predicts them from bidirectional context • BERT's core pre-training objective enabling deep bidirectional representations.
Multi-token prediction (MTP)	`predict tokens t+1, t+2, t+3`	• Trains multiple prediction heads to predict several future tokens simultaneously • improves sample efficiency and enables speculative decoding at inference (DeepSeek-V3: 1.8× speedup).
Denoising autoencoding	`corrupt → reconstruct`	• Masks or corrupts spans of text then trains model to reconstruct original • T5 frames all tasks as text-to-text generation with varying corruption strategies.
Prefix language modeling	`bidirectional prefix → causal`	• Applies bidirectional attention to prefix then causal attention for continuation • bridges encoder and decoder benefits (UniLM, GLM).
Contrastive learning	`align(text, image)`	• Trains model to match positive pairs (text-image) while separating negative pairs • CLIP uses dual encoders with contrastive loss for vision-language alignment.
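
For the causal language modeling row, the training loss is just the average negative log-probability the model assigns to each true next token. The sketch below uses made-up probabilities in place of real model outputs to show the arithmetic, along with the corresponding perplexity.

```python
import math

def causal_lm_loss(step_probs, targets):
    """Mean negative log-likelihood of each target token given its left context.

    step_probs[i] maps candidate tokens to P(token | tokens before position i).
    """
    nll = [-math.log(step_probs[i][tok]) for i, tok in enumerate(targets)]
    return sum(nll) / len(nll)

# Hypothetical model outputs for the sequence "the cat sat".
step_probs = [
    {"the": 0.5, "a": 0.5},
    {"cat": 0.25, "dog": 0.75},
    {"sat": 0.5, "ran": 0.5},
]
loss = causal_lm_loss(step_probs, ["the", "cat", "sat"])
perplexity = math.exp(loss)   # exponentiated mean loss
```

Masked language modeling uses the same cross-entropy, but only at the masked positions and with context from both sides.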

Table 6: Fine-Tuning Techniques

Once a model is pre-trained, fine-tuning adapts it to a specific task or behavior — and the central tension here is cost versus completeness. Full supervised fine-tuning updates every parameter, but the parameter-efficient methods that follow (LoRA, QLoRA, DoRA, adapters, prefix and prompt tuning) freeze the base model and train a tiny fraction of new weights instead, slashing memory enough to fine-tune huge models on a single GPU while keeping quality close. Instruction tuning sits apart as the step that teaches a model to follow commands rather than just continue text.

Technique	Example	Description
Supervised fine-tuning (SFT)	`train on (input, output) pairs`	• Standard gradient-based training on task-specific labeled data • updates all parameters to adapt pre-trained model to downstream task.
Instruction tuning	`"Translate: [text]" → output`	• Fine-tunes on diverse instruction-following datasets formatted as explicit commands • dramatically improves zero-shot task generalization (FLAN, InstructGPT).
LoRA (Low-Rank Adaptation)	`ΔW = BA` (rank r << d)	• Freezes base model and trains low-rank decomposition matrices injected into attention layers • reduces trainable parameters by up to 10,000× (reported for GPT-3 175B) while preserving quality.
QLoRA	`4-bit base + LoRA adapters`	• Quantizes base model to 4-bit precision and applies LoRA • enables fine-tuning 65B models on a single 48GB GPU with minimal degradation.
DoRA (Weight-Decomposed LoRA)	`W = m · (W₀ + BA) / ‖W₀+BA‖`	• Decomposes pre-trained weights into magnitude and direction components • trains direction via LoRA and magnitude separately • consistently outperforms LoRA with no extra inference cost.
Adapter modules	`insert bottleneck layers`	• Adds small trainable feed-forward modules between frozen transformer layers • modular approach allowing task-specific adapters without full model copies.
Prefix tuning	`prepend trainable vectors`	• Optimizes continuous task-specific vectors prepended to each layer • keeps model frozen • competitive with full fine-tuning on some tasks with 0.1% parameters.
P-tuning / Prompt tuning	`optimize soft prompts`	• Learns continuous prompt embeddings rather than discrete text • more parameter-efficient than prefix tuning • effectiveness increases with model scale.
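
The savings behind `ΔW = BA` are easy to verify by counting parameters. The sketch below assumes a single square 4096×4096 projection and rank 8 (both illustrative choices), and also shows the forward pass with the `alpha / r` scaling and zero-initialised `B` described in the LoRA paper.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W0, A, B, x, alpha, r):
    """y = W0 x + (alpha / r) * B (A x); W0 stays frozen, only A and B train."""
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

def lora_param_counts(d_out, d_in, r):
    """Trainable parameters: full update vs. the pair B (d_out x r), A (r x d_in)."""
    return d_out * d_in, r * (d_out + d_in)

full, low_rank = lora_param_counts(4096, 4096, r=8)

# B starts at zero, so the adapted layer initially matches the base layer.
W0 = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]          # r x d_in with r = 1
B = [[0.0], [0.0]]         # d_out x r
y = lora_forward(W0, A, B, [1.0, 1.0], alpha=16, r=1)
```

For this one matrix the low-rank pair has 65,536 trainable parameters against 16,777,216 for a full update, a 256× reduction. The much larger whole-model ratios come from adapting only a few matrices and freezing everything else.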

Table 7: Alignment and RLHF

Alignment is what turns a raw next-token predictor into a model that's helpful, harmless, and actually does what you ask, and this is one of the fastest-moving areas in the field. The lineage runs from classic three-stage RLHF (reward model plus PPO) toward simpler, cheaper successors — DPO and its variants skip the separate reward model, GRPO drops the critic, and RLVR replaces human labels with programmatic verifiers, which is precisely what unlocked the self-reflecting reasoning models like DeepSeek-R1.

Method	Example	Description
RLHF (Reinforcement Learning from Human Feedback)	`reward model → PPO training`	• Three-stage process: SFT, train reward model on human preferences, optimize policy via PPO • used in ChatGPT and Claude—computationally expensive.
GRPO (Group Relative Policy Optimization)	`sample G outputs → normalize rewards`	• Samples a group of responses per prompt and estimates advantages by normalizing rewards within the group • eliminates the critic (value) model, substantially reducing memory vs. PPO • introduced in DeepSeekMath and used in DeepSeek-R1.
DPO (Direct Preference Optimization)	`directly optimize preferences`	• Bypasses reward model by directly optimizing policy on preference pairs using Bradley-Terry model • simpler and more stable than RLHF with comparable results.
SimPO (Simple Preference Optimization)	`avg log-prob as implicit reward`	• Removes the reference model by using the length-normalized (average) log-probability of a response as implicit reward • reported to outperform DPO by up to 6.4 points on AlpacaEval 2 with no extra memory cost.
ORPO (Odds Ratio Preference Optimization)	`combine SFT + preference in one loss`	• Merges SFT and preference alignment into a single training objective using odds ratios • eliminates reference model and separate SFT stage—one pass instead of two.
KTO (Kahneman-Tversky Optimization)	`thumbs-up / thumbs-down labels`	• Works with binary feedback instead of pairwise preferences • cheaper data collection for production systems with like/dislike signals • derived from prospect theory.
RLVR (RL with Verifiable Rewards)	`unit test pass / math checker → reward`	• Uses programmatic verifiers (unit tests, math checkers) as reward signal instead of human labels • enables emergent self-reflection and verification for math and code tasks.
DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization)	`clip-higher + token-level loss`	• Extends GRPO for long chain-of-thought reasoning with clip-higher to prevent entropy collapse, dynamic sampling, and a token-level policy-gradient loss • reports 50 points on AIME 2024 with Qwen2.5-32B using 50% of the training steps of DeepSeek-R1-Zero-Qwen-32B.
Constitutional AI	`self-critique + revision`	• Model critiques its own outputs against constitutional principles then revises • reduces reliance on human feedback for harmlessness alignment.
RLAIF (RL from AI Feedback)	`AI-generated preferences`	• Replaces human labelers with AI-generated feedback for reward model training • scales preference data collection • effective for capability-focused alignment.
Reward modeling	`score(response) from pairs`	• Trains classifier to predict human preference between response pairs • converts subjective preferences into scalar reward signal for RL optimization.
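
Two of the objectives above reduce to short formulas. The sketch below computes the DPO loss for one preference pair from sequence log-probabilities, and GRPO-style advantages for one sampled group. All numbers are made up, `beta=0.1` is an illustrative setting, and the population standard deviation is an assumption (implementations differ on this detail).

```python
import math
from statistics import mean, pstdev

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * [(log pi_w - log ref_w) - (log pi_l - log ref_l)])."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))   # equals -log(sigmoid(margin))

def grpo_advantages(rewards, eps=1e-8):
    """Normalise each reward against the mean and std of its own sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Sequence log-probabilities below are invented for illustration.
neutral = dpo_loss(-10.0, -12.0, -10.0, -12.0)   # policy identical to reference
better = dpo_loss(-9.0, -13.0, -10.0, -12.0)     # policy favours the chosen answer more
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])      # e.g. verifier pass/fail for 4 samples
```

When the policy matches the reference the DPO loss is `log(2)`, and it falls as the policy widens the gap in favour of the chosen response. The GRPO advantages need no learned critic: the group itself serves as the baseline.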

Table 8: Inference Optimization Techniques

Serving an LLM cheaply and fast is its own engineering discipline, and these techniques attack the bottlenecks that make autoregressive generation slow and memory-hungry. KV and prefix caching avoid recomputing work across tokens and requests, Flash Attention and PagedAttention squeeze far more out of GPU memory, and speculative methods like draft models and Medusa generate several tokens at once to break the one-token-at-a-time barrier. Together they're what make production inference economically viable.

Technique	Example	Description
KV caching	`cache computed K, V`	• Stores previously computed key-value pairs during autoregressive generation • avoids redundant computation—essential for production inference efficiency.
Prefix caching	`cache system prompt KV once`	• Reuses KV cache for identical prompt prefixes across requests • skips re-computing static content like system prompts • up to 90% cost and 85% latency reduction for long prompts.
Flash Attention	`tiling + recomputation`	• IO-aware attention algorithm using block-wise computation and kernel fusion • reduces memory bandwidth by 5–20× enabling 2–4× speedup on long sequences.
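PagedAttention	`block-level memory management`	• Manages KV cache in non-contiguous blocks like OS paging • reduces memory fragmentation so more requests fit in a batch • vLLM reports 2–4× higher throughput than earlier serving systems and up to 24× over Hugging Face Transformers.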
Speculative decoding	`draft model → verify in parallel`	• Small draft model generates candidate tokens that large model verifies in parallel • 2–3× speedup without changing output distribution.
Medusa	`extra decoding heads predict t+1, t+2`	• Adds multiple prediction heads to an LLM and verifies candidates via tree-based attention in parallel • 2.2–3.6× speedup without a separate draft model.
Continuous batching	`dynamic request batching`	• Evicts finished sequences and immediately admits new ones (in-flight batching) • maximizes GPU utilization vs. static batching.
Quantization (INT8/FP8)	`convert FP16 → INT8`	• Reduces precision of weights/activations to 8-bit or lower • 2–4× memory reduction and speedup with <1% accuracy loss using calibration.
Flash Attention 2/3	`improved kernel fusion`	• Enhanced IO-aware kernels with lower SRAM usage and better parallelization • Flash-3 up to 2× faster than Flash-2 on H100 with asynchrony optimizations.
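
Much of this table is about KV-cache memory, so it helps to see how quickly that memory grows. The estimate below assumes standard multi-head attention with FP16 values; the 32-layer, 32-head, 128-dimension shape is an illustrative 7B-class configuration, not a claim about any specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    """Keys and values (the factor 2) for every layer, KV head, and cached position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value

# Illustrative shape (assumed): 32 layers, 32 KV heads of size 128, FP16 values.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
one_request = kv_cache_bytes(32, 32, 128, seq_len=4096)
batch_of_16 = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=16)

gib = 1024 ** 3
summary = (per_token // 1024, one_request // gib, batch_of_16 // gib)  # KiB, GiB, GiB
```

Under these assumptions each cached token costs 512 KiB, a single 4096-token request needs 2 GiB, and a batch of 16 needs 32 GiB. That growth is why paging, prefix reuse, and quantization of the cache matter so much in serving.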

Table 9: Sampling and Decoding Strategies

A model outputs a probability distribution over the next token; the decoding strategy decides which token actually gets picked, and it's the single biggest lever on whether output feels robotic or creative. Greedy decoding always grabs the top choice and tends toward repetition, while temperature, top-p, top-k, and min-p sampling inject controlled randomness to keep generation lively without going off the rails. Beam and contrastive search take different routes for tasks like translation that reward coherence over surprise.

Strategy	Example	Description
Greedy decoding	`argmax(logits)`	• Selects highest probability token at each step • deterministic and fast but prone to repetitive, low-quality outputs for creative tasks.
Temperature sampling	`logits / T before softmax`	• Divides logits by T before softmax • T < 1 (e.g., 0.1–0.7) sharpens the distribution • T > 1 (up to about 2.0) flattens it, increasing randomness/creativity • T = 1 leaves the distribution unchanged.
Top-p (nucleus) sampling	`cumulative prob >= p=0.9`	• Dynamically selects smallest set of tokens whose cumulative probability exceeds threshold p • adapts to distribution shape—preferred for open-ended generation.
Top-k sampling	`sample from top k=40 tokens`	• Restricts sampling to k highest-probability tokens • prevents sampling rare tokens but fixed k can be too restrictive or permissive depending on distribution.
Min-p sampling	`filter tokens < min_p * max_prob`	• Removes tokens with probability below min_p fraction of the maximum token probability • better than top-p for maintaining quality while allowing creativity.
Beam search	`keep top-k sequences`	• Maintains k parallel hypotheses and expands most probable • balances quality and diversity but computationally expensive • common for translation.
Contrastive search	`argmax((1-α)·p(v) - α·max_cos_sim(v, context))`	• Selects tokens that are probable but distinct from the previous context, penalizing the maximum cosine similarity between the candidate's representation and earlier tokens • reduces repetition while maintaining coherence.
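
The truncation strategies in this table all do the same two things: pick a subset of tokens, then renormalise. A minimal sketch over a toy four-token distribution follows; function names are illustrative, and the seeded `random.Random` is only there to make the draw repeatable.

```python
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def renormalise(probs, keep):
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalise(probs, order[:k])

def top_p_filter(probs, p):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:        # smallest prefix whose mass reaches p
            break
    return renormalise(probs, keep)

def min_p_filter(probs, min_p):
    threshold = min_p * max(probs)
    return renormalise(probs, [i for i, q in enumerate(probs) if q >= threshold])

probs = [0.5, 0.3, 0.15, 0.05]
nucleus = top_p_filter(probs, 0.9)          # keeps the first three tokens
rng = random.Random(0)
token = rng.choices(range(len(probs)), weights=nucleus)[0]
```

Greedy decoding is the special case of always taking the index of the largest probability, and temperature is applied inside `softmax` before any of these filters.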

Table 10: Training Optimizations

Training a large model bumps into the hard limits of GPU memory and time, and these techniques are the everyday tricks for staying within them. Mixed precision and gradient checkpointing trade bits and recomputation for room to fit bigger models, gradient accumulation fakes a larger batch size, and the optimizer-and-schedule choices — AdamW with warmup and cosine decay — are the near-universal recipe that keeps large-model training from diverging.

Technique	Example	Description
Mixed precision (FP16/BF16)	`compute in BF16, master weights in FP32`	• Uses 16-bit floats for forward/backward computation while keeping FP32 master weights for the optimizer update • roughly halves activation memory and accelerates training with Tensor Cores • BF16 preferred for stability.
Gradient checkpointing	`recompute activations`	• Trades compute for memory by recomputing forward activations during backward pass instead of storing • enables 2–10× larger models at 20–30% slowdown.
Gradient accumulation	`accumulate over N steps`	• Simulates larger effective batch size by accumulating gradients across microbatches before update • crucial for training large models on limited memory.
AdamW optimizer	`decouple weight decay`	• Fixes weight decay in Adam by applying it directly to weights, not gradient • improves generalization • standard optimizer for transformer training.
Learning rate warmup	`linear increase 0 → max_lr`	• Gradually increases learning rate from zero over initial steps (typically 2–10% of training) • stabilizes training of large models and prevents divergence.
Cosine annealing	`lr = min + 0.5(max-min)(1+cos(πt/T))`	• Decreases learning rate along a half cosine from max to min as step t goes from 0 to T • smooth decay helps model converge to flatter minima • often combined with warmup.
GaLore (Gradient Low-Rank Projection)	`project grad to low-rank subspace`	• Projects gradients into a low-rank subspace via periodic SVD • enables full-rank training with optimizer memory footprint comparable to LoRA • no inference overhead.
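
The warmup and cosine rows are usually combined into one schedule. A minimal sketch follows; the peak rate of 3e-4, floor of 3e-5, and 5% warmup are illustrative values, not recommendations.

```python
import math

def lr_at(step, total_steps, warmup_steps, max_lr, min_lr=0.0):
    """Linear warmup from 0 to max_lr, then cosine decay from max_lr to min_lr."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

schedule = [
    lr_at(s, total_steps=1000, warmup_steps=50, max_lr=3e-4, min_lr=3e-5)
    for s in range(1001)
]
```

The schedule starts at zero, peaks at step 50, and lands on the floor value at the final step.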

Table 11: Distributed Training Strategies

No single GPU can hold a frontier model, so training is spread across many, and these strategies differ in what they split. Data parallelism replicates the model and splits the batch; tensor and pipeline parallelism slice the model itself, within layers and across layers respectively; and ZeRO/FSDP shard the optimizer states, gradients, and parameters to wring out memory. Real large-scale runs combine several of these at once, so understanding what each one partitions is the key to reading any training setup.

Strategy	Example	Description
Data parallelism	`replicate model across GPUs`	• Each GPU holds full model copy processing different data batches • gradients averaged across devices • simplest parallelism but memory-limited by single GPU.
Tensor parallelism	`split each layer's weights across GPUs`	• Partitions individual layers (e.g., attention heads) within model across devices • requires frequent all-reduce communication • used in Megatron-LM for very large models.
ZeRO (Zero Redundancy Optimizer)	`shard optimizer states`	• Partitions optimizer states, gradients, parameters across devices with on-demand gathering • ZeRO-3 achieves model parallelism memory efficiency with data parallelism simplicity.
FSDP (Fully Sharded Data Parallel)	`PyTorch native ZeRO-3`	• PyTorch implementation of ZeRO-3 sharding • automatically manages parameter gathering/scattering • simpler API than DeepSpeed for distributed training.
Pipeline parallelism	`layer stages on different GPUs`	• Splits model vertically by layers creating pipeline stages • microbatching reduces bubble overhead • GPipe, PipeDream frameworks.
Sequence parallelism	`partition along sequence dim`	• Splits sequence length across GPUs for memory-constrained layers like LayerNorm • extends tensor parallelism reducing activation memory in long sequences.
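
What ZeRO's three stages buy can be worked out with simple arithmetic. The sketch below follows the accounting used in the ZeRO paper for mixed-precision Adam (2 bytes per parameter for FP16 weights, 2 for FP16 gradients, 12 for optimizer state) and ignores activations and buffers; the 7.5B-parameter, 64-GPU setup is an illustrative example.

```python
def zero_memory_gb(n_params, n_gpus, stage):
    """Per-GPU model-state memory in GB for mixed-precision Adam (activations excluded).

    Assumes 2 bytes/param for FP16 weights, 2 for FP16 gradients, and 12 for
    optimizer state (FP32 weights, momentum, variance).
    """
    params, grads, optim = 2 * n_params, 2 * n_params, 12 * n_params
    if stage >= 1:
        optim /= n_gpus      # ZeRO-1: shard optimizer states
    if stage >= 2:
        grads /= n_gpus      # ZeRO-2: also shard gradients
    if stage >= 3:
        params /= n_gpus     # ZeRO-3 / FSDP full sharding: also shard parameters
    return (params + grads + optim) / 1e9

table = {stage: round(zero_memory_gb(7.5e9, 64, stage), 1) for stage in range(4)}
```

Under these assumptions the per-GPU footprint drops from 120 GB with plain data parallelism to about 31.4, 16.6, and 1.9 GB for stages 1 to 3.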

Table 12: Model Compression Techniques

Compression shrinks a trained model so it runs faster and cheaper, ideally without losing much accuracy. Quantization drops weights to lower precision, distillation trains a small student to imitate a large teacher, and pruning cuts out weights or whole components that contribute little — each making a different trade between how much you save and how much retraining or calibration it costs.

Technique	Example	Description
Post-training quantization	`GPTQ, AWQ`	• Converts trained model weights to lower precision (INT4/8) without retraining • calibration on small dataset • 3–4× compression with minimal accuracy loss.
Knowledge distillation	`student learns from teacher`	• Trains small student model to match outputs or intermediate representations of large teacher • compresses model while retaining capabilities—DistilBERT example.
Pruning (structured/unstructured)	`remove low-magnitude weights`	• Removes unimportant weights or entire structured components (heads, layers) • can reduce parameters 40–60% but requires careful calibration or retraining.
Quantization-aware training (QAT)	`simulate quantization during training`	• Inserts fake quantization operations in forward pass to model precision effects • typically achieves better accuracy than post-training methods.
Low-rank decomposition	`W ≈ UV` (rank r)	• Approximates weight matrices as product of low-rank matrices • reduces parameters in linear layers • basis of LoRA and similar PEFT methods.
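
The simplest form of weight quantization is round-to-nearest with a single scale per tensor. The sketch below shows that baseline only; it is not GPTQ or AWQ, which add calibration data and error compensation on top of this idea. The weights are made-up values.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map [-absmax, absmax] onto [-127, 127]."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 127 if absmax > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 1.27, -1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The round-trip error is bounded by half the scale. A single outlier weight inflates the scale for the whole tensor, which is one reason practical methods quantize per channel or per group.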

Table 13: Context Window and Long-Context Techniques

A model can only attend over so many tokens at once, and attention's quadratic cost makes simply growing that window expensive — so these techniques extend a model's effective reach in two complementary ways. RAG sidesteps the limit by retrieving relevant text on demand, while RoPE scaling, position interpolation, and YaRN stretch the position encoding to handle far longer sequences than training; sliding-window and sparse attention cut the quadratic cost so the longer context is actually affordable to run.

Technique	Example	Description
Retrieval-Augmented Generation (RAG)	`retrieve docs → augment prompt`	• Retrieves relevant documents from external knowledge base and injects into context • extends effective knowledge beyond model limits • requires good retrieval system.
RoPE scaling	`adjust rotation frequencies`	• Modifies RoPE base frequency to extrapolate to longer sequences • simple method enabling 2–4× context extension with minimal fine-tuning.
Position interpolation	`compress position indices`	• Interpolates positions within training range rather than extrapolating • better stability than direct extrapolation for extended context.
YaRN (Yet another RoPE extensioN)	`NTK-by-parts + attention temperature`	• Interpolates RoPE frequencies unevenly across dimensions (NTK-by-parts) and rescales attention logits with a temperature • efficient context extension from 4K to 128K+ tokens.
Sliding window attention	`attend to local window`	• Each token attends only to fixed-size window around its position • linear memory but limited long-range modeling • used in Longformer.
Sparse attention	`attend to subset of positions`	• Computes attention only for selected position pairs using patterns (local, strided, global) • reduces O(n²) complexity enabling 10×+ longer sequences.
Recurrent memory	`compress past into memory`	• Summarizes earlier context into compressed memory state • enables unbounded context in theory but loses fine-grained information from distant past.
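
Position interpolation is a one-line change to how RoPE angles are computed: positions are scaled down so a longer sequence maps into the range seen in training. The sketch below assumes a 4K training length stretched to 16K, with a small feature dimension for readability; the function name is illustrative.

```python
def rope_angles(pos, dim, base=10000.0, scale=1.0):
    """Rotation angle per feature pair; scale < 1 compresses positions (interpolation)."""
    return [(pos * scale) * base ** (-i / dim) for i in range(0, dim, 2)]

train_len, target_len = 4096, 16384
scale = train_len / target_len               # 0.25: squeeze 16K positions into the 4K range

seen_max = rope_angles(train_len - 1, dim=8)
extrapolated = rope_angles(target_len - 1, dim=8)               # angles never seen in training
interpolated = rope_angles(target_len - 1, dim=8, scale=scale)  # stays below train_len
```

The last position of the 16K sequence now produces the angles of position 4095.75, inside the trained range, at the cost of packing neighbouring positions four times closer together. RoPE base scaling and YaRN adjust the frequencies themselves instead of scaling every position uniformly.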

Table 14: Prompt Engineering Techniques

Prompting is how you steer a frozen model at inference time, no training required — and a surprising amount of capability is unlocked just by how you ask. Few-shot and zero-shot prompting set the baseline, while the reasoning-oriented techniques are the real workhorses: chain-of-thought coaxes the model to show its steps, self-consistency votes across multiple attempts, and ReAct and tree-of-thoughts add tool use and search for genuinely hard problems.

Technique	Example	Description
Few-shot prompting	`Example1, Example2, ... Query`	• Demonstrates task through 2–10 input-output examples in prompt • exploits in-context learning • effectiveness grows with model scale • examples should be diverse.
Zero-shot prompting	`"Translate to French: [text]"`	• Provides task instruction only without examples • relies on pre-training and instruction tuning • quality highly dependent on model capabilities and prompt clarity.
Chain-of-thought (CoT)	`"Let's think step by step"`	• Prompts model to generate intermediate reasoning steps before final answer • dramatically improves performance on math, logic, commonsense reasoning.
Self-consistency	`sample multiple paths → vote`	• Generates multiple reasoning paths then selects most consistent answer via majority voting • improves reliability over single-path CoT.
ReAct (Reasoning + Acting)	`Thought → Action → Observation`	• Interleaves reasoning and tool use • model generates thoughts, selects actions (API calls, searches), observes results iteratively until solution found.
Tree-of-thoughts	`explore reasoning tree`	• Explores multiple reasoning branches with backtracking and evaluation • enables deliberate problem-solving for complex tasks requiring search.
Skeleton-of-thought	`outline → parallel expand`	• Generates a skeleton outline first then expands each point in parallel • reduces end-to-end latency by up to 2× on modern hardware.
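
Self-consistency is simple enough to sketch end to end. The example below assumes a `sample_fn` that returns only the final answer extracted from one sampled reasoning path; here it is a hypothetical stand-in that replays canned answers in place of a real model call.

```python
from collections import Counter

def self_consistency(sample_fn, prompt, n_samples=5):
    """Sample several reasoning paths and return the most common final answer."""
    answers = [sample_fn(prompt) for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples

# Stand-in for a sampled LLM call (hypothetical): replays canned final answers.
canned = iter(["42", "41", "42", "42", "40"])
answer, agreement = self_consistency(lambda prompt: next(canned), "What is 6 * 7?")
```

The agreement ratio doubles as a rough confidence signal: a low value suggests the reasoning paths disagree and the answer deserves a second look.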

Table 15: Emergent Capabilities and Scaling

This is the theory behind why making models bigger keeps working. Scaling laws predict how loss falls as a power law with compute, data, and size, and the Chinchilla result refined that into how to balance the budget — revealing that many models were badly undertrained. The more surprising entries are in-context learning and emergent abilities: capabilities like few-shot learning and reasoning that appear seemingly out of nowhere once a model crosses a certain scale.

Concept	Example	Description
Scaling laws	`L(N) ∝ N^(-α)`	• Loss falls as a power law in compute, model size, and dataset size • used to decide how to split a compute budget between parameters and tokens • vocabulary size also affects optimal scaling.
Compute-optimal scaling	`Chinchilla: ~20 tokens per parameter`	• For a fixed compute budget, model size and training tokens should be scaled in roughly equal proportion • implies many earlier models were undertrained relative to their size.
In-context learning	`few-shot without gradients`	• Ability to learn new tasks from examples in prompt without parameter updates • improves with scale • mechanism may involve induction heads in attention layers.
Emergent abilities	`reasoning, arithmetic`	• Capabilities that appear suddenly at scale not present in smaller models • includes in-context learning and chain-of-thought reasoning • debated whether truly emergent or metric artifacts.
Transfer learning	`pre-train → fine-tune`	• Pre-trained models encode general language understanding transferable to downstream tasks • foundation of modern NLP—larger models transfer better.
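
Compute-optimal sizing can be turned into back-of-the-envelope arithmetic. The sketch below relies on two widely used rules of thumb that are assumptions here, not exact laws: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and about 20 training tokens per parameter.

```python
import math

def compute_optimal(flops, tokens_per_param=20.0):
    """Solve C = 6 * N * D with D = tokens_per_param * N for parameters N and tokens D."""
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params

# Budget chosen for illustration so the result comes out round.
n_params, n_tokens = compute_optimal(5.88e23)
```

For this budget the rule gives roughly 70B parameters trained on 1.4T tokens. The fitted constants vary with data and architecture, so treat the output as an order-of-magnitude guide.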

Table 16: Activation Functions in Transformers

The activation function inside each feed-forward layer is a small choice with measurable effect on quality. The field has drifted from the simple ReLU of the original Transformer toward smoother and gated variants — GELU in BERT and GPT-2, and the gated SwiGLU and GeGLU that power LLaMA and PaLM because they consistently squeeze out a bit more performance, at the cost of slightly larger layers.

Function	Example	Description
SwiGLU	`SwiGLU(x) = Swish(xW) ⊙ (xV)`	• Gated variant using Swish activation (x·sigmoid(x)) with element-wise gating • used in LLaMA, PaLM—empirically outperforms GELU • requires ~50% more FFN parameters for same hidden size.
GELU (Gaussian Error Linear Unit)	`GELU(x) = x·Φ(x)`	• Smooth activation that weights each input by the standard Gaussian CDF Φ(x) • used in BERT, GPT-2 • non-zero gradient for negative inputs, unlike ReLU • can be read as the expected value of a stochastic, input-dependent gate, similar in spirit to dropout.
GeGLU	`GeGLU(x) = GELU(xW) ⊙ (xV)`	• Similar to SwiGLU but uses GELU for gating • strong performance on language tasks • used in T5 variants.
ReLU (Rectified Linear Unit)	`ReLU(x) = max(0, x)`	• Simple piecewise linear function • original Transformer used ReLU • computationally efficient but can suffer from dead neurons • largely replaced in modern LLMs.
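
The formulas in the table translate almost directly into code. The following is a minimal pure-Python sketch on plain lists, assuming bias-free projections and using the exact erf-based GELU. The helper names (`matvec`, `glu_variant`) are illustrative and not taken from any library.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def gelu(x: float) -> float:
    # x * Φ(x), with Φ the standard Gaussian CDF written via erf.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def swish(x: float) -> float:
    # x * sigmoid(x)
    return x / (1.0 + math.exp(-x))


def matvec(x, w):
    """Row vector x (length d_in) times matrix w (d_in x d_out)."""
    return [
        sum(x[i] * w[i][j] for i in range(len(x)))
        for j in range(len(w[0]))
    ]


def glu_variant(x, w, v, act):
    """act(xW) ⊙ (xV): the gate and value paths use separate projections."""
    gate = [act(g) for g in matvec(x, w)]
    value = matvec(x, v)
    return [g * u for g, u in zip(gate, value)]


def swiglu(x, w, v):
    return glu_variant(x, w, v, swish)


def geglu(x, w, v):
    return glu_variant(x, w, v, gelu)


identity = [[1.0, 0.0], [0.0, 1.0]]
print(swiglu([1.0, 2.0], identity, identity))
```

The second projection `v` is the source of the extra FFN parameters mentioned in the SwiGLU row: a gated layer has three weight matrices (counting the output projection, omitted here) where a plain FFN has two.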

Table 17: Normalization Techniques

Normalization keeps activations in a stable range so deep transformers can train without blowing up, and two questions dominate the design: which normalizer, and where to put it. RMSNorm has largely displaced classic LayerNorm in modern LLMs for being faster at comparable quality, while the Pre-LN versus Post-LN placement decides how stable training is and whether you need a warmup. Dropout, once standard, fades in the largest models, which are limited by capacity relative to their data and therefore tend to underfit rather than overfit.

Method	Example	Description
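RMSNorm (Root Mean Square Normalization)	`RMSNorm(x) = x / RMS(x) · γ`	• Simplified LN removing mean centering and the bias term, normalizing only by the RMS • cheaper than LN with comparable quality (the RMSNorm paper reports running-time reductions of roughly 7–64% depending on the model; actual gains vary with hardware and kernel fusion) • used in LLaMA, Grok, Qwen3.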
Layer Normalization (LN)	`LN(x) = (x - μ) / σ · γ + β`	• Normalizes across feature dimension for each token independently • standard in transformers • mean/variance computed per-sample allowing any batch size.
Pre-LN vs Post-LN	`Pre: LN(x) → Sublayer`	• Pre-LN applies normalization before sublayer (modern default—more stable) • Post-LN applies after (original Transformer—requires warmup) • Pre-LN enables easier convergence.
Dropout	`randomly zero with p=0.1`	• Randomly drops activations during training as regularization • less common in very large LLMs which are underparameterized relative to data • typical rates 0.1–0.2.
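
The two normalizers and the two placements can be compared side by side. The sketch below is a minimal pure-Python version on a single token vector. The `eps` default and the function names `pre_ln_block` and `post_ln_block` are assumptions made for illustration.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """LN(x) = (x - mean) / std * gamma + beta, over the feature dimension."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * g + b for v, g, b in zip(x, gamma, beta)]


def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm(x) = x / RMS(x) * gamma: no mean centering, no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * g for v, g in zip(x, gamma)]


def pre_ln_block(x, sublayer, norm):
    """Pre-LN: normalize the sublayer input, keep the residual path untouched."""
    return [a + b for a, b in zip(x, sublayer(norm(x)))]


def post_ln_block(x, sublayer, norm):
    """Post-LN: add the residual first, then normalize the sum."""
    return norm([a + b for a, b in zip(x, sublayer(x))])


x = [1.0, 2.0, 3.0, 4.0]
ones = [1.0] * len(x)
zeros = [0.0] * len(x)
print(rms_norm(x, ones))
print(layer_norm(x, ones, zeros))
```

In the Pre-LN form the residual stream `x` passes through unnormalized, which is one common explanation for its more stable gradients at depth.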

Table 18: Evaluation Metrics and Benchmarks

Knowing how good a model actually is means picking the right yardstick, and these fall into two camps. Benchmarks like MMLU, GPQA, HumanEval, and SWE-bench probe knowledge, science, and coding ability — though many are nearing saturation or risk contamination, which is why contamination-resistant and human-preference evaluations (LiveBench, Chatbot Arena) have gained ground. The classic automatic metrics — perplexity, BLEU, ROUGE, BERTScore — measure narrower properties like fluency, translation overlap, and summarization quality.

Metric	Example	Description
MMLU (Massive Multitask Language Understanding)	`57 subjects, 4-way multiple choice`	• Tests knowledge and reasoning across STEM, humanities, social sciences • standard benchmark for general capabilities • 0–100% accuracy.
GPQA Diamond	`198 PhD-level science questions`	• Tests doctoral-level knowledge in biology, physics, chemistry • Diamond is the hardest, highest-agreement subset of the 448-question GPQA main set • designed to be "Google-proof", i.e. hard even with internet access • nearing saturation for frontier models (~94% as of 2026).
HumanEval	`code synthesis benchmark`	• Evaluates code generation with 164 hand-written programming problems • pass@k metric measures functional correctness • standard for coding models.
SWE-bench	`GitHub issue → code fix`	• Tests real-world software engineering—resolving GitHub issues in Python repos • measures fraction of issues resolved • key agentic coding benchmark.
LiveBench	`monthly refreshed questions`	• Contamination-resistant benchmark with questions refreshed monthly using recent data sources • covers math, coding, reasoning, language, data analysis.
Arena Elo / Chatbot Arena	`human preference pairwise ranking`	• Crowdsourced pairwise preference evaluation • Elo-style ratings computed from millions of blind votes • strong signal for real-world conversational quality.
Perplexity	`PPL = exp(avg_loss)`	• Measures how surprised model is by test data • lower is better • exponential of average cross-entropy loss • standard language modeling metric.
BLEU	`n-gram precision × brevity penalty`	• Compares n-gram overlap between generated and reference translations, penalizing outputs shorter than the reference • reported on a 0–1 or 0–100 scale • standard for machine translation evaluation.
ROUGE	`ROUGE-L, ROUGE-N`	• Measures recall-oriented n-gram overlap • primarily for summarization • ROUGE-L uses longest common subsequence.
BERTScore	`contextual embedding similarity`	• Computes token similarity using BERT embeddings rather than exact matches • captures semantic similarity better than n-gram metrics.
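
Two of the metrics above reduce to short formulas. The sketch below computes perplexity from per-token probabilities and the unbiased pass@k estimator described in the HumanEval paper, where `n` samples are drawn per problem and `c` of them pass the tests. The function names and toy inputs are illustrative assumptions.

```python
import math


def perplexity(token_probs):
    """PPL = exp(mean negative log-likelihood of the observed tokens)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples (out of n, c correct) passes.

    Computed as 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# A model that assigns probability 0.25 to every observed token is as
# uncertain as a uniform choice among four options.
print(perplexity([0.25] * 8))
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

Averaging `pass_at_k` over all problems gives the benchmark score. Drawing more samples than `k` lowers the variance of the estimate.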

Table 19: Attention Mechanism Optimizations

Standard multi-head attention is expensive at inference, mostly because of the memory the KV cache consumes, so these optimizations rework attention to be cheaper. GQA and MQA shrink the cache by sharing keys and values across heads, and MLA compresses them into a low-rank latent (introduced in DeepSeek-V2 and carried into DeepSeek-V3). Sliding-window and linear attention cut the quadratic cost by restricting or approximating attention, which trades a sliver of quality for major savings in speed and memory. Flash Attention is the exception: it computes exact attention and gains its speed by avoiding reads and writes of the full attention matrix in GPU memory.

Optimization	Example	Description
Grouped-query attention (GQA)	`heads share K, V in groups`	• Compromise between MQA and MHA: groups of query heads share one K, V head • balances quality-efficiency tradeoff • used in LLaMA-2 70B, LLaMA-3, Mistral, Qwen3.
Multi-Head Latent Attention (MLA)	`compress KV to latent → cache`	• Compresses keys and values into a low-rank latent vector (joint KV compression) before caching • DeepSeek-V2 reports a 93.3% smaller KV cache than its dense predecessor DeepSeek 67B, with quality on par with or better than MHA in its ablations • used in DeepSeek-V3, Kimi K2.
Multi-query attention (MQA)	`single K, V across heads`	• Shares same key-value projections across all heads using only multiple queries • reduces KV cache memory and speeds inference but slightly lower quality than GQA.
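Flash Attention 2/3	`improved kernel fusion`	• IO-aware exact-attention kernels that tile the computation in on-chip SRAM instead of materializing the full n×n matrix in HBM • FlashAttention-2 improves parallelism and work partitioning • FlashAttention-3 reports 1.5–2.0× speedup over FlashAttention-2 on H100 using asynchrony and FP8 support.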
Sliding window attention	`attend to k nearest tokens`	• Restricts attention to fixed-size local window • reduces complexity to O(n·k) from O(n²) • enables longer sequences but limited global context.
Linear attention	`kernel trick approximation`	• Approximates softmax attention using kernel methods reducing complexity to O(n) • enables efficient very long sequences but quality gaps remain vs. full attention.
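
The memory argument behind GQA and MQA is simple arithmetic: the cache stores one key and one value vector per layer, per KV head, per position. The sketch below works through it for a hypothetical configuration (32 layers, 32 query heads, head dimension 128, 4096 tokens, 2-byte elements), which is an assumption and not any specific model. It also builds a causal sliding-window mask to show where the O(n·k) cost comes from.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes for one sequence: 2 tensors (K and V) per layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem


def sliding_window_mask(n, window):
    """mask[i][j] is True when query i may attend to key j.

    Causal (j <= i) and local (i - j < window), so each row has at most
    `window` True entries and the total work is O(n * window).
    """
    return [[j <= i and i - j < window for j in range(n)] for i in range(n)]


# Hypothetical model shape, chosen only to make the numbers round.
LAYERS, Q_HEADS, HEAD_DIM, SEQ = 32, 32, 128, 4096

mha = kv_cache_bytes(LAYERS, Q_HEADS, HEAD_DIM, SEQ)  # one KV head per query head
gqa = kv_cache_bytes(LAYERS, 8, HEAD_DIM, SEQ)        # 8 KV heads shared by groups of 4
mqa = kv_cache_bytes(LAYERS, 1, HEAD_DIM, SEQ)        # a single shared KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**20:.0f} MiB per sequence")

mask = sliding_window_mask(4, 2)
print(sum(sum(row) for row in mask), "allowed query-key pairs out of", 4 * 4)
```

For this configuration the cache shrinks by 4× with GQA and 32× with MQA. The ratio is always the number of query heads divided by the number of KV heads.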

Table 20: Advanced Training Techniques

Beyond the standard recipe, these techniques shape what and in what order a model learns. Multi-task and curriculum learning structure the training signal — sharing parameters across tasks, or sequencing easy examples before hard ones — while continual learning tackles the problem of updating a model on new data without erasing what it already knew. Contrastive learning and data augmentation round out the toolkit for representation quality and squeezing more from limited data.

Technique	Example	Description
Multi-task learning	`train on multiple tasks jointly`	• Shares parameters across tasks expecting positive transfer • T5 frames everything as text-to-text • requires balanced sampling and task weighting.
Curriculum learning	`easy → hard examples`	• Orders training data from simple to complex • can improve convergence and final performance • domain-specific curriculum design needed.
Continual learning	`incremental data updates`	• Updates model on new data without forgetting previous knowledge • addresses catastrophic forgetting through rehearsal, regularization, or architectural solutions.
Contrastive learning	`SimCLR, CLIP`	• Learns representations by contrasting positive pairs against negatives • CLIP aligns text-image pairs • effective for self-supervised and multimodal learning.
Data augmentation	`back-translation, paraphrasing`	• Generates synthetic training variations from existing data • common methods include back-translation, EDA (easy data augmentation: synonym swaps, random insertions and deletions), and LLM-generated examples • particularly useful for low-resource tasks.
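
The contrastive row can be illustrated with the InfoNCE objective, in which each positive pair is scored against every other item in the batch. The sketch below is a minimal one-directional version over a precomputed similarity matrix, where `sim[i][j]` is the similarity of anchor `i` to candidate `j` and the diagonal holds the positive pairs. The temperature default is an assumption, and CLIP averages this loss over both the text-to-image and image-to-text directions.

```python
import math


def info_nce(sim, temperature=0.07):
    """Mean cross-entropy of picking the matching candidate for each anchor."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / n


# No signal: every candidate looks equally similar, loss equals log(batch size).
uninformative = [[0.5] * 4 for _ in range(4)]
# Strong signal: each anchor is similar only to its own positive.
aligned = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

print(info_nce(uninformative))
print(info_nce(aligned))
```

The loss falls from log(4) toward zero as the positives separate from the negatives, which is why larger batches (more negatives) give a harder, more informative task.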

Table 21: Model Merging Techniques

Merging combines two or more fine-tuned models into one without any retraining, blending their abilities just by manipulating weights — a cheap way to fuse, say, a coding model and a chat model. The methods differ in how they reconcile conflicting weights: SLERP interpolates two models geometrically, task arithmetic adds and subtracts capability "vectors," and TIES and DARE resolve interference when merging many models at once, with evolutionary search automating the recipe hunt.

Technique	Example	Description
SLERP (Spherical Linear Interpolation)	`t=0.5` between model A and B	• Smoothly interpolates between two models' weights in spherical space preserving geometric properties • best for high-quality pairwise merges • limited to two models at a time.
TIES-Merging	`trim → elect sign → disjoint merge`	• Three-step process: trim redundant parameters, elect dominant sign direction, merge aligned parameters • handles multi-model merging by resolving parameter conflicts.
DARE (Drop And REscale)	`drop delta weights p=0.9, rescale`	• Randomly drops task-vector delta weights with probability p, then rescales the survivors by 1/(1−p) so the expected delta is unchanged • effective even when dropping 90–99% of deltas • used as a preprocessing step before TIES or Task Arithmetic.
Task Arithmetic	`task_vector = fine_tuned - pretrained`	• Computes task vectors (delta weights) and combines via arithmetic • add vectors to merge capabilities, negate to remove behaviors • simple and composable.
Passthrough (layer stacking)	`layers 0-32 of A + 24-32 of B`	• Concatenates layers from different models to create frankenmerge with exotic parameter counts (e.g., 9B from two 7B models) • experimental but can produce capable models.
Evolutionary merging	`evolutionary search over merge configs`	• Uses evolutionary algorithms to automatically search over merging recipes and hyperparameters • MERGE³ reports about 50× lower fitness-evaluation cost, making the search feasible on a single GPU • has produced merged models that are competitive on benchmark leaderboards.
Model Soup (weight averaging)	`average weights of N fine-tuned models`	• Averages weights of multiple fine-tuned versions of same base model • improves accuracy without increasing inference cost • greedy variant evaluates each addition.
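
Three of the merge operations above fit in a few lines when weights are treated as flat vectors. The sketch below is a toy illustration on Python lists, not the mergekit implementation. The function names, the `scale` parameter, and the fallback to linear interpolation for near-parallel vectors are assumptions, and `slerp` expects non-zero inputs.

```python
import math
import random


def task_vector(fine_tuned, pretrained):
    """Delta weights: fine_tuned - pretrained."""
    return [f - p for f, p in zip(fine_tuned, pretrained)]


def task_arithmetic(pretrained, task_vectors, scale=1.0):
    """Add scaled task vectors to the base; a negative scale removes a behavior."""
    merged = list(pretrained)
    for tv in task_vectors:
        merged = [m + scale * d for m, d in zip(merged, tv)]
    return merged


def dare(delta, p, rng):
    """Drop each delta with probability p (p < 1), rescale survivors by 1/(1-p)."""
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else d * keep_scale for d in delta]


def slerp(a, b, t, eps=1e-8):
    """Spherical interpolation between two non-zero weight vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))
    if math.sin(omega) < eps:
        # Nearly parallel vectors: fall back to linear interpolation.
        return [(1.0 - t) * x + t * y for x, y in zip(a, b)]
    wa = math.sin((1.0 - t) * omega) / math.sin(omega)
    wb = math.sin(t * omega) / math.sin(omega)
    return [wa * x + wb * y for x, y in zip(a, b)]


base = [1.0, 1.0]
coder = [2.0, 1.0]  # hypothetical fine-tune A
chat = [1.0, 3.0]   # hypothetical fine-tune B
vectors = [task_vector(coder, base), task_vector(chat, base)]
print(task_arithmetic(base, vectors))
print(slerp([1.0, 0.0], [0.0, 1.0], 0.5))
print(dare([1.0] * 8, p=0.5, rng=random.Random(0)))
```

Real merges apply these operations tensor by tensor, and DARE is normally followed by TIES or task arithmetic on the sparsified deltas.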

Table 22: LLM Agent Concepts

An agent is what you get when an LLM can act in the world rather than just talk — and tool use, the ability to call external functions through structured JSON, is the capability that crosses that line. The rest describe how agents string those actions together: the ReAct loop interleaves reasoning with acting, memory spans a single session or persists across many, planning decomposes big tasks, and multi-agent setups and RLVR-trained agents push toward systems that solve genuinely multi-step problems on their own.

Concept	Example	Description
Tool use / Function calling	`{"function": "search", "args": {...}}`	• LLM selects and invokes external functions (APIs, search, code execution) via structured JSON • the defining capability separating conversational models from agents.
ReAct loop	`Thought → Action → Observation → …`	• Agent iterates Thought → Action → Observation cycles until task is complete • interleaves reasoning and acting for grounded multi-step problem solving.
Agent memory (short/long-term)	`context window + vector store`	• Short-term: in-context window, cleared after session • Long-term: external vector store or database enabling retrieval across sessions.
Planning and decomposition	`task → subtask1, subtask2, ...`	• Agent breaks large tasks into manageable subtasks via chain-of-thought or tree-of-thought • can reflect and revise plans based on intermediate results.
Multi-agent framework	`orchestrator → specialist agents`	• Multiple LLM agents collaborate: one orchestrates, others specialize • improves performance on tasks requiring diverse expertise or parallel execution.
Agentic RAG	`agent decides when/what to retrieve`	• Agent dynamically decides when to retrieve, what to search, and how to use results • contrasts with static one-shot RAG by iterating retrieval based on partial answers.
RLVR for agents	`env reward → policy update`	• Trains agents using verifiable environment rewards (tool execution outcomes, test pass/fail) • enables agents to discover optimal multi-step strategies without human demonstrations.
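
The tool-use and ReAct rows combine into a short control loop: the model emits a JSON step, the runtime executes the named function, and the observation is fed back in. The sketch below replaces the LLM with a scripted stand-in so it runs deterministically. `scripted_model`, the `add` and `multiply` tools, and the message roles are all hypothetical, while the `"function"` and `"args"` keys follow the example in the table.

```python
import json

# Tool registry: the name the model emits maps to a Python callable.
TOOLS = {
    "add": lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
}


def scripted_model(history):
    """Stand-in for an LLM: picks the next step from the observations so far."""
    observations = [h["content"] for h in history if h["role"] == "observation"]
    if len(observations) == 0:
        step = {"thought": "Add 2 and 3 first.",
                "function": "add", "args": {"a": 2, "b": 3}}
    elif len(observations) == 1:
        step = {"thought": "Multiply the sum by 4.",
                "function": "multiply", "args": {"a": observations[0], "b": 4}}
    else:
        step = {"thought": "I have the result.", "final": str(observations[-1])}
    return json.dumps(step)


def run_agent(task, model, tools, max_steps=5):
    """Thought → Action → Observation loop with a hard step budget."""
    history = [{"role": "task", "content": task}]
    for _ in range(max_steps):
        step = json.loads(model(history))
        history.append({"role": "thought", "content": step["thought"]})
        if "final" in step:
            return step["final"], history
        result = tools[step["function"]](**step["args"])
        history.append({"role": "action", "content": step["function"]})
        history.append({"role": "observation", "content": result})
    raise RuntimeError("step budget exhausted before a final answer")


answer, trace = run_agent("Compute (2 + 3) * 4", scripted_model, TOOLS)
print(answer)
for entry in trace:
    print(entry["role"], "->", entry["content"])
```

The step budget matters in practice: without it, a model that never emits a final answer would loop indefinitely. A production loop would also validate the function name and arguments before executing anything.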
