© 2026 CheatGrid™. All rights reserved.

Large Language Models (LLMs) Cheat Sheet


Updated 2026-04-28

Large Language Models are transformer-based neural networks trained on massive text datasets to generate, understand, and manipulate human language at scale. At their core, LLMs use self-attention mechanisms to capture contextual relationships between tokens, enabling them to perform tasks ranging from translation and summarization to code generation and complex reasoning. The field has evolved rapidly—from foundational pre-training on trillions of tokens, through specialized fine-tuning and alignment techniques, to sophisticated reasoning models trained with reinforcement learning from verifiable rewards. A key insight: LLMs don't simply memorize text—their emergent abilities to reason in-context and solve unseen problems arise from the interplay of architecture, scale, and post-training dynamics, with 2025–2026 marking a shift toward agentic, multimodal, and long-context systems.

Quick Index: 147 entries · 22 tables

Table 1: Core Transformer Architecture

Component | Example | Description
Self-attention
scores = Q @ K.T / sqrt(d_k)
• Computes query-key-value relationships where each token attends to all positions
• forms the foundation of transformer parallelization by replacing recurrence.
Multi-head attention
heads = 8
Q, K, V = Linear(x, d_k)
• Splits attention into multiple parallel heads learning different representation subspaces
• each head computes scaled dot-product attention independently, then concatenates results.
Causal (masked) attention
mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
• Prevents tokens from attending to future positions via upper-triangular mask
• critical for autoregressive generation in decoder-only models like GPT.
Cross-attention
encoder_output → decoder
• Used in encoder-decoder models where decoder queries attend to encoder keys/values
• enables translation and seq2seq by bridging input-output representations.
Feed-forward network (FFN)
FFN(x) = ReLU(xW1 + b1)W2 + b2
• Two-layer MLP applied position-wise after attention
• typically expands dimension 4× then projects back, providing non-linearity and feature transformation.
Residual connections
output = x + SubLayer(x)
• Adds input directly to sublayer output enabling gradient flow through deep networks
• prevents vanishing gradients and allows training of 100+ layer transformers.
Layer normalization
LN(x) = γ(x - μ) / σ + β
• Normalizes activations across feature dimension for each token
• stabilizes training and enables higher learning rates.
Positional encoding
PE = sin(pos/10000^(2i/d))
• Injects order information into token embeddings
• self-attention is permutation-equivariant, so without explicit position signals the model cannot tell token order apart.
QK-Norm
q = RMSNorm(q); k = RMSNorm(k)
• Applies RMS normalization to query and key vectors before dot-product attention
• stabilizes training of very large models and is used in Qwen3, Trinity Large.
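
To make the self-attention and causal-mask rows concrete, here is a minimal pure-Python sketch of single-head scaled dot-product attention with a causal mask. The function names and the toy Q, K, V values are illustrative assumptions, not code from any particular library.

```python
import math

def softmax(row):
    # Subtract the max for numerical stability; exp(-inf) evaluates to 0.0.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask."""
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        # Score query i against every key; future positions (j > i) get -inf.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) if j <= i else float("-inf")
            for j, k in enumerate(K)
        ]
        weights = softmax(scores)
        # Weighted sum of value vectors.
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out

# Toy inputs (hypothetical): 3 tokens, d_k = 2.
Q = K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
attn_out = causal_attention(Q, K, V)
```

The first token can only attend to itself, so its output equals its own value vector; later tokens mix the values of all earlier positions.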

Table 2: Positional Encoding Variants

Method | Example | Description
RoPE (Rotary Position Embedding)
rotate(q) @ rotate(k).T
• Applies rotation matrices to query-key pairs encoding relative position via complex-plane rotation
• used in LLaMA and GPT-NeoX; extending far beyond the training length usually needs RoPE scaling or interpolation (see Table 13).
Learned absolute
pos_emb = Embedding(max_len, d_model)
• Trainable position embeddings learned during training
• used in BERT and GPT-2
• limited to max_len seen during training without extrapolation.
Absolute sinusoidal
PE(pos, 2i) = sin(pos/10000^(2i/d))
PE(pos, 2i+1) = cos(pos/10000^(2i/d))
• Original Transformer encoding using fixed sine/cosine functions at different frequencies
• provides unique position signals but doesn't explicitly encode relative distances.
ALiBi (Attention with Linear Biases)
attn + bias * (-1, -2, -3, ...)
• Adds linear penalty to attention scores based on distance
• no position embeddings needed
• excellent extrapolation to longer sequences than training context.
Relative positional encoding
bias = learned_bias[k - q]
• Encodes distance between positions rather than absolute indices
• better length generalization but computationally expensive for long sequences.
NoPE (No Positional Encoding)
no pos embeddings in global layers
• Eliminates positional embeddings in global attention layers entirely
• relies on architectural bias and training for order
• used in SmolLM3 global attention layers.
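
The RoPE row claims that rotating queries and keys encodes relative position. The sketch below checks that property numerically: the dot product of a rotated query and key depends only on their offset. The helper names and vectors are illustrative assumptions, and the pairing of adjacent dimensions is one common convention (some implementations pair dimension i with i + d/2 instead).

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive (even, odd) pairs of vec by angle pos * base^(-i/d).

    Assumes len(vec) is even.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        x, y = vec[i], vec[i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (3 positions apart) at two different absolute locations.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
```

Up to floating-point error, `score_near` and `score_far` agree, which is the sense in which RoPE is a relative encoding.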

Table 3: Model Architecture Variants

Type | Example | Description
Decoder-only
GPT-2, GPT-3, LLaMA, Qwen3
• Causal masked attention for autoregressive generation
• trained to predict next token
• dominates modern LLMs due to scalability and generation quality.
Encoder-only
BERT, RoBERTa
• Bidirectional attention processes entire sequence
• excels at understanding tasks like classification, NER, question answering
• cannot generate text autoregressively.
Encoder-decoder
T5, BART
• Separate encoder (bidirectional) and decoder (causal) with cross-attention
• optimal for seq2seq tasks like translation and summarization.
Mixture of Experts (MoE)
router(x) → top-k experts
• Sparsely activated architecture where gating network routes tokens to a subset of expert FFNs
• scales parameters without proportional compute increase (Mixtral, DeepSeek-V3).
Reasoning model
<think>...</think>\nAnswer
• Decoder-only model trained with RL to produce extended chain-of-thought in a scratchpad before the final answer
• emergent self-reflection and verification (o1, DeepSeek-R1).
Vision-language
CLIP, LLaVA, Gemini
• Integrates vision encoder (ViT) with language model via projection layer or cross-attention
• enables multimodal understanding from images and text jointly.

Table 4: Tokenization Algorithms

Algorithm | Example | Description
Byte-Pair Encoding (BPE)
"playing" → ["play", "ing"]
• Iteratively merges the most frequent adjacent symbol pair in the training corpus
• balances vocabulary size with coverage
• used in GPT-2, GPT-3—stores merge rules.
SentencePiece
treats spaces as token "▁" (U+2581)
• Language-agnostic tokenizer operating on raw text without pre-tokenization
• encodes whitespace as special character
• supports BPE and unigram models
• used in T5, LLaMA.
WordPiece
"unaffable" → ["un", "##aff", "##able"]
• Similar to BPE but selects merges based on likelihood maximization rather than frequency
• used in BERT
• saves final vocabulary only, not merge operations.
Unigram language model
probabilistic token selection
• Maintains vocabulary with token probabilities
• removes tokens iteratively to minimize loss
• allows multiple segmentations unlike greedy BPE/WordPiece.
Byte-level BPE
any byte sequence tokenizable
• Operates on raw UTF-8 bytes so any string is tokenizable without unknown tokens
• used in GPT-2 and GPT-4 (tiktoken)
• fully language-agnostic.
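
As a rough illustration of the BPE row, the toy trainer below learns merge rules from a small word-frequency table. It is a character-level sketch under simplifying assumptions (no byte fallback, no end-of-word marker, ties broken by first occurrence); the corpus and function name are hypothetical.

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Learn BPE merge rules from a {word: frequency} dict (toy version)."""
    vocab = {tuple(w): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties resolve to the first pair seen
        merges.append(best)
        # Apply the chosen merge to every word.
        merged = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        vocab = merged
    return merges, vocab

corpus = {"playing": 5, "played": 3, "singing": 4}
merges, vocab = bpe_train(corpus, num_merges=6)
```

The ordered `merges` list is what a BPE tokenizer stores and replays at encoding time, in contrast to WordPiece, which keeps only the final vocabulary.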

Table 5: Pre-training Objectives

Objective | Example | Description
Causal language modeling (CLM)
P(token_i | token_<i)
• Predict next token given left context only
• standard decoder-only objective maximizing likelihood of training sequences
• used in GPT family.
Masked language modeling (MLM)
P([MASK] | context)
• Randomly masks ~15% of tokens and predicts them from bidirectional context
• BERT's core pre-training objective enabling deep bidirectional representations.
Multi-token prediction (MTP)
predict tokens t+1, t+2, t+3
• Trains multiple prediction heads to predict several future tokens simultaneously
• improves sample efficiency and enables speculative decoding at inference (DeepSeek-V3: 1.8× speedup).
Denoising autoencoding
corrupt → reconstruct
• Masks or corrupts spans of text then trains model to reconstruct original
• T5 frames all tasks as text-to-text generation with varying corruption strategies.
Prefix language modeling
bidirectional prefix → causal
• Applies bidirectional attention to prefix then causal attention for continuation
• bridges encoder and decoder benefits (UniLM, GLM).
Contrastive learning
align(text, image)
• Trains model to match positive pairs (text-image) while separating negative pairs
• CLIP uses dual encoders with contrastive loss for vision-language alignment.

Table 6: Fine-Tuning Techniques

Technique | Example | Description
Supervised fine-tuning (SFT)
train on (input, output) pairs
• Standard gradient-based training on task-specific labeled data
• updates all parameters to adapt pre-trained model to downstream task.
Instruction tuning
"Translate: [text]" → output
• Fine-tunes on diverse instruction-following datasets formatted as explicit commands
• dramatically improves zero-shot task generalization (FLAN, InstructGPT).
LoRA (Low-Rank Adaptation)
ΔW = BA (rank r << d)
• Freezes base model and trains low-rank decomposition matrices injected into attention layers
• reduces trainable parameters by up to 10,000× (the figure reported for GPT-3 175B) while preserving quality.
QLoRA
4-bit base + LoRA adapters
• Quantizes base model to 4-bit precision and applies LoRA
• enables fine-tuning 65B models on a single 48GB GPU with minimal degradation.
DoRA (Weight-Decomposed LoRA)
W = m · (W₀ + BA) / ‖W₀+BA‖
• Decomposes pre-trained weights into magnitude and direction components
• trains direction via LoRA and magnitude separately
• consistently outperforms LoRA with no extra inference cost.
Adapter modules
insert bottleneck layers
• Adds small trainable feed-forward modules between frozen transformer layers
• modular approach allowing task-specific adapters without full model copies.
Prefix tuning
prepend trainable vectors
• Optimizes continuous task-specific vectors prepended to each layer
• keeps model frozen
• competitive with full fine-tuning on some tasks with 0.1% parameters.
P-tuning / Prompt tuning
optimize soft prompts
• Learns continuous prompt embeddings rather than discrete text
• more parameter-efficient than prefix tuning
• effectiveness increases with model scale.
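
The LoRA row's `ΔW = BA` can be sketched in a few lines: the frozen weight is applied as usual and a low-rank correction is added on top. The layer size (4096 × 4096) and rank (8) below are illustrative assumptions, and real implementations also apply an alpha/r scaling that is folded into `scale` here.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters: full fine-tuning vs. a LoRA adapter on one matrix."""
    full = d_out * d_in
    lora = rank * (d_in + d_out)  # B is d_out x r, A is r x d_in
    return full, lora

def matvec(M, v):
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

def lora_forward(x, W0, A, B, scale=1.0):
    """y = W0 x + scale * B (A x); W0 stays frozen, only A and B are trained."""
    base = matvec(W0, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]

full, lora = lora_param_counts(4096, 4096, rank=8)

# With B initialised to zeros (the usual LoRA init), the adapter is a no-op.
W0 = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]       # rank 1
B = [[0.0], [0.0]]
y = lora_forward([2.0, 3.0], W0, A, B)
```

For this single matrix the adapter trains 256× fewer parameters; the much larger ratios quoted for whole models come from adapting only a few matrices per layer.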

Table 7: Alignment and RLHF

Method | Example | Description
RLHF (Reinforcement Learning from Human Feedback)
reward model → PPO training
• Three-stage process: SFT, train reward model on human preferences, optimize policy via PPO
• used in ChatGPT and Claude—computationally expensive.
GRPO (Group Relative Policy Optimization)
sample G outputs → normalize rewards
• Samples a group of responses per prompt and estimates advantages by normalizing rewards within the group
• eliminates the critic model, halving memory vs. PPO
• used in DeepSeek-R1.
DPO (Direct Preference Optimization)
directly optimize preferences
• Bypasses reward model by directly optimizing policy on preference pairs using Bradley-Terry model
• simpler and more stable than RLHF with comparable results.
SimPO (Simple Preference Optimization)
avg log-prob as implicit reward
• Removes the reference model by using average log-probability of a response as implicit reward
• reports gains of up to 6.4 points over DPO on AlpacaEval 2 with no extra memory cost.
ORPO (Odds Ratio Preference Optimization)
combine SFT + preference in one loss
• Merges SFT and preference alignment into a single training objective using odds ratios
• eliminates reference model and separate SFT stage—one pass instead of two.
KTO (Kahneman-Tversky Optimization)
thumbs-up / thumbs-down labels
• Works with binary feedback instead of pairwise preferences
• cheaper data collection for production systems with like/dislike signals
• derived from prospect theory.
RLVR (RL with Verifiable Rewards)
unit test pass / math checker → reward
• Uses programmatic verifiers (unit tests, math checkers) as reward signal instead of human labels
• enables emergent self-reflection and verification for math and code tasks.
DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization)
clip-higher + token-level loss
• Extends GRPO for long chain-of-thought reasoning with entropy collapse prevention, dynamic sampling, and token-level policy gradient
• reported to surpass DeepSeek-R1-Zero-Qwen-32B on AIME 2024 using 50% of the training steps.
Constitutional AI
self-critique + revision
• Model critiques its own outputs against constitutional principles then revises
• reduces reliance on human feedback for harmlessness alignment.
RLAIF (RL from AI Feedback)
AI-generated preferences
• Replaces human labelers with AI-generated feedback for reward model training
• scales preference data collection
• effective for capability-focused alignment.
Reward modeling
score(response) from pairs
• Trains classifier to predict human preference between response pairs
• converts subjective preferences into scalar reward signal for RL optimization.
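
The GRPO and RLVR rows combine naturally: a programmatic verifier scores each sampled response, and advantages are the rewards normalised within the group. The sketch below assumes a trivial exact-match verifier and uses the population standard deviation; both are illustrative choices, not the reference implementation.

```python
import statistics

def verify(answer, expected):
    """Hypothetical verifiable reward: 1.0 for an exact match, else 0.0."""
    return 1.0 if answer.strip() == expected else 0.0

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (r - mean) / (std + eps) within one prompt's group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# G = 4 sampled answers to the same prompt (made-up values).
samples = ["12", "15", "7", "12"]
rewards = [verify(s, expected="12") for s in samples]
advantages = grpo_advantages(rewards)
```

Because the baseline is the group mean, no learned critic is needed. When every sample in a group gets the same reward, all advantages are zero and the group contributes no gradient, which is the case DAPO's dynamic sampling is designed to filter out.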

Table 8: Inference Optimization Techniques

Technique | Example | Description
KV caching
cache computed K, V
• Stores previously computed key-value pairs during autoregressive generation
• avoids redundant computation—essential for production inference efficiency.
Prefix caching
cache system prompt KV once
• Reuses KV cache for identical prompt prefixes across requests
• skips re-computing static content like system prompts
• API providers report up to 90% cost and 85% latency reduction for long prompts.
Flash Attention
tiling + recomputation
• IO-aware attention algorithm using block-wise computation and kernel fusion
• reduces memory bandwidth by 5–20× enabling 2–4× speedup on long sequences.
PagedAttention
block-level memory management
• Manages KV cache in non-contiguous blocks like OS paging
• reduces memory fragmentation so more requests fit per batch; vLLM reports 2–4× higher throughput than prior serving systems and up to 24× over Hugging Face Transformers.
Speculative decoding
draft model → verify in parallel
• Small draft model generates candidate tokens that large model verifies in parallel
• 2–3× speedup without changing output distribution.
Medusa
extra decoding heads predict t+1, t+2
• Adds multiple prediction heads to an LLM and verifies candidates via tree-based attention in parallel
• 2.2–3.6× speedup without a separate draft model.
Continuous batching
dynamic request batching
• Evicts finished sequences and immediately admits new ones (in-flight batching)
• maximizes GPU utilization vs. static batching.
Quantization (INT8/FP8)
convert FP16 → INT8
• Reduces precision of weights/activations to 8-bit or lower
• 2–4× memory reduction and speedup with <1% accuracy loss using calibration.
Flash Attention 2/3
improved kernel fusion
• Enhanced IO-aware kernels with lower SRAM usage and better parallelization
• Flash-3 up to 2× faster than Flash-2 on H100 with asynchrony optimizations.

Table 9: Sampling and Decoding Strategies

Strategy | Example | Description
Greedy decoding
argmax(logits)
• Selects highest probability token at each step
• deterministic and fast but prone to repetitive, low-quality outputs for creative tasks.
Temperature sampling
logits / T before softmax
• Scales logits by T—lower T (0.1–0.7) sharpens distribution
• higher T (1.0–2.0) increases randomness/creativity.
Top-p (nucleus) sampling
cumulative prob >= p=0.9
• Dynamically selects smallest set of tokens whose cumulative probability exceeds threshold p
• adapts to distribution shape—preferred for open-ended generation.
Top-k sampling
sample from top k=40 tokens
• Restricts sampling to k highest-probability tokens
• prevents sampling rare tokens but fixed k can be too restrictive or permissive depending on distribution.
Min-p sampling
filter tokens < min_p * max_prob
• Removes tokens with probability below min_p fraction of the maximum token probability
• often reported to hold quality better than top-p at high temperatures while still allowing creativity.
Beam search
keep top-k sequences
• Maintains k parallel hypotheses and expands most probable
• balances quality and diversity but computationally expensive
• common for translation.
Contrastive search
argmax((1-α)·p(v) - α·max_cos_sim(v, context))
• Selects tokens that are probable but distinct from the previous context via cosine similarity penalty
• reduces repetition while maintaining coherence.
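
The temperature, top-k, top-p, and min-p rows all reduce to "reshape or truncate the distribution, then sample". The following sketch implements each filter on a plain list of probabilities; function names and the example logits are assumptions for illustration.

```python
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def renormalize(probs, keep):
    """Zero out tokens not in keep and rescale the rest to sum to 1."""
    keep = set(keep)
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])

def top_p_filter(probs, p):
    # Smallest prefix of the sorted tokens whose cumulative mass reaches p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= p:
            break
    return renormalize(probs, keep)

def min_p_filter(probs, min_p):
    # Keep tokens whose probability is at least min_p times the top probability.
    threshold = min_p * max(probs)
    return renormalize(probs, [i for i, q in enumerate(probs) if q >= threshold])

logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits, temperature=0.7)
rng = random.Random(0)
token = rng.choices(range(len(logits)), weights=top_p_filter(probs, 0.9))[0]
```

Greedy decoding is the limiting case: as the temperature approaches zero, nearly all probability mass moves to the argmax token.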

Table 10: Training Optimizations

Technique | Example | Description
Mixed precision (FP16/BF16)
compute in FP16/BF16, keep FP32 master weights
• Uses 16-bit floats for storage/computation with FP32 master weights
• reduces memory by ~2× and accelerates training with Tensor Cores—BF16 preferred for stability.
Gradient checkpointing
recompute activations
• Trades compute for memory by recomputing forward activations during backward pass instead of storing
• enables 2–10× larger models at 20–30% slowdown.
Gradient accumulation
accumulate over N steps
• Simulates larger effective batch size by accumulating gradients across microbatches before update
• crucial for training large models on limited memory.
AdamW optimizer
decouple weight decay
• Fixes weight decay in Adam by applying it directly to weights, not gradient
• improves generalization
• standard optimizer for transformer training.
Learning rate warmup
linear increase 0 → max_lr
• Gradually increases learning rate from zero over initial steps (typically 2–10% of training)
• stabilizes training of large models and prevents divergence.
Cosine annealing
lr = min + 0.5(max-min)(1+cos(π·t/T))
• Decreases learning rate following cosine curve
• smooth decay helps model converge to flatter minima
• often combined with warmup.
GaLore (Gradient Low-Rank Projection)
project grad to low-rank subspace
• Projects gradients into a low-rank subspace via periodic SVD
• enables full-rank training with optimizer memory footprint comparable to LoRA
• no inference overhead.
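
The warmup and cosine-annealing rows are usually combined into one schedule. A minimal sketch, assuming linear warmup from zero followed by cosine decay to `min_lr`; the step counts and learning rates are placeholders.

```python
import math

def lr_at(step, total_steps, max_lr, min_lr=0.0, warmup_steps=0):
    """Linear warmup to max_lr, then cosine decay to min_lr at total_steps."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at(s, total_steps=1000, max_lr=3e-4, min_lr=3e-5, warmup_steps=100)
            for s in range(1001)]
```

The schedule peaks exactly at the end of warmup and reaches `min_lr` on the final step.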

Table 11: Distributed Training Strategies

Strategy | Example | Description
Data parallelism
replicate model across GPUs
• Each GPU holds full model copy processing different data batches
• gradients averaged across devices
• simplest parallelism but memory-limited by single GPU.
Tensor parallelism
split layers across GPUs
• Partitions individual layers (e.g., attention heads) within model across devices
• requires frequent all-reduce communication
• used in Megatron-LM for very large models.
ZeRO (Zero Redundancy Optimizer)
shard optimizer states
• Partitions optimizer states, gradients, parameters across devices with on-demand gathering
• ZeRO-3 achieves model parallelism memory efficiency with data parallelism simplicity.
FSDP (Fully Sharded Data Parallel)
PyTorch native ZeRO-3
• PyTorch implementation of ZeRO-3 sharding
• automatically manages parameter gathering/scattering
• simpler API than DeepSpeed for distributed training.
Pipeline parallelism
layer stages on different GPUs
• Splits model vertically by layers creating pipeline stages
• microbatching reduces bubble overhead
• GPipe, PipeDream frameworks.
Sequence parallelism
partition along sequence dim
• Splits sequence length across GPUs for memory-constrained layers like LayerNorm
• extends tensor parallelism reducing activation memory in long sequences.

Table 12: Model Compression Techniques

Technique | Example | Description
Post-training quantization
GPTQ, AWQ
• Converts trained model weights to lower precision (INT4/8) without retraining
• calibration on small dataset
• 3–4× compression with minimal accuracy loss.
Knowledge distillation
student learns from teacher
• Trains small student model to match outputs or intermediate representations of large teacher
• compresses model while retaining capabilities—DistilBERT example.
Pruning (structured/unstructured)
remove low-magnitude weights
• Removes unimportant weights or entire structured components (heads, layers)
• can reduce parameters 40–60% but requires careful calibration or retraining.
Quantization-aware training (QAT)
simulate quantization during training
• Inserts fake quantization operations in forward pass to model precision effects
• typically achieves better accuracy than post-training methods.
Low-rank decomposition
W ≈ UV (rank r)
• Approximates weight matrices as product of low-rank matrices
• reduces parameters in linear layers
• basis of LoRA and similar PEFT methods.
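
For the quantization rows, the simplest possible scheme is symmetric per-tensor "absmax" INT8 rounding, sketched below. This is a baseline illustration only; GPTQ and AWQ use calibration data and more careful error compensation, which this sketch does not attempt.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map [-max|w|, +max|w|] onto [-127, 127]."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 127 if absmax > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.12, -0.5, 0.33, 1.27, -1.0]   # made-up weights
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
```

The round-trip error per weight is bounded by half the scale, which is why a single outlier weight (a large absmax) degrades precision for the whole tensor and motivates per-channel or per-group scales.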

Table 13: Context Window and Long-Context Techniques

Technique | Example | Description
Retrieval-Augmented Generation (RAG)
retrieve docs → augment prompt
• Retrieves relevant documents from external knowledge base and injects into context
• extends effective knowledge beyond model limits
• requires good retrieval system.
RoPE scaling
adjust rotation frequencies
• Modifies RoPE base frequency to extrapolate to longer sequences
• simple method enabling 2–4× context extension with minimal fine-tuning.
Position interpolation
compress position indices
• Interpolates positions within training range rather than extrapolating
• better stability than direct extrapolation for extended context.
YaRN (Yet another RoPE extensioN)
scale + adjust NTK base
• Combines NTK-by-parts interpolation of RoPE frequencies with attention temperature scaling
• efficient context extension from 4K to 128K+ tokens.
Sliding window attention
attend to local window
• Each token attends only to fixed-size window around its position
• linear memory but limited long-range modeling
• used in Longformer.
Sparse attention
attend to subset of positions
• Computes attention only for selected position pairs using patterns (local, strided, global)
• reduces O(n²) complexity enabling 10×+ longer sequences.
Recurrent memory
compress past into memory
• Summarizes earlier context into compressed memory state
• enables unbounded context in theory but loses fine-grained information from distant past.

Table 14: Prompt Engineering Techniques

Technique | Example | Description
Few-shot prompting
Example1, Example2, ... Query
• Demonstrates task through 2–10 input-output examples in prompt
• exploits in-context learning
• effectiveness grows with model scale
• examples should be diverse.
Zero-shot prompting
"Translate to French: [text]"
• Provides task instruction only without examples
• relies on pre-training and instruction tuning
• quality highly dependent on model capabilities and prompt clarity.
Chain-of-thought (CoT)
"Let's think step by step"
• Prompts model to generate intermediate reasoning steps before final answer
• dramatically improves performance on math, logic, commonsense reasoning.
Self-consistency
sample multiple paths → vote
• Generates multiple reasoning paths then selects most consistent answer via majority voting
• improves reliability over single-path CoT.
ReAct (Reasoning + Acting)
Thought → Action → Observation
• Interleaves reasoning and tool use
• model generates thoughts, selects actions (API calls, searches), observes results iteratively until solution found.
Tree-of-thoughts
explore reasoning tree
• Explores multiple reasoning branches with backtracking and evaluation
• enables deliberate problem-solving for complex tasks requiring search.
Skeleton-of-thought
outline → parallel expand
• Generates a skeleton outline first then expands each point in parallel
• reduces end-to-end latency by up to 2× on modern hardware.
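
Self-consistency from the table above amounts to a majority vote over final answers. In the sketch below the sampled answers are hard-coded stand-ins for answers parsed from several chain-of-thought generations; the normalisation step (strip and lowercase) is an assumption about how answers are compared.

```python
from collections import Counter

def self_consistent_answer(sampled_answers):
    """Return the most common final answer and the fraction of samples agreeing."""
    counts = Counter(a.strip().lower() for a in sampled_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(sampled_answers)

# Hypothetical final answers extracted from 5 sampled reasoning paths.
samples = ["42", "41", "42 ", "42", "17"]
answer, agreement = self_consistent_answer(samples)
```

The agreement ratio doubles as a rough confidence signal: low agreement suggests the question needs more samples or a different strategy.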

Table 15: Emergent Capabilities and Scaling

Concept | Example | Description
Scaling laws
L(N) ∝ N^(-α)
• Loss scales as power law with compute, model size, dataset size
• predicts training compute allocation
• vocabulary size also affects optimal scaling.
Compute-optimal scaling
Chinchilla scaling
• For a fixed compute budget, model size and training tokens should grow in roughly equal proportion (about 20 tokens per parameter in the Chinchilla study)
• suggests many models are undertrained relative to size.
In-context learning
few-shot without gradients
• Ability to learn new tasks from examples in prompt without parameter updates
• improves with scale
• mechanism may involve induction heads in attention layers.
Emergent abilities
reasoning, arithmetic
• Capabilities that appear suddenly at scale not present in smaller models
• includes in-context learning and chain-of-thought reasoning
• debated whether truly emergent or metric artifacts.
Transfer learning
pre-train → fine-tune
• Pre-trained models encode general language understanding transferable to downstream tasks
• foundation of modern NLP—larger models transfer better.

Table 16: Activation Functions in Transformers

Function | Example | Description
SwiGLU
SwiGLU(x) = Swish(xW) ⊙ (xV)
• Gated variant using Swish activation (x·sigmoid(x)) with element-wise gating
• used in LLaMA, PaLM—empirically outperforms GELU
• requires ~50% more FFN parameters for same hidden size.
GELU (Gaussian Error Linear Unit)
GELU(x) = x·Φ(x)
• Smooth approximation applying Gaussian CDF
• used in BERT, GPT-2
• better gradient properties than ReLU
• probabilistic interpretation as neuron dropout.
GeGLU
GeGLU(x) = GELU(xW) ⊙ (xV)
• Similar to SwiGLU but uses GELU for gating
• strong performance on language tasks
• used in T5 variants.
ReLU (Rectified Linear Unit)
ReLU(x) = max(0, x)
• Simple piecewise linear function
• original Transformer used ReLU
• computationally efficient but can suffer from dead neurons
• largely replaced in modern LLMs.

Table 17: Normalization Techniques

Method | Example | Description
RMSNorm (Root Mean Square Normalization)
RMSNorm(x) = x / RMS(x) · γ
• Simplified LN removing mean centering—only normalizes by RMS
• 10–20% faster than LN with comparable performance
• used in LLaMA, Grok, Qwen3.
Layer Normalization (LN)
LN(x) = (x - μ) / σ · γ + β
• Normalizes across feature dimension for each token independently
• standard in transformers
• mean/variance computed per-sample allowing any batch size.
Pre-LN vs Post-LN
Pre: LN(x) → Sublayer
• Pre-LN applies normalization before sublayer (modern default—more stable)
• Post-LN applies after (original Transformer—requires warmup)
• Pre-LN enables easier convergence.
Dropout
randomly zero with p=0.1
• Randomly drops activations during training as regularization
• less common in very large LLMs which are underparameterized relative to data
• typical rates 0.1–0.2.
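
The difference between the LayerNorm and RMSNorm rows is easiest to see side by side: RMSNorm skips mean centering and the bias term. A minimal sketch over a single token vector, with unit gain and zero bias as default assumptions:

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """LN(x) = gamma * (x - mean) / sqrt(var + eps) + beta."""
    d = len(x)
    gamma = gamma or [1.0] * d
    beta = beta or [0.0] * d
    mu = sum(x) / d
    var = sum((v - mu) ** 2 for v in x) / d
    return [g * (v - mu) / math.sqrt(var + eps) + b for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma=None, eps=1e-5):
    """RMSNorm(x) = gamma * x / sqrt(mean(x^2) + eps); no centering, no bias."""
    d = len(x)
    gamma = gamma or [1.0] * d
    rms = math.sqrt(sum(v * v for v in x) / d + eps)
    return [g * v / rms for v, g in zip(x, gamma)]

x = [2.0, 4.0, 6.0, 8.0]
ln_out = layer_norm(x)
rms_out = rms_norm(x)
```

LayerNorm output has zero mean, while RMSNorm output keeps the sign and offset of the input and only fixes its scale. For inputs that already have zero mean the two coincide.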

Table 18: Evaluation Metrics and Benchmarks

Metric | Example | Description
MMLU (Massive Multitask Language Understanding)
57 subjects, 4-way multiple choice
• Tests knowledge and reasoning across STEM, humanities, social sciences
• standard benchmark for general capabilities
• 0–100% accuracy.
GPQA Diamond
198 PhD-level science questions (hardest subset of the 448-question GPQA set)
• Tests doctoral-level knowledge in biology, physics, chemistry
• designed to be hard even with internet access
• nearing saturation for frontier models (~94% as of 2026).
HumanEval
code synthesis benchmark
• Evaluates code generation with 164 hand-written programming problems
• pass@k metric measures functional correctness
• standard for coding models.
SWE-bench
GitHub issue → code fix
• Tests real-world software engineering—resolving GitHub issues in Python repos
• measures fraction of issues resolved
• key agentic coding benchmark.
LiveBench
monthly refreshed questions
• Contamination-resistant benchmark with questions refreshed monthly using recent data sources
• covers math, coding, reasoning, language, data analysis.
Arena Elo / Chatbot Arena
human preference pairwise ranking
• Crowdsourced pairwise preference evaluation
• Elo-rated from millions of blind votes
• strong signal for real-world conversational quality.
Perplexity
PPL = exp(avg_loss)
• Measures how surprised model is by test data
• lower is better
• exponential of average cross-entropy loss
• standard language modeling metric.
BLEU
n-gram precision with brevity
• Compares n-gram overlap between generated and reference translations
• 0–100 scale
• standard for machine translation evaluation.
ROUGE
ROUGE-L, ROUGE-N
• Measures recall-oriented n-gram overlap
• primarily for summarization
• ROUGE-L uses longest common subsequence.
BERTScore
contextual embedding similarity
• Computes token similarity using BERT embeddings rather than exact matches
• captures semantic similarity better than n-gram metrics.
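
Two of the metrics above have short closed forms: perplexity is the exponential of the average negative log-likelihood, and pass@k is commonly computed with the unbiased estimator from the HumanEval paper. The sketch below assumes natural-log token probabilities and per-problem counts as inputs.

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(mean negative log-likelihood), with natural-log probabilities."""
    avg_loss = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_loss)

def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: n samples generated, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# A model that assigns probability 0.25 to every token has perplexity 4.
ppl = perplexity([math.log(0.25)] * 4)
p1 = pass_at_k(n=10, c=3, k=1)
```

The benchmark score is the mean of `pass_at_k` over all problems; generating n > k samples per problem reduces the variance of the estimate.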

Table 19: Attention Mechanism Optimizations

Optimization | Example | Description
Grouped-query attention (GQA)
heads share K, V in groups
• Compromise between MQA and MHA: groups of heads share K, V
• balances quality-efficiency tradeoff
• used in LLaMA-2/3, Mistral, Qwen3.
Multi-Head Latent Attention (MLA)
compress KV to latent → cache
• Compresses keys and values into a low-rank latent vector (joint KV compression) before caching
• reduces KV cache by up to 93% vs. MHA with better modeling quality
• used in DeepSeek-V3, Kimi K2.
Multi-query attention (MQA)
single K, V across heads
• Shares same key-value projections across all heads using only multiple queries
• reduces KV cache memory and speeds inference but slightly lower quality than GQA.
Flash Attention 2/3
improved kernel fusion
• Enhanced IO-aware kernels with lower SRAM usage and better parallelization
• Flash-3 up to 2× faster than Flash-2 on H100 with asynchrony optimizations.
Sliding window attention
attend to k nearest tokens
• Restricts attention to fixed-size local window
• reduces complexity to O(n·k) from O(n²)
• enables longer sequences but limited global context.
Linear attention
kernel trick approximation
• Approximates softmax attention using kernel methods reducing complexity to O(n)
• enables efficient very long sequences but quality gaps remain vs. full attention.
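
The MHA, GQA, and MQA rows differ mainly in how many key/value heads must be cached. A back-of-the-envelope sketch of KV cache size follows; the model shape (32 layers, head dimension 128, 8192-token context, 16-bit cache) is a hypothetical configuration, not a specific released model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values (factor 2) for every layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

cfg = dict(n_layers=32, head_dim=128, seq_len=8192)  # hypothetical model shape

mha = kv_cache_bytes(n_kv_heads=32, **cfg)  # one K/V head per query head
gqa = kv_cache_bytes(n_kv_heads=8, **cfg)   # 4 query heads share each K/V head
mqa = kv_cache_bytes(n_kv_heads=1, **cfg)   # all query heads share one K/V head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**30:.3f} GiB per sequence")
```

The cache scales linearly with the number of KV heads, so GQA with 8 groups cuts it 4× relative to MHA in this configuration. MLA instead caches a compressed latent per token, which does not fit this formula.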

Table 20: Advanced Training Techniques

Technique | Example | Description
Multi-task learning
train on multiple tasks jointly
• Shares parameters across tasks expecting positive transfer
• T5 frames everything as text-to-text
• requires balanced sampling and task weighting.
Curriculum learning
easy → hard examples
• Orders training data from simple to complex
• can improve convergence and final performance
• domain-specific curriculum design needed.
Continual learning
incremental data updates
• Updates model on new data without forgetting previous knowledge
• addresses catastrophic forgetting through rehearsal, regularization, or architectural solutions.
Contrastive learning
SimCLR, CLIP
• Learns representations by contrasting positive pairs against negatives
• CLIP aligns text-image pairs
• effective for self-supervised and multimodal learning.
Data augmentation
backtranslation, paraphrasing
• Generates synthetic training variations from existing data
• back-translation, EDA, GPT-generated examples
• particularly useful for low-resource tasks.

Table 21: Model Merging Techniques

Technique | Example | Description
SLERP (Spherical Linear Interpolation)
t=0.5 between model A and B
• Smoothly interpolates between two models' weights in spherical space preserving geometric properties
• best for high-quality pairwise merges
• limited to two models at a time.
TIES-Merging
trim → elect sign → disjoint merge
• Three-step process: trim redundant parameters, elect dominant sign direction, merge aligned parameters
• handles multi-model merging by resolving parameter conflicts.
DARE (Drop And REscale)
drop delta weights p=0.9, rescale
• Randomly drops task-vector delta weights then rescales remaining by 1/(1−p)
• effective even dropping 90–99% of deltas
• used as augment for TIES or Task Arithmetic.
Task Arithmetic
task_vector = fine_tuned - pretrained
• Computes task vectors (delta weights) and combines via arithmetic
• add vectors to merge capabilities, negate to remove behaviors
• simple and composable.
Passthrough (layer stacking)
layers 0-32 of A + 24-32 of B
• Concatenates layers from different models to create frankenmerge with exotic parameter counts (e.g., 9B from two 7B models)
• experimental but can produce capable models.
Evolutionary merging
evolutionary search over merge configs
• Uses evolutionary algorithms to automatically discover optimal merging recipes and hyperparameters
• 50× cost reduction via MERGE³ on single GPU
• produces SOTA merged models.
Model Soup (weight averaging)
average weights of N fine-tuned models
• Averages weights of multiple fine-tuned versions of same base model
• improves accuracy without increasing inference cost
• greedy variant evaluates each addition.
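
Task Arithmetic and DARE from the table above operate directly on weight deltas. The sketch below uses tiny dictionaries of float lists as stand-ins for model state dicts; the parameter name `"w"` and all values are made up for illustration.

```python
import random

def task_vector(finetuned, pretrained):
    """Delta weights: fine-tuned minus pretrained, per parameter."""
    return {k: [f - p for f, p in zip(finetuned[k], pretrained[k])] for k in pretrained}

def dare(delta, drop_p, rng):
    """Drop each delta entry with probability drop_p; rescale survivors by 1/(1 - drop_p)."""
    keep_scale = 1.0 / (1.0 - drop_p)
    return {k: [0.0 if rng.random() < drop_p else d * keep_scale for d in v]
            for k, v in delta.items()}

def apply_task_vectors(base, deltas, weight=1.0):
    """Add (or, with a negative weight, subtract) task vectors to a base model."""
    merged = {k: list(v) for k, v in base.items()}
    for delta in deltas:
        for k, v in delta.items():
            merged[k] = [m + weight * d for m, d in zip(merged[k], v)]
    return merged

base = {"w": [1.0, 2.0, 3.0, 4.0]}
model_a = {"w": [1.5, 2.0, 3.0, 4.0]}
model_b = {"w": [1.0, 2.0, 2.0, 4.5]}

tv_a, tv_b = task_vector(model_a, base), task_vector(model_b, base)
merged = apply_task_vectors(base, [tv_a, tv_b])
sparse_a = dare(tv_a, drop_p=0.9, rng=random.Random(0))
restored = apply_task_vectors(model_a, [tv_a], weight=-1.0)
```

Here the two task vectors touch different coordinates, so simple addition merges them cleanly; TIES-Merging addresses the harder case where deltas overlap with conflicting signs.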

Table 22: LLM Agent Concepts

Concept | Example | Description
Tool use / Function calling
{"function": "search", "args": {...}}
• LLM selects and invokes external functions (APIs, search, code execution) via structured JSON
• the defining capability separating conversational models from agents.
ReAct loop
Thought → Action → Observation → …
• Agent iterates through Thought → Action → Observation cycles until the task is complete
• interleaves reasoning and acting for grounded multi-step problem solving.
Agent memory (short/long-term)
context window + vector store
• Short-term: the in-context window, which is cleared when the session ends
• Long-term: an external vector store or database that enables retrieval across sessions.
Planning and decomposition
task → subtask1, subtask2, ...
• Agent breaks large tasks into manageable subtasks via chain-of-thought or tree-of-thought
• can reflect and revise plans based on intermediate results.
Multi-agent framework
orchestrator → specialist agents
• Multiple LLM agents collaborate: one orchestrates, others specialize
• improves performance on tasks requiring diverse expertise or parallel execution.
Agentic RAG
agent decides when/what to retrieve
• Agent dynamically decides when to retrieve, what to search, and how to use results
• contrasts with static one-shot RAG by iterating retrieval based on partial answers.
RLVR for agents
env reward → policy update
• Trains agents using verifiable environment rewards (tool execution outcomes, test pass/fail) instead of learned reward models
• lets agents discover effective multi-step strategies without human demonstrations.
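The tool-calling and ReAct rows in Table 22 can be illustrated with a small loop. In this sketch the LLM is replaced by a hard-coded `scripted_policy`, and the tool registry, the JSON schema (`"function"` and `"args"` keys, following the table's example), and all function names are hypothetical; a real agent would obtain each thought and action from a model call.

```python
import json

# Hypothetical tool registry: tool name -> Python callable.
TOOLS = {
    "add": lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
}

def call_tool(call_json):
    """Parse a JSON tool call and dispatch it; failures come back as observations."""
    try:
        call = json.loads(call_json)
        return TOOLS[call["function"]](**call["args"])
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        return f"error: {exc!r}"

def scripted_policy(history):
    """Stand-in for an LLM: picks the next step from how many observations exist."""
    n_obs = sum(1 for kind, _ in history if kind == "observation")
    if n_obs == 0:
        action = {"function": "add", "args": {"a": 2, "b": 3}}
        return {"thought": "First compute 2 + 3.", "action": json.dumps(action)}
    if n_obs == 1:
        action = {"function": "multiply", "args": {"a": history[-1][1], "b": 4}}
        return {"thought": "Now multiply the sum by 4.", "action": json.dumps(action)}
    return {"thought": "The task is complete.", "final": history[-1][1]}

def react(policy, max_steps=5):
    """Run Thought -> Action -> Observation cycles until a final answer or step limit."""
    history = []
    for _ in range(max_steps):
        step = policy(history)
        history.append(("thought", step["thought"]))
        if "final" in step:
            return step["final"], history
        history.append(("action", step["action"]))
        history.append(("observation", call_tool(step["action"])))
    return None, history  # step budget exhausted without a final answer
```

Returning tool errors as observations, rather than raising, lets the policy see the failure and retry with a corrected call, and the `max_steps` cap keeps a confused policy from looping indefinitely.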
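The short-term versus long-term memory split in Table 22 can be sketched with a bounded buffer plus a toy vector store. The `AgentMemory` class, its method names, and the bag-of-words `embed` function are assumptions made for this example; a real system would use a learned embedding model and a dedicated vector database.

```python
import math
from collections import Counter, deque

def embed(text):
    """Toy embedding: bag-of-words counts standing in for a learned embedding."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

class AgentMemory:
    def __init__(self, window=4):
        self.short_term = deque(maxlen=window)  # recent turns kept in the context window
        self.long_term = []                     # (embedding, text) pairs kept across sessions

    def observe(self, text):
        """Record a turn in both the context buffer and the persistent store."""
        self.short_term.append(text)
        self.long_term.append((embed(text), text))

    def end_session(self):
        """Short-term memory is cleared; long-term memory survives."""
        self.short_term.clear()

    def recall(self, query, k=2):
        """Return the k stored texts most similar to the query."""
        q = embed(query)
        ranked = sorted(self.long_term, key=lambda item: cosine(q, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]
```

After `end_session()`, `recall("which deploy target")` can still surface an earlier note such as "deploy target is kubernetes", which is the cross-session retrieval the table describes.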

