Transformer Architecture Cheat Sheet

Updated 2026-05-25

The Transformer architecture, introduced in 2017 by Vaswani et al., revolutionized deep learning by replacing recurrent and convolutional layers with an attention-based mechanism that processes sequences in parallel. Unlike RNNs and LSTMs, transformers remove the step-by-step dependency between positions during training: self-attention lets every token attend to every other token at once, which enables massively parallel training that scales to hundreds of billions of parameters. At its core, the transformer relies on three key concepts: multi-head attention for capturing diverse relationships, positional encodings for sequence order awareness, and residual connections with layer normalization for stable deep network training. Understanding transformers is essential not only because they power modern language models such as GPT, BERT, LLaMA, and DeepSeek, but also because their architectural innovations (encoder-decoder structures, efficient attention variants, and hybrid architectures) have influenced every domain in AI, from vision (ViT) to multimodal and agentic systems.

What This Cheat Sheet Covers

This topic spans 20 focused tables and 115 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Core Attention Mechanisms
Table 2: Query, Key, Value Computations
Table 3: Positional Encoding Techniques
Table 4: Layer Normalization and Residual Connections
Table 5: Feed-Forward Network
Table 6: Encoder and Decoder Architectures
Table 7: Transformer Variants and Models
Table 8: Model Hyperparameters
Table 9: Training Techniques
Table 10: Optimization Algorithms
Table 11: Training Paradigms and Objectives
Table 12: Post-Training Alignment
Table 13: Parameter-Efficient Fine-Tuning (PEFT)
Table 14: Token Embeddings and Vocabulary
Table 15: Inference and Decoding Strategies
Table 16: Attention Complexity and Efficiency
Table 17: Inference Serving and Optimization
Table 18: Model Quantization
Table 19: Advanced Architectural Concepts
Table 20: Alternative Architectures

Table 1: Core Attention Mechanisms

Self-attention is the engine of every transformer block: it lets each token build a context-aware representation by attending to all other tokens simultaneously. The many variants below trade off memory, speed, and quality differently — choosing the right one is critical for modern LLM design.
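
As a rough illustration of how each token builds a context-aware representation, the sketch below runs single-head self-attention over a tiny random input using NumPy. The function names (`softmax`, `self_attention`), the random projection matrices, and the toy sizes are assumptions made for this example; they are not taken from any particular library or model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the per-row maximum first for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over x of shape (seq_len, d_model).

    Q, K, and V are all projections of the same input x.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    scores = (q @ k.T) / np.sqrt(d_k)    # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)   # each row is a distribution over tokens
    return weights @ v, weights

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape)              # (4, 8)
print(weights.sum(axis=-1))   # every row sums to 1, up to float error
```

Row i of `weights` says how strongly token i draws on each token in the sequence, and row i of `out` is the corresponding weighted mix of value vectors.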

Mechanism: Self-Attention
Example: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V
Description:
• Allows each position to attend to all positions in the same sequence
• Query, key, and value are all derived from the same input
• Forms the foundation of every transformer encoder and decoder block

Mechanism: Scaled Dot-Product Attention
Example:
score = (Q @ K.T) / sqrt(d_k)
attn_weights = softmax(score)
Description:
• Computes attention scores as the dot product of queries and keys, scaled by 1/sqrt(d_k) so that large scores do not push the softmax into saturated regions where gradients become very small
• Scaling keeps the variance of the scores stable as the key dimension d_k grows

Mechanism: Multi-Head Attention
Example:
8 heads with d_model = 512
head_dim = 512 / 8 = 64
Description:
• Splits Q, K, V into h parallel attention heads, each of dimension d_k = d_model / h
• Enables the model to jointly attend to information from different representation subspaces
• Head outputs are concatenated and linearly projected

Mechanism: Cross-Attention
Example:
Decoder attends to encoder:
Q = decoder, K = encoder, V = encoder
Description:
• Queries come from one sequence (decoder) while keys and values come from another (encoder)
• Allows the decoder to focus on relevant encoder positions
• Essential in encoder-decoder architectures for seq2seq tasks

Mechanism: Masked Self-Attention
Example:
Upper-triangular mask (added to the scores):
[[0,-inf,-inf],[0,0,-inf],[0,0,0]]
Description:
• Prevents positions from attending to subsequent positions by applying a causal mask before softmax
• Ensures token i only sees tokens at positions ≤ i
• Used in decoder-only models (GPT) and in the decoder blocks of encoder-decoder models
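
The multi-head and masked rows of the table can be combined in one small NumPy sketch. It reuses the table's example of 8 heads with a model dimension of 512 and a 3-token causal mask. The helper names (`causal_mask`, `multi_head_attention`), the output projection `w_o`, and the randomly initialized weights are assumptions made for illustration; a real model would learn these weights and would typically add batching, biases, and dropout.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the per-row maximum first for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def causal_mask(n):
    # Additive mask: 0 on and below the diagonal, -inf strictly above it.
    return np.triu(np.full((n, n), -np.inf), k=1)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads, mask=None):
    """x: (seq_len, d_model); every weight matrix: (d_model, d_model)."""
    seq_len, d_model = x.shape
    head_dim = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, head_dim)
        return t.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, seq, seq)
    if mask is not None:
        scores = scores + mask          # the mask broadcasts across heads
    weights = softmax(scores, axis=-1)  # masked positions get weight 0
    heads = weights @ v                 # (heads, seq, head_dim)
    # Concatenate the heads back to (seq_len, d_model), then project.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

# Sizes from the table: 8 heads, d_model = 512, so head_dim = 64.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 3, 512, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (
    rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)

out, weights = multi_head_attention(
    x, w_q, w_k, w_v, w_o, num_heads=num_heads, mask=causal_mask(seq_len)
)
print(out.shape)      # (3, 512)
print(weights.shape)  # (8, 3, 3)
```

With the causal mask applied, every entry of `weights` above the diagonal is zero, so changing a later token leaves the outputs for earlier positions unchanged. Passing `mask=None` gives the unmasked behavior used in encoder blocks.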
