Neural Networks Architecture Cheat Sheet

Updated 2026-04-28

Neural network architecture is the structural design of artificial neural systems, from foundational feedforward networks to specialized families: CNNs (convolutional, for images), RNNs/LSTMs (recurrent, for sequences), Transformers (attention-based, for parallelizable sequence processing), Diffusion Models (iterative denoising for generation), and State Space Models such as Mamba (linear-time sequence modeling). Each modern architecture emerged to address a specific challenge: CNNs tackle spatial pattern recognition via convolution, Transformers replace recurrence with self-attention and underpin state-of-the-art NLP and vision models, and Diffusion Transformers (DiTs) are displacing U-Nets as the backbone of many diffusion-based generative models. A critical insight: architecture choice shapes what a network can learn in practice. Residual connections in ResNet make it feasible to train networks of 152 layers and more, attention captures long-range dependencies, and selective state updates in Mamba aim for Transformer-quality modeling with compute that scales linearly rather than quadratically in sequence length.
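To make the "quadratic compute" remark concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is an illustration under simplifying assumptions, not a reference implementation: the function names are hypothetical, and the learned query/key/value projections used in real Transformers are omitted, so the input is used directly as Q, K, and V.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    """Scaled dot-product attention for n tokens of dimension d.

    The score matrix has one entry per (query, key) pair, i.e. n * n
    entries, which is where the quadratic cost in sequence length arises.
    """
    d = len(Q[0])
    scores = [
        [sum(q * k for q, k in zip(q_i, k_j)) / math.sqrt(d) for k_j in K]
        for q_i in Q
    ]
    weights = [softmax(row) for row in scores]  # each row sums to 1
    out = [
        [sum(w * v_j[c] for w, v_j in zip(w_row, V)) for c in range(len(V[0]))]
        for w_row in weights
    ]
    return out, weights

# Toy input (assumption): 3 tokens, 2 features, no learned projections.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = self_attention(X, X, X)
```

Doubling the number of tokens in `X` quadruples the size of `weights`, whereas a recurrent or state-space update touches each token once.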

What This Cheat Sheet Covers

This topic spans 25 focused tables and 195 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Foundational Network Types
Table 2: CNN Core Components
Table 3: Popular CNN Architectures
Table 4: RNN and Sequential Architectures
Table 5: Transformer Components
Table 6: Transformer and LLM Architecture Variants
Table 7: Efficient Attention Mechanisms
Table 8: Activation Functions
Table 9: Regularization Techniques
Table 10: Loss Functions
Table 11: Optimization Algorithms
Table 12: Learning Rate Scheduling
Table 13: Weight Initialization
Table 14: Advanced Architecture Patterns
Table 15: Training Challenges and Solutions
Table 16: GAN Architectures and Techniques
Table 17: Autoencoder Variants
Table 18: Diffusion Model Architectures
Table 19: Graph Neural Network Components
Table 20: Specialized Architectures
Table 21: Object Detection Architectures
Table 22: Vision Transformer Components
Table 23: Normalization Techniques
Table 24: Transfer Learning Strategies
Table 25: State Space Models and Hybrid Architectures

Table 1: Foundational Network Types

A field guide to the major architecture families, each invented to solve a problem the previous one couldn't. Feedforward nets handle tabular data, CNNs exploit spatial structure in images, RNNs and LSTMs carry state through sequences, Transformers replace recurrence with attention, and the generative families (VAEs, GANs, diffusion) learn to create rather than classify. The single most useful idea to take from this table is that the architecture you pick defines the kind of structure the network can exploit—there is no universal best.

Type	Example	Description
Feedforward Neural Network (FNN)	`input → hidden → output` (no cycles)	• Unidirectional information flow — neurons in one layer connect only to the next • used for tabular data and classification.
Multilayer Perceptron (MLP)	`input → fc1(256) → fc2(128) → output`	• FNN with multiple hidden layers and nonlinear activations • a universal function approximator in principle, given enough hidden units • its fully connected layers serve as building blocks inside most other architectures.
Convolutional Neural Network (CNN)	`conv → pool → conv → pool → flatten → fc`	• Specialized for spatial data (images) • learns hierarchical features via convolution filters exploiting spatial locality and parameter sharing.
Transformer	`multi-head attention + feedforward` (no recurrence)	• Attention-based architecture processing sequences in parallel • largely replaced RNNs in NLP and now competes with CNNs in vision • underlies GPT, BERT, and ViT through self-attention.
Recurrent Neural Network (RNN)	$h_t = \tanh(W_h h_{t-1} + W_x x_t)$	• Designed for sequential data with temporal dependencies • maintains hidden state across time steps • suffers from vanishing or exploding gradients on long sequences.
Long Short-Term Memory (LSTM)	`forget gate → input gate → output gate`	• RNN variant with gating mechanisms to preserve long-term dependencies • cell state acts as memory highway • state-of-the-art for sequences before Transformers.
Gated Recurrent Unit (GRU)	`update gate + reset gate` (simpler than LSTM)	• Simplified LSTM with fewer parameters • merges the LSTM's forget and input gates into a single update gate and has no separate cell state • often matches LSTM performance with faster training.
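
The RNN row's update rule, $h_t = \tanh(W_h h_{t-1} + W_x x_t)$, can be sketched in a few lines of plain Python. This is a minimal illustration only: the helper names and toy weights are hypothetical, the bias term is omitted to match the formula in the table, and a practical implementation would use a tensor library instead of nested lists.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(W_h, W_x, h_prev, x_t):
    """One recurrent update: h_t = tanh(W_h h_{t-1} + W_x x_t)."""
    a = matvec(W_h, h_prev)
    b = matvec(W_x, x_t)
    return [math.tanh(ai + bi) for ai, bi in zip(a, b)]

def rnn_forward(W_h, W_x, xs, h0):
    """Run the recurrence over a sequence and return every hidden state."""
    h = h0
    states = []
    for x_t in xs:
        h = rnn_step(W_h, W_x, h, x_t)  # the same weights are reused at every step
        states.append(h)
    return states

# Hypothetical toy weights: hidden size 2, input size 1.
W_h = [[0.5, 0.0], [0.0, 0.5]]
W_x = [[1.0], [-1.0]]
xs = [[1.0], [0.0], [0.0]]  # a single impulse followed by zeros
states = rnn_forward(W_h, W_x, xs, h0=[0.0, 0.0])
```

With these toy weights the impulse at the first step shrinks at every later step, a small-scale picture of how information (and, during training, gradient) fades across time steps in a plain RNN.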
