Neural Networks Architecture encompasses the structural design of artificial neural systems, from foundational feedforward networks to specialized architectures such as CNNs (convolutional, for images), RNNs/LSTMs (recurrent, for sequences), Transformers (attention-based, for parallelizable sequence processing), Diffusion Models (iterative denoising for generation), and State Space Models such as Mamba (linear-time sequence modeling). Modern architectures emerged from addressing specific challenges: CNNs tackle spatial pattern recognition via convolution, Transformers replace recurrence with self-attention and reach state-of-the-art results in NLP and vision, and Diffusion Transformers (DiTs) are displacing U-Nets as the backbone of generative models. A critical insight: architecture choice shapes what a network can learn. Residual connections in ResNet enable training networks of 152 or more layers, attention captures long-range dependencies, and selective state updates in Mamba aim for Transformer-quality modeling with compute that scales linearly rather than quadratically in sequence length.
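To make the quadratic cost of self-attention concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is an illustration under simplifying assumptions, not a reference implementation: the function names `softmax` and `self_attention` are hypothetical, the toy input `X` is arbitrary, and the learned query/key/value projections of a real Transformer are omitted, so the same vectors serve as queries, keys, and values.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    """Scaled dot-product attention for lists of equal-length vectors.

    Returns (outputs, weights). `weights` has one row per query and one
    column per key, so it is n x n for a sequence of length n.
    """
    d = len(K[0])  # key dimension used for scaling
    scores = [
        [sum(q * k for q, k in zip(q_i, k_j)) / math.sqrt(d) for k_j in K]
        for q_i in Q
    ]
    weights = [softmax(row) for row in scores]
    outputs = [
        [sum(w * v[c] for w, v in zip(w_row, V)) for c in range(len(V[0]))]
        for w_row in weights
    ]
    return outputs, weights

# Toy sequence of 3 tokens with 2 features each (values are arbitrary).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(X, X, X)
```

Every token attends to every other token, so the `weights` matrix grows with the square of the sequence length. That is the quadratic scaling that linear-time models such as Mamba are designed to avoid.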
What This Cheat Sheet Covers
This topic spans 25 focused tables and 195 indexed concepts. Below is a table-by-table outline, from foundational concepts through advanced details.
Table 1: Foundational Network Types
| Type | Example | Description |
|---|---|---|
| Feedforward Neural Network (FNN) | input → hidden → output (no cycles) | • Unidirectional information flow: neurons in one layer connect only to the next • used for tabular data and classification. |
| Multilayer Perceptron (MLP) | input → fc1(256) → fc2(128) → output | • FNN with multiple hidden layers and nonlinear activations • universal function approximator and backbone of fully connected layers. |
| Convolutional Neural Network (CNN) | conv → pool → conv → pool → flatten → fc | • Specialized for spatial data (images) • learns hierarchical features via convolution filters that exploit spatial locality and parameter sharing. |
| Transformer | multi-head attention + feedforward (no recurrence) | • Attention-based architecture that processes sequences in parallel • largely replaced RNNs in NLP and is widely adopted in vision • underlies GPT, BERT, and ViT through self-attention. |
| Recurrent Neural Network (RNN) | $h_t = \tanh(W_h h_{t-1} + W_x x_t)$ | • Designed for sequential data with temporal dependencies • maintains a hidden state across time steps • suffers from vanishing gradients over long sequences. |
| Long Short-Term Memory (LSTM) | forget gate → input gate → output gate | • RNN variant with gating mechanisms that preserve long-term dependencies • cell state acts as a memory highway • state of the art for sequences before Transformers. |
| Gated Recurrent Unit (GRU) | update gate + reset gate (simpler than LSTM) | • Simplified LSTM with fewer parameters • merges the forget and input gates into a single update gate • often matches LSTM performance with faster training. |
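
The recurrence listed for the RNN row, h_t = tanh(W_h h_{t-1} + W_x x_t), can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the helper names (`matvec`, `rnn_step`, `rnn_forward`) and the weight values are assumptions chosen for the example, and the bias term is omitted to match the formula in the table.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(W_h, W_x, h_prev, x_t):
    """One vanilla RNN update: h_t = tanh(W_h h_{t-1} + W_x x_t)."""
    recurrent = matvec(W_h, h_prev)
    inputs = matvec(W_x, x_t)
    return [math.tanh(r + i) for r, i in zip(recurrent, inputs)]

def rnn_forward(W_h, W_x, xs, h0):
    """Run the recurrence over a sequence and return every hidden state."""
    hidden_states, h = [], h0
    for x_t in xs:
        h = rnn_step(W_h, W_x, h, x_t)
        hidden_states.append(h)
    return hidden_states

# Illustrative weights: hidden size 2, input size 1 (values are arbitrary).
W_h = [[0.5, 0.0], [0.0, 0.5]]
W_x = [[1.0], [-1.0]]
xs = [[1.0], [0.0], [0.0]]  # a single impulse followed by zeros
hs = rnn_forward(W_h, W_x, xs, h0=[0.0, 0.0])
```

With these particular weights, the first hidden unit shrinks at every step after the impulse, which gives a small-scale picture of how information from early inputs can fade in a plain RNN. The gates in LSTM and GRU cells are designed to counter that fading.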