Edge AI and TinyML (Tiny Machine Learning) bring machine learning inference directly to resource-constrained devices like microcontrollers, embedded systems, and IoT endpoints. Edge AI runs on moderately powerful edge devices (~100mW to several watts), while TinyML pushes ML capabilities onto ultra-low-power microcontrollers operating at milliwatt-level consumption (often <1mW idle). The key innovation is deploying optimized neural networks directly on-device rather than relying on cloud servers, enabling real-time inference with enhanced privacy, reduced latency, and minimal connectivity dependence. Successful Edge AI deployment hinges on aggressive model optimization (quantization, pruning, knowledge distillation), understanding hardware accelerator capabilities (NPU, DSP, GPU delegation), and navigating the tradeoff triangle of accuracy, latency, and power consumption.
What This Cheat Sheet Covers
This topic spans 17 focused tables and 108 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Core Concepts and Definitions
Before optimizing anything, it helps to fix the vocabulary that everyone in this field throws around: the difference between TinyML and the broader Edge AI umbrella, why latency and memory footprint dominate every decision, and where the line between running on-device and falling back to the cloud actually sits.
| Concept | Example | Description |
|---|---|---|
| TinyML | ML inference on Arduino Nano 33 BLE Sense (256KB RAM) | • Machine learning running on microcontrollers with <1MB memory and milliwatt-scale power • focuses on ultra-constrained environments where every kilobyte matters |
| Edge AI | Object detection on NVIDIA Jetson Nano | • Broader category encompassing ML inference on edge devices from powerful SBCs to smartphones • typically 100mW-10W power range with megabytes to gigabytes of memory |
| On-device inference | Real-time face recognition on iPhone Neural Engine | • Executing trained ML models locally on end-user devices without cloud connectivity • model weights and computations stay on the device |
| Model deployment | Converting TensorFlow model to TFLite and flashing to ESP32 | • Process of converting, optimizing, and embedding a trained model into firmware on target hardware • includes format conversion and integration with application code |
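
The deployment row above mentions embedding a trained model into firmware. One common way to do that is to turn the converted model file into a byte array that gets compiled into the application, much like `xxd -i` does. The sketch below illustrates that step in plain Python, along with a rough weight-storage estimate that shows why bit width matters on a 256KB-class device. The function names, the `g_model` symbol, the 16-byte alignment, and the 100,000-parameter figure are illustrative assumptions, not part of any particular toolchain.

```python
"""Sketch: size a model's weights and embed its bytes as a C++ array."""


def estimate_weight_bytes(num_params: int, bits_per_weight: int) -> int:
    """Storage for the weights alone, rounded up to whole bytes.

    Ignores activations, runtime overhead, and file-format metadata,
    so a real footprint will be larger than this figure.
    """
    return (num_params * bits_per_weight + 7) // 8


def bytes_to_c_array(data: bytes, name: str = "g_model", per_line: int = 12) -> str:
    """Render raw bytes as C++ source text, similar in spirit to `xxd -i`.

    The `alignas(16)` qualifier is an assumption; check what alignment
    the target inference runtime actually expects.
    """
    lines = []
    for start in range(0, len(data), per_line):
        chunk = data[start:start + per_line]
        lines.append("  " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    body = "\n".join(lines)
    return (
        f"alignas(16) const unsigned char {name}[] = {{\n{body}\n}};\n"
        f"const unsigned int {name}_len = {len(data)};\n"
    )


if __name__ == "__main__":
    # Hypothetical 100,000-parameter model: float32 vs int8 weights.
    for bits in (32, 8):
        kib = estimate_weight_bytes(100_000, bits) / 1024
        print(f"{bits}-bit weights: {kib:.1f} KiB")

    # Placeholder bytes standing in for a converted model file.
    placeholder = bytes(range(20))
    print(bytes_to_c_array(placeholder, name="g_model"))
```

In practice the `placeholder` bytes would be replaced by the contents of the converted model file, and the generated text would be saved as a source file in the firmware project.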