Object detection sits at the core of computer vision: it locates and classifies every object of interest in an image, outputting both class labels and bounding box coordinates. The field has split into two dominant families: two-stage detectors like Faster R-CNN that propose regions before classifying them, and one-stage detectors like the YOLO series that predict all boxes in a single forward pass, trading a small accuracy margin for dramatic speed gains. A third paradigm, transformer-based detection (DETR and its descendants), reformulates detection as a set-prediction problem with no anchor heuristics and no NMS. Understanding which family suits a deployment target, and how each model's backbone, neck, and head interact, is the key to getting the most out of any detection pipeline.
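Since NMS (non-maximum suppression) comes up repeatedly when comparing these families, a minimal sketch of the greedy, class-agnostic variant may help make it concrete. The function names `iou` and `greedy_nms`, the corner box format `(x1, y1, x2, y2)`, and the 0.5 threshold are illustrative assumptions, not any particular library's API; production pipelines typically run a vectorized, per-class version.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed corner format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two corner-format boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: List[Box], scores: List[float],
               iou_thresh: float = 0.5) -> List[int]:
    """Return indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap heavily with a box already kept.
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep


# Two near-duplicate boxes on one object, plus one box elsewhere.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
print(kept)
```

Dense one-stage and two-stage detectors usually need a step like this to collapse duplicate predictions for the same object; set-prediction detectors are trained so that each object is claimed by a single prediction, which is why they can drop it.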
What This Cheat Sheet Covers
This topic spans 18 focused tables and 107 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.
Table 1: Object Detection Paradigms
One-stage, two-stage, and transformer-based detection are three fundamentally different approaches to the same problem. Choosing between them depends on your latency budget, your accuracy requirements, and whether you want anchor-free simplicity or are comfortable tuning anchor hyperparameters.
| Paradigm | Example | Description |
|---|---|---|
| One-stage | `model = YOLO("yolo26n.pt"); results = model("img.jpg")` | Predicts class and box in a single forward pass over a dense grid; faster than two-stage, but historically at a slight cost in accuracy. |
| Two-stage | Stage 1: RPN → ROI proposals; Stage 2: classify + refine ROIs | Generates region proposals first, then classifies and refines each; higher accuracy on small or dense objects at the cost of latency. |
| Transformer-based | `model = RTDETR("rtdetr-l.pt")`; no NMS, no anchors | Formulates detection as set prediction using Hungarian matching; eliminates anchors and NMS entirely. |
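To illustrate the set-prediction idea in the transformer-based row, the sketch below matches predictions to ground-truth objects one-to-one by minimizing a total cost. It makes several assumptions: the cost is a simplified "negative class probability plus L1 box distance" (DETR-style costs also include a GIoU term and weighting factors), boxes are normalized `(cx, cy, w, h)` tuples, and a brute-force search over permutations stands in for the Hungarian algorithm. The brute-force search finds the same optimal assignment on tiny inputs but does not scale. All names here are hypothetical.

```python
from itertools import permutations
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]   # assumed normalized (cx, cy, w, h)
Pred = Tuple[List[float], Box]            # (class probabilities, box)
Target = Tuple[int, Box]                  # (class id, box)


def match_cost(pred: Pred, target: Target) -> float:
    """Simplified matching cost: -p(target class) + L1 box distance."""
    probs, pbox = pred
    cls, tbox = target
    l1 = sum(abs(p - t) for p, t in zip(pbox, tbox))
    return -probs[cls] + l1


def optimal_assignment(preds: List[Pred], targets: List[Target]) -> Dict[int, int]:
    """Map each target index to a distinct prediction index at minimum total cost.

    Brute force over permutations: a stand-in for the Hungarian algorithm,
    usable only for very small inputs.
    """
    best_cost, best = float("inf"), ()
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[p], targets[t]) for t, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best = cost, perm
    return {t: p for t, p in enumerate(best)}


preds = [
    ([0.1, 0.9], (0.5, 0.5, 0.2, 0.2)),
    ([0.8, 0.2], (0.2, 0.2, 0.1, 0.1)),
    ([0.5, 0.5], (0.9, 0.9, 0.1, 0.1)),
]
targets = [
    (0, (0.2, 0.2, 0.1, 0.1)),
    (1, (0.5, 0.5, 0.2, 0.2)),
]

matches = optimal_assignment(preds, targets)
# Predictions left unmatched are trained toward a "no object" class.
unmatched = [i for i in range(len(preds)) if i not in matches.values()]
print(matches, unmatched)
```

Because every ground-truth object is tied to exactly one prediction during training, the model is pushed to emit one box per object, so no duplicate-removal step is needed at inference.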