Embeddings are dense vector representations that map discrete data (text, images, code, audio, graphs) into continuous high-dimensional spaces where semantic similarity corresponds to geometric proximity. They power modern AI applications including search, retrieval-augmented generation (RAG), recommendation systems, and classification. Unlike sparse representations that only encode the presence or absence of features, embeddings capture nuanced meaning and relationships through learned patterns, so machines can compare, cluster, and reason about complex data using distance metrics. The field has shifted rapidly from static word embeddings toward instruction-aware, decoder-only LLMs that achieve state-of-the-art (SOTA) results on the MTEB leaderboard.
What This Cheat Sheet Covers
This topic spans 24 focused tables and 129 indexed concepts. Below is a table-by-table outline, from foundational concepts through advanced details.
Table 1: Core Embedding Concepts
| Concept | Example | Description |
|---|---|---|
| Embedding vector | `[0.12, -0.45, 0.89, ..., 0.33]` (768-D) | Maps discrete tokens or objects into continuous numerical vectors whose dimensions jointly encode learned semantic features; typical sizes range from 128 to 4096 dimensions. |
| Semantic similarity | `similarity("dog", "puppy") > similarity("dog", "car")` | Embeddings encode meaning such that semantically related concepts cluster together in vector space, enabling similarity search via distance calculation. |
| Dimensionality | `text-embedding-3-small`: 1536-D; `text-embedding-3-large`: 3072-D | Number of values in each vector; higher dimensions can capture more nuanced distinctions but increase memory and compute cost. |
| Embedding space | Learned 768-D manifold preserving semantic structure | High-dimensional geometric space where embeddings live; distance and direction encode relationships between data points. |
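
The similarity and dimensionality rows above can be illustrated with a minimal sketch. It assumes cosine similarity as the distance measure and float32 storage (4 bytes per value); neither is specified in the table. The 4-D vectors are hand-picked toy values, not output from any real model, and the helper names `cosine_similarity` and `storage_bytes` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def storage_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw storage for the vectors alone (assumes float32 by default, no index overhead)."""
    return num_vectors * dims * bytes_per_value


# Hypothetical toy embeddings, chosen by hand for illustration only.
dog = [0.9, 0.8, 0.1, 0.0]
puppy = [0.8, 0.9, 0.2, 0.1]
car = [0.1, 0.0, 0.9, 0.8]

sim_dog_puppy = cosine_similarity(dog, puppy)
sim_dog_car = cosine_similarity(dog, car)

# With these toy values, the related pair scores higher than the unrelated pair.
print(sim_dog_puppy > sim_dog_car)

# Raw bytes needed for one million vectors at the two sizes listed in the table.
print(storage_bytes(1_000_000, 1536))
print(storage_bytes(1_000_000, 3072))
```

Under these assumptions, doubling the dimensionality from 1536 to 3072 doubles the raw storage per vector, which is the memory trade-off noted in the dimensionality row.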