NL-to-SQL and Text-to-Code Generation Cheat Sheet

Updated 2026-05-18

Next Topic: Ollama (Local LLM Runtime) Cheat Sheet

Natural Language to SQL (NL-to-SQL) and text-to-code generation enable users to interact with databases and generate code using plain language, removing the need for manual query writing. These systems rely on large language models (LLMs), retrieval-augmented generation (RAG), and semantic parsing to translate intent into executable code. The field has rapidly evolved from basic template-based approaches to sophisticated multi-agent architectures with self-correction, achieving over 85% accuracy on benchmarks like BIRD and Spider. Key challenges include handling large schemas (100+ tables), cross-domain generalization, ambiguity resolution, and SQL dialect differences. Production systems must balance accuracy, latency, and security while managing context windows that can exceed 200K tokens for enterprise databases. Understanding when to use schema linking, query decomposition, execution feedback loops, and validation strategies determines whether a system simply generates SQL or delivers reliable, scalable data access.

What This Cheat Sheet Covers

This topic spans 15 focused tables and 103 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Schema-Aware Retrieval TechniquesTable 2: In-Context Learning and Prompt EngineeringTable 3: Query Decomposition and ReasoningTable 4: Validation and Self-CorrectionTable 5: Benchmark Datasets and EvaluationTable 6: Prominent Architectures and FrameworksTable 7: Model Training and Fine-TuningTable 8: SQL Dialect and Cross-Domain HandlingTable 9: Semantic Parsing and NLU ComponentsTable 10: Multi-Agent and Agentic PatternsTable 11: Security and Safety PatternsTable 12: Context Length and OptimizationTable 13: Code Generation Quality MetricsTable 14: Advanced SQL Features and PatternsTable 15: Natural Language to Pandas and Python

Table 1: Schema-Aware Retrieval Techniques

Schema-aware retrieval identifies which tables, columns, and relationships are relevant to a user's question before generating SQL. In databases with hundreds of tables, sending the entire schema to an LLM exceeds context limits and degrades accuracy. Modern approaches use vector embeddings, semantic similarity, and two-stage retrieval to select only the 5-15 most relevant schema elements, reducing hallucinations and improving generation quality.

Technique	Example	Description
Schema Linking	User: "sales by region" → Links to `orders.region`, `sales.amount`	• Maps natural language terms to specific database tables and columns using semantic similarity • the single most critical step for accurate SQL generation, directly impacting downstream query correctness
Two-Stage Retrieval	Stage 1: Retrieve top 20 tables Stage 2: Retrieve columns from top 5	• First retrieves candidate tables, then retrieves detailed column metadata only for selected tables • reduces context size by 80-90% for large schemas while preserving relevant information
Vector Embeddings for Schema	`CREATE TABLE users` → embedding Query embedding → cosine similarity	• Encodes table and column definitions as dense vectors for semantic search • captures synonyms and domain terminology better than keyword matching (e.g., "customer" matches "client" table).
Metadata-Agnostic Representations	Align NL question and SQL query in shared space → Retrieves similar examples	• Learns joint embeddings of natural language questions and SQL queries for in-context example selection • improves few-shot accuracy by 15-20% by finding structurally similar past queries

Table 1: Schema-Aware Retrieval Techniques

Technique	Example	Description
Schema Linking	User: "sales by region" → Links to `orders.region`, `sales.amount`	• Maps natural language terms to specific database tables and columns using semantic similarity • the single most critical step for accurate SQL generation, directly impacting downstream query correctness
Two-Stage Retrieval	Stage 1: Retrieve top 20 tables Stage 2: Retrieve columns from top 5	• First retrieves candidate tables, then retrieves detailed column metadata only for selected tables • reduces context size by 80-90% for large schemas while preserving relevant information
Vector Embeddings for Schema	`CREATE TABLE users` → embedding Query embedding → cosine similarity	• Encodes table and column definitions as dense vectors for semantic search • captures synonyms and domain terminology better than keyword matching (e.g., "customer" matches "client" table).
Metadata-Agnostic Representations	Align NL question and SQL query in shared space → Retrieves similar examples	• Learns joint embeddings of natural language questions and SQL queries for in-context example selection • improves few-shot accuracy by 15-20% by finding structurally similar past queries