Natural Language Processing (NLP) is the branch of artificial intelligence concerned with enabling computers to understand, interpret, and generate human language. At its core, NLP converts unstructured text into structured representations that algorithms can operate on. The field spans everything from simple preprocessing tasks like tokenization and stemming to contextual understanding through transformer-based models and large language models (LLMs). A key mental model is the processing pipeline: raw text passes through preprocessing stages (cleaning, tokenization, normalization), is converted into numerical representations (embeddings, vectors), and flows through analysis layers (syntactic, semantic) to produce actionable insights or generated language. Following this progression, from words as symbols, to words as vectors in semantic space, to words interpreted in context by LLMs, is what makes it possible to build systems that model context, intent, and meaning rather than only parsing surface text.
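The pipeline described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, with simple word-count vectors standing in for learned embeddings and cosine similarity standing in for the analysis layer. The function names (`preprocess`, `vectorize`, `cosine`) and the sample sentences are hypothetical and chosen only for this example.

```python
import math
import re
from collections import Counter


def preprocess(text: str) -> list[str]:
    """Clean, tokenize, and normalize: lowercase, then keep runs of letters/digits."""
    return re.findall(r"[a-z0-9']+", text.lower())


def vectorize(tokens: list[str], vocab: list[str]) -> list[int]:
    """Map tokens to a count vector over a fixed vocabulary (bag of words)."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


def cosine(a: list[int], b: list[int]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Stage 1: raw text -> tokens
docs = [
    "The cat sat on the mat.",
    "A cat sat on a mat!",
    "Stocks fell sharply today.",
]
tokens = [preprocess(d) for d in docs]

# Stage 2: tokens -> numerical representation
vocab = sorted({t for doc in tokens for t in doc})
vectors = [vectorize(t, vocab) for t in tokens]

# Stage 3: analysis on the vectors
print(cosine(vectors[0], vectors[1]))  # the two cat sentences share words
print(cosine(vectors[0], vectors[2]))  # no shared words between these two
```

Count vectors only capture word overlap. Embeddings and contextual models replace stage 2 with representations in which related words land close together even when the surface forms differ.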
What This Cheat Sheet Covers
This topic spans 22 focused tables and 157 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Text Preprocessing Fundamentals
| Technique | Example | Description |
|---|---|---|
| Tokenization | `text.split()` → `['The', 'cat', 'sat']` | • Splits text into individual units (tokens) such as words, subwords, or characters. • The foundation of all NLP pipelines. |
| Lowercasing | `"Hello World".lower()` → `"hello world"` | Converts all characters to lowercase to reduce vocabulary size and treat "Hello" and "hello" as the same token. |
| Stop word removal | Remove `['the', 'is', 'at']` from a sentence | • Eliminates high-frequency, low-information words (e.g., "the", "is", "and") to reduce noise. • Use cautiously, as context can matter in sentiment analysis. |
| Punctuation removal | `"Hello, world!"` → `"Hello world"` | • Strips punctuation marks to simplify text and reduce feature dimensionality. • May lose meaning in special cases (e.g., "don't" vs "dont"). |
| Stemming | `"running"` → `"run"` | • Chops word endings using heuristic rules (e.g., Porter, Snowball) to derive a root form. • Fast but may produce non-words like "troubl" from "trouble". |