Natural Language Processing (NLP) Cheat Sheet

Updated 2026-04-28

Natural Language Processing (NLP) is the branch of artificial intelligence concerned with enabling computers to understand, interpret, and generate human language in ways that are both meaningful and useful. At its core, NLP bridges the gap between human communication and machine processing by converting unstructured text into structured representations that algorithms can operate on. The field spans everything from simple text preprocessing tasks like tokenization and stemming to advanced contextual understanding through transformer-based models and large language models (LLMs). A key mental model is the processing pipeline: raw text enters through preprocessing stages (cleaning, tokenization, normalization), transforms into numerical representations (embeddings, vectors), and flows through analysis layers (syntactic, semantic) to produce actionable insights or generated language. Understanding this progression—from words as symbols to words as vectors in semantic space to words understood in context by LLMs—unlocks the ability to build systems that not only parse language but comprehend context, intent, and meaning.
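The pipeline described above can be illustrated with a minimal standard-library sketch: raw strings are preprocessed into tokens, then mapped to count vectors over a shared vocabulary (a simple bag-of-words representation). The sample sentences and the names `docs`, `vocab`, and `to_vector` are illustrative assumptions, not part of any library, and real systems typically use learned embeddings rather than raw counts.

```python
from collections import Counter

# Hypothetical sample documents, chosen only for illustration.
docs = ["The cat sat on the mat", "The dog sat"]

# Stage 1: preprocessing (lowercasing plus whitespace tokenization).
tokenized = [doc.lower().split() for doc in docs]

# Stage 2: build a vocabulary; each distinct token gets one vector position.
vocab = sorted({tok for doc in tokenized for tok in doc})

# Stage 3: represent each document as a vector of token counts.
def to_vector(tokens):
    counts = Counter(tokens)
    return [counts[tok] for tok in vocab]

vectors = [to_vector(doc) for doc in tokenized]

print(vocab)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Count vectors like these ignore word order and context, which is the gap that the embedding and transformer-based representations covered in later tables are meant to close.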

What This Cheat Sheet Covers

This topic spans 22 focused tables and 157 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Text Preprocessing Fundamentals
Table 2: Tokenization Strategies
Table 3: Text Vectorization and Representation
Table 4: Contextual Embeddings and Transformers
Table 5: Part-of-Speech Tagging and Syntactic Analysis
Table 6: Named Entity Recognition and Information Extraction
Table 7: Sentiment Analysis and Text Classification
Table 8: Sequence Labeling and Tagging
Table 9: Language Modeling and Text Generation
Table 10: Topic Modeling and Document Analysis
Table 11: Machine Translation and Sequence-to-Sequence
Table 12: Question Answering and Information Retrieval
Table 13: Text Summarization
Table 14: Semantic Analysis and Understanding
Table 15: Text Similarity and Matching
Table 16: Speech and Audio Processing
Table 17: Language Detection and Multilingual NLP
Table 18: Data Augmentation for NLP
Table 19: Evaluation Metrics for NLP
Table 20: Popular NLP Libraries and Frameworks
Table 21: Prompt Engineering and In-Context Learning
Table 22: LLM Fine-Tuning and Alignment

Table 1: Text Preprocessing Fundamentals

Before any model touches your text, it has to be cleaned and standardized — and the choices you make here ripple through everything downstream. These are the everyday cleanup steps that turn messy raw strings into consistent tokens: splitting words, folding case, dropping stop words and punctuation, and collapsing variants down to a common root. A few of them are lossy by design, so the real skill is knowing which to apply for your task and which to skip.

Technique	Example	Description
Tokenization	`text.split()` → `['The', 'cat', 'sat']`	• Splits text into individual units (tokens) such as words, subwords, or characters • the foundation of all NLP pipelines.
Lowercasing	`"Hello World".lower()` → `"hello world"`	Converts all characters to lowercase to reduce vocabulary size and treat "Hello" and "hello" as the same token.
Stop words removal	Remove `['the', 'is', 'at']` from sentence	• Eliminates high-frequency, low-information words (e.g., "the", "is", "and") to reduce noise • use cautiously: many stop-word lists include negations such as "not", so removal can flip the meaning of a sentence in sentiment analysis.
Punctuation removal	`"Hello, world!"` → `"Hello world"`	• Strips punctuation marks to simplify text and reduce feature dimensionality • may lose meaning in special cases (e.g., "don't" vs "dont").
Stemming	`"running"` → `"run"`	• Chops word endings using heuristic rules (e.g., Porter, Snowball) to derive a root form • fast but may produce non-words like "troubl" from "trouble".
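The five steps in the table can be chained into one small function. The sketch below uses only the Python standard library; the stop-word set is a deliberately tiny illustrative list, and `toy_stem` is a hypothetical, crude suffix stripper that stands in for a real stemmer such as Porter or Snowball (it is not either algorithm and will mis-stem many words).

```python
import string

# Tiny illustrative stop-word list (an assumption); real lists are far longer.
STOP_WORDS = {"the", "is", "at", "a", "an", "and", "on"}

def toy_stem(token: str) -> str:
    """Crude suffix stripper for illustration only; NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            stem = token[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return token

def preprocess(text: str) -> list[str]:
    text = text.lower()                                   # lowercasing
    text = text.translate(
        str.maketrans("", "", string.punctuation))        # punctuation removal
    tokens = text.split()                                 # whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return [toy_stem(t) for t in tokens]                  # stemming

print(preprocess("The cat is running at the park!"))
# ['cat', 'run', 'park']
```

Every step here is lossy: after this function runs, casing, punctuation, and the removed words cannot be recovered, so it is worth applying only the steps the downstream task actually benefits from.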
