spaCy Industrial NLP Library Cheat Sheet

Updated 2026-05-02

spaCy is a free, open-source library for industrial-strength Natural Language Processing, written in Python and Cython and designed specifically for production use with state-of-the-art speed and accuracy. Unlike research-oriented libraries, spaCy emphasizes practical deployment: support for 75+ languages (with pre-trained pipelines available for a subset of them), efficient batch processing via streaming APIs such as nlp.pipe, and a clean, consistent interface. The library's architecture centers on the processing pipeline, a sequence of components (tokenizer, tagger, parser, NER) that transforms raw text into rich linguistic annotations stored in a Doc object. The Doc never alters the original text, and its tokenization changes only through explicit retokenization. One key insight: spaCy encodes all strings as hash values in a string store shared through the Vocab, which keeps the representation memory-efficient while maintaining fast lookups. This design choice permeates the entire system and explains why most attributes come in two forms: an integer hash ID and a string property with a trailing underscore (e.g., token.lemma vs token.lemma_).
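The hash-versus-string pairing described above can be sketched as follows. This is a minimal illustration, assuming spaCy v3 is installed; it uses a blank English pipeline (tokenizer only) so that no trained model needs to be downloaded, and it therefore demonstrates the pattern with token.orth / token.orth_ rather than token.lemma / token.lemma_, which follow the same convention once a pipeline with a lemmatizer is loaded. The variable names are illustrative only.

```python
import spacy

# Blank pipeline: tokenizer only, no trained components (an assumption made
# here so the sketch runs without downloading a model).
nlp = spacy.blank("en")
doc = nlp("I love coffee")
token = doc[2]

# Every string attribute has two forms: an integer hash and a readable string.
orth_id = token.orth      # integer hash of the token text
orth_text = token.orth_   # the token text itself

# The StringStore on the shared Vocab maps in both directions.
strings = nlp.vocab.strings
hash_from_text = strings[orth_text]   # string -> hash
text_from_hash = strings[orth_id]     # hash -> string

print(orth_text, orth_id)
print(hash_from_text == orth_id, text_from_hash == orth_text)
```

Looking up a hash only works if the string has already been seen by that Vocab, which is why the lookup here goes through the same nlp object that processed the text.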

What This Cheat Sheet Covers

This topic spans 15 focused tables and 78 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Core Container Objects
Table 2: Pipeline Components
Table 3: Rule-Based Matching
Table 4: Pipeline Management
Table 5: Training and Configuration
Table 6: Model Loading and Packaging
Table 7: Custom Extensions
Table 8: Linguistic Attributes
Table 9: Word Vectors and Similarity
Table 10: Serialization and Storage
Table 11: Visualization
Table 12: Retokenization
Table 13: Multi-Language Support
Table 14: Performance Optimization
Table 15: Utility Functions

Table 1: Core Container Objects

These are the data structures every spaCy program revolves around: once nlp(text) runs, the linguistic annotations all live inside these objects. The hierarchy is what matters here. A Doc holds Tokens, and a Span is a slice of that Doc. Underneath both sits the Vocab, which owns the shared string store and the context-free Lexemes that keep everything memory-efficient.

Object	Example	Description
Doc	`doc = nlp("Hello world")` `tokens = [t for t in doc]`	• Container for processed text, holding a sequence of Token objects • The original text is never altered, and tokenization changes only through explicit retokenization; gives access to linguistic annotations such as entities, sentences, and noun chunks
Token	`token = doc[0]` `print(token.text, token.pos_)`	• Individual word or punctuation unit with linguistic attributes such as POS tags, lemmas, dependencies, and entity labels • A view into the Doc's data rather than an independent copy
Span	`span = doc[2:5]` `print(span.text)`	• Slice of a Doc representing a contiguous sequence of tokens • Retains the full document context and can be converted to a standalone Doc with `span.as_doc()`
Vocab	`nlp.vocab.strings.add("custom")` `hash_val = nlp.vocab.strings["word"]`	• Shared vocabulary whose string store maps every string to a hash value for memory efficiency • Contains Lexeme objects, which represent word types without context
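The relationships in the table can be checked with a short sketch. It assumes spaCy v3 is installed and again uses a blank English pipeline, so attributes that need a trained model (such as token.pos_) are left out; the sentence and variable names are illustrative only.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only; no trained model required
doc = nlp("Apple is looking at buying a startup")

# Doc: a sequence of tokens.
token_texts = [t.text for t in doc]

# Token: a view into the Doc, so it points back to its parent.
token = doc[0]
token_parent_is_doc = token.doc is doc

# Span: a slice of the Doc that keeps its position in the document.
span = doc[2:5]
span_text = span.text
span_bounds = (span.start, span.end)

# as_doc() copies the span into a new, standalone Doc.
span_doc = span.as_doc()

# Vocab: context-free Lexemes, keyed by string or by hash.
lexeme = nlp.vocab["startup"]
lexeme_hash_matches = lexeme.orth == nlp.vocab.strings["startup"]

print(token_texts)
print(span_text, span_bounds, len(span_doc))
```

Because a Span is only a view, it stays cheap to create; as_doc() is the point at which data is actually copied.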
