spaCy is a free, open-source library for industrial-strength Natural Language Processing in Python and Cython, designed for production use with a focus on speed and accuracy. Unlike research-oriented libraries, spaCy emphasizes practical deployment: support for 75+ languages (with trained pipelines available for a subset of them), efficient batch processing via streaming APIs such as nlp.pipe, and a clean, consistent interface. The library's architecture centers on the processing pipeline, a sequence of components (tokenizer, tagger, parser, NER) that turn raw text into rich linguistic annotations stored in Doc objects. Tokenization is non-destructive, so the original text is always recoverable from the Doc. One key insight: spaCy encodes all strings as hash values in a shared Vocab, which keeps the representation memory-efficient while maintaining fast lookups. This design choice permeates the entire system and explains why each attribute comes in two forms: the hashed integer ID and the string property with a trailing underscore (e.g., token.lemma vs token.lemma_).
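The hash-versus-string duality and the streaming API can be seen in a few lines. This is a minimal sketch, assuming a blank English pipeline (tokenizer only) so that no trained model has to be downloaded. Because a blank pipeline has no lemmatizer, it uses token.orth / token.orth_ instead of token.lemma / token.lemma_; the pattern is the same for every attribute pair.

```python
import spacy

# Assumption: a blank pipeline is enough here; a trained pipeline such as
# "en_core_web_sm" would additionally fill in lemmas, POS tags, and entities.
nlp = spacy.blank("en")

doc = nlp("Hello world")
token = doc[0]

# Attribute without underscore: integer hash. With underscore: the string.
hash_id = token.orth
text = token.orth_
print(hash_id, text)

# The shared StringStore maps in both directions.
print(nlp.vocab.strings[hash_id])   # hash -> string
print(nlp.vocab.strings[text])      # string -> hash

# nlp.pipe streams an iterable of texts through the pipeline in batches.
texts = ["First text.", "Second text."]
docs = list(nlp.pipe(texts, batch_size=2))
print([d.text for d in docs])
```

Comparing integers instead of strings is what makes attribute checks cheap inside tight loops; the underscore variants are mainly there for readability and debugging.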
What This Cheat Sheet Covers
This topic spans 15 focused tables and 78 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Core Container Objects
These are the data structures every spaCy program revolves around: once nlp(text) runs, the linguistic annotations all live inside these objects. The hierarchy is what matters here. A Doc holds Tokens, a Span is a slice of that Doc, and underneath both sits the Vocab, the shared store of strings and context-free Lexemes that makes everything memory-efficient.
| Object | Example | Description |
|---|---|---|
| Doc | `doc = nlp("Hello world")`; `tokens = [t for t in doc]` | Container for processed text holding an array of Token objects. Provides access to linguistic annotations like entities, sentences, and noun chunks. The original text is preserved (tokenization is non-destructive), though annotations can still be set after creation. |
| Token | `token = doc[0]`; `print(token.text, token.pos_)` | Individual word or punctuation unit with linguistic attributes like POS tags, lemmas, dependencies, and entity labels. A view into the Doc's data rather than an independent copy. |
| Span | `span = doc[2:5]`; `print(span.text)` | Slice of a Doc object representing a contiguous sequence of tokens (here, tokens 2 through 4 of a longer doc). Retains all document context and can be converted to a standalone Doc using `span.as_doc()`. |
| Vocab | `nlp.vocab.strings.add("custom")`; `hash_val = nlp.vocab.strings["word"]` | Shared vocabulary and lookup table that stores all strings as hash values for memory efficiency. Contains Lexeme objects representing word types without context. |
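
The relationships in the table can be checked directly. The following is a small sketch, again assuming a blank English pipeline (so token.pos_ and similar annotations would be empty); the example sentence is arbitrary.

```python
import spacy

# Assumption: tokenizer-only pipeline, no trained model required.
nlp = spacy.blank("en")
doc = nlp("The quick brown fox jumps over the lazy dog")

token = doc[3]   # Token: a view into the Doc at index 3
span = doc[2:5]  # Span: tokens 2, 3 and 4 of the same Doc

print(token.text)            # fox
print(span.text)             # brown fox jumps
print(token.doc is doc)      # True: the token points back at its parent Doc
print(span.start, span.end)  # 2 5

# as_doc() copies the span into an independent Doc.
standalone = span.as_doc()
print(len(standalone))       # 3

# Lexeme: the context-free entry for a word type in the shared Vocab.
lexeme = nlp.vocab["fox"]
print(lexeme.text, lexeme.is_alpha)
print(lexeme.orth == token.orth)  # same hash as the token's text
```

Because Token and Span are views, they stay cheap to create; `as_doc()` is the step that actually copies data, so it is worth reserving for cases where a standalone Doc is needed.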