String processing underpins most software, from basic concatenation to Unicode handling across dozens of writing systems. Whether you're parsing log files, validating user input, or building natural language processing pipelines, your choices of encoding scheme, regex engine, and memory strategy largely determine whether an application holds up under production load. This cheat sheet bridges the gap between simple string operations and production-ready text handling: character encoding (UTF-8 vs UTF-16), regex engines (PCRE2 vs RE2), parser design, localization strategies, and memory optimization techniques.
What This Cheat Sheet Covers
This topic spans 14 focused tables and 119 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.
Table 1: Character Encoding Fundamentals
| Concept | Example | Description |
|---|---|---|
| UTF-8 | "Hello" → 48 65 6C 6C 6F; "€" → E2 82 AC (3 bytes) | • Variable-width encoding using 1–4 bytes per code point • Backward-compatible with ASCII for the first 128 characters • Most widely used on the web and Unix systems |
| UTF-16 | "Hello" → 0048 0065 006C 006C 006F; "𝕳" → D835 DD73 (surrogate pair) | • Uses 2 or 4 bytes per code point • Requires surrogate pairs (code units in U+D800–U+DFFF) for characters beyond the Basic Multilingual Plane • Dominant in Windows, Java, and JavaScript internals |
| UTF-32 | "A" → 00 00 00 41 (big-endian, 4 bytes always) | • Fixed-width encoding using exactly 4 bytes per code point • Simplifies indexing but wastes memory for most text • Rarely used outside specialized applications |
| BOM (byte order mark) | UTF-8: EF BB BF; UTF-16 LE: FF FE; UTF-16 BE: FE FF | • Special sequence at the start of a file indicating encoding and endianness • Optional in UTF-8 (often omitted on Unix) • Used in UTF-16/UTF-32 to distinguish little-endian from big-endian when the byte order is not declared some other way |
| Code point | U+0041 (A); U+1F600 (😀) | • Unique number assigned to each character in Unicode • Range U+0000 to U+10FFFF • Distinct from a code unit, the fixed-size unit an encoding works in (8 bits in UTF-8, 16 in UTF-16, 32 in UTF-32) |
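
The byte sequences in Table 1 can be reproduced with a short Python sketch. It relies only on the standard library; the helper names `hex_bytes`, `surrogate_pair`, and `sniff_bom` are hypothetical and exist only for this illustration, and the BOM check is a minimal heuristic rather than a complete encoding detector.

```python
import codecs
from typing import Optional, Tuple


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return data.hex(" ").upper()


def surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into its UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000
    # The top 10 bits go into the high surrogate, the bottom 10 into the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    # UTF-32 LE begins with FF FE 00 00, so it is tested before UTF-16 LE (FF FE).
    candidates = (
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    )
    for bom, name in candidates:
        if data.startswith(bom):
            return name
    return None


fraktur_h = "\U0001D573"  # 𝕳, outside the Basic Multilingual Plane
grinning = "\U0001F600"   # 😀

print(hex_bytes("Hello".encode("utf-8")))      # 48 65 6C 6C 6F
print(hex_bytes("€".encode("utf-8")))          # E2 82 AC
print(hex_bytes(fraktur_h.encode("utf-16-be")))  # D8 35 DD 73
print(hex_bytes("A".encode("utf-32-be")))      # 00 00 00 41

# One code point, but four UTF-8 bytes (code units).
print(len(grinning), len(grinning.encode("utf-8")))  # 1 4

high, low = surrogate_pair(ord(fraktur_h))
print(f"{high:04X} {low:04X}")                 # D835 DD73

print(sniff_bom(codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le")))  # utf-16-le
print(sniff_bom(b"plain ASCII"))               # None
```

Note that `sniff_bom` only reports what the leading bytes suggest. A caller would still need to skip the BOM bytes before decoding, and a missing BOM says nothing definite about the encoding.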