String Processing and Text Manipulation Cheat Sheet

Updated 2026-05-16

String processing underlies most software, from basic concatenation to Unicode handling across dozens of writing systems. Whether you're parsing log files, validating user input, or building natural language processing pipelines, your choice of encoding, regex engine, and string-building strategy affects both correctness and how well the application holds up under production load. This cheat sheet covers the path from simple string operations to production-ready text handling: character encoding (UTF-8 vs UTF-16), regex engines (PCRE2 vs RE2), parser design, localization strategies, and memory optimization techniques.

What This Cheat Sheet Covers

This topic spans 14 focused tables and 119 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Character Encoding Fundamentals
Table 2: String Interpolation Across Languages
Table 3: Regular Expression Engines
Table 4: Regex Advanced Features
Table 5: Parsing and Tokenization
Table 6: String Builders and Mutability
Table 7: Common String Operations
Table 8: String Formatting and Templates
Table 9: Internationalization (i18n) and Localization (l10n)
Table 10: String Security and Sanitization
Table 11: String Performance Optimization
Table 12: Substring Search Algorithms
Table 13: Escape Sequences and Special Characters
Table 14: Modern String APIs and Features

Table 1: Character Encoding Fundamentals

Before you can manipulate text safely, you need to know how characters actually live in memory—and that's where most bugs are born. These entries cover the bridge between abstract Unicode code points and the concrete bytes on disk: the UTF-8/16/32 encodings, surrogate pairs, normalization forms, and the invisible characters (ZWJ, BOM) that make a single user-perceived "character" sometimes span several code points.

Concept	Example	Description
UTF-8	`"Hello" → 48 65 6C 6C 6F` `"€" → E2 82 AC (3 bytes)`	• Variable-width encoding using 1–4 bytes per code point • backward-compatible with ASCII for first 128 characters • most widely used on web and Unix systems
UTF-16	`"Hello" → 0048 0065 006C 006C 006F` `"𝕳" → D835 DD73 (surrogate pair)`	• Uses 2 or 4 bytes per code point • requires surrogate pairs (code units in U+D800–U+DFFF) for characters beyond the Basic Multilingual Plane • dominant in Windows, Java, JavaScript internals
UTF-32	`"A" → 00 00 00 41 (4 bytes always, big-endian shown)`	• Fixed-width encoding using exactly 4 bytes per code point • simplifies indexing but wastes memory for most text • rarely used outside specialized applications
Byte Order Mark (BOM)	UTF-8: `EF BB BF` UTF-16 LE: `FF FE` UTF-16 BE: `FE FF`	• Special sequence at the start of a file or stream that signals encoding and endianness • optional in UTF-8 (often omitted on Unix) • not strictly required in UTF-16/32, but without it the byte order must be declared some other way (for example an explicit UTF-16LE or UTF-16BE label)
Code Point	`U+0041` (A) `U+1F600` (😀)	• Unique number assigned to each character in Unicode • range U+0000 to U+10FFFF • distinct from a code unit, the fixed-size unit an encoding is built from (8 bits in UTF-8, 16 in UTF-16, 32 in UTF-32)
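
The byte sequences in the table can be checked directly. The Python sketch below is an illustration only: the helper names `hex_bytes`, `surrogate_pair`, and `sniff_bom` are hypothetical, and it relies solely on the standard library's `str.encode` and the BOM constants in `codecs`. It assumes Python 3.8 or later, since `bytes.hex` accepts a separator only from that version.

```python
import codecs
from typing import Optional, Tuple


def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and return its bytes as space-separated uppercase hex."""
    return text.encode(encoding).hex(" ").upper()


def surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into its UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF use surrogates")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the 20-bit offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


# UTF-32 signatures are checked first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if absent."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


if __name__ == "__main__":
    print(hex_bytes("Hello", "utf-8"))           # 48 65 6C 6C 6F
    print(hex_bytes("€", "utf-8"))               # E2 82 AC
    print(hex_bytes("A", "utf-32-be"))           # 00 00 00 41
    print(hex_bytes("\U0001D573", "utf-16-be"))  # D8 35 DD 73
    high, low = surrogate_pair(0x1F600)
    print(f"{high:04X} {low:04X}")               # D83D DE00
    print(sniff_bom(b"\xff\xfeh\x00i\x00"))      # utf-16-le
```

A `None` result from `sniff_bom` only means no BOM was found; it says nothing about the actual encoding, which still has to come from metadata or a default.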
