In the world of document automation, we often fall into the trap of testing
the how instead of the what. For years, office-stamper tests were heavily
focused on the underlying XML structure of WordProcessingML. We would assert
that a specific w:p (paragraph) contained a specific w:r (run), and that the
run had the correct w:t (text).
Here’s the stark reality: users don’t care about XML operations. They care whether the line that should have been duplicated was duplicated, and whether the table row marked for removal is actually gone.
As part of the ongoing evolution of office-stamper, I’ve been shifting our
testing strategy toward Characterization Testing (also known as Golden
Master testing). This journey began in early 2024 (see Unit Tests Don’t Mean
What You Think), when I first introduced the Stringifier.
Since then, the goal has evolved from merely “surviving XML churn” to testing high-level features holistically while making the tests themselves readable enough to serve as documentation. This strategy was the bedrock of our 2.0 Modular Reorg, allowing deep internal changes without regression.
The Technique Taxonomy: XML vs. Stringification
1. Low-Level XML Assertions (The Legacy)
Initially, tests used XPath or deep object traversal to verify output.
- Pros: Precise.
- Cons: Brittle. A minor refactor in how runs are split (common in Word) would break dozens of tests even if the visual output was identical.
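To make the brittleness concrete, here is a minimal sketch using only the JDK's XPath support. The XML fragments are simplified stand-ins (no namespaces, no w: prefix) and the class and method names are hypothetical, not taken from the office-stamper test suite. Both fragments would render the same text, yet the structure-coupled query returns different results.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class BrittleXPathDemo {
    // Same visible text, different run structure (simplified, namespace-free stand-ins).
    static final String ONE_RUN = "<p><r><t>Homer Simpson</t></r></p>";
    static final String TWO_RUNS = "<p><r><t>Homer </t></r><r><t>Simpson</t></r></p>";

    // Structure-coupled query: "the first run of the paragraph holds this text".
    static String firstRunText(String xml) {
        return evaluate("/p/r[1]/t", xml);
    }

    // Reader-level query: the concatenated text of the whole paragraph.
    static String visibleText(String xml) {
        return evaluate("string(/p)", xml);
    }

    private static String evaluate(String expression, String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Invalid XPath or XML: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println("[" + firstRunText(ONE_RUN) + "]");   // [Homer Simpson]
        System.out.println("[" + firstRunText(TWO_RUNS) + "]");  // [Homer ]
        // The text a reader sees is identical in both cases.
        System.out.println(visibleText(ONE_RUN).equals(visibleText(TWO_RUNS))); // true
    }
}
```

An assertion written against the first query breaks as soon as the run is split, even though nothing a reader can see has changed.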
2. Characterization via Stringification (The Future)
Instead of looking at XML, we turn the document into a simplified textual representation. We then compare this “stringified” document against a baseline.
- Pros: Holistic, readable, and captures the user’s perspective.
- Cons: Requires a carefully tuned “Stringifier” to avoid noise.
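As a rough illustration of the idea, the sketch below stringifies a toy document model and compares it with a baseline. The Paragraph record and the stringify method are hypothetical stand-ins, not the real Stringifier; the point is only that joining runs before comparing makes the run split invisible to the test.

```java
import java.util.List;
import java.util.stream.Collectors;

class CharacterizationDemo {
    // Hypothetical stand-in for a Word paragraph: a style name plus its runs of text.
    record Paragraph(String style, List<String> runs) {}

    // The baseline ("golden master") the output is compared against.
    static final String BASELINE = """
            [Normal] Homer Simpson
            [Normal] Patriarch""";

    // Stand-in stringifier: runs are joined, so how they were split never reaches the test.
    static String stringify(List<Paragraph> document) {
        return document.stream()
                .map(p -> "[%s] %s".formatted(p.style(), String.join("", p.runs())))
                .collect(Collectors.joining("\n"));
    }

    static List<Paragraph> asSingleRuns() {
        return List.of(
                new Paragraph("Normal", List.of("Homer Simpson")),
                new Paragraph("Normal", List.of("Patriarch")));
    }

    static List<Paragraph> asSplitRuns() {
        return List.of(
                new Paragraph("Normal", List.of("Homer ", "Simp", "son")),
                new Paragraph("Normal", List.of("Patri", "arch")));
    }

    public static void main(String[] args) {
        System.out.println(stringify(asSingleRuns()).equals(BASELINE)); // true
        System.out.println(stringify(asSplitRuns()).equals(BASELINE));  // true
    }
}
```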
The Stringifier Utility: Docs-as-Code for Tests
The heart of this new approach is a Stringifier utility. It traverses the
WordprocessingMLPackage and converts complex elements into a human-readable
format.
What makes it powerful is the Visibility Heuristic: If a normal user cannot see the difference when opening the document in Word, the difference shouldn’t exist in our test representation.
For example, we map Word styles to simple text markers:
    private Function<? super String, String> decorateWithStyle(String value) {
        return switch (value) {
            case "heading 1" -> "== %s\n"::formatted;
            case "heading 2" -> "=== %s\n"::formatted;
            case "caption" -> ".%s"::formatted;
            default -> "[%s] %%s".formatted(value)::formatted;
        };
    }
This turns a multi-megabyte XML structure into something like this in our test assertions:
== Simpson Family
[Normal] Homer Simpson
[Normal] Patriarch
[Normal] "D'oh!" is Homer's trademark exclamation.
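The mapping can be exercised on its own. The sketch below copies the method from the snippet above (made static so it runs standalone) and adds a hypothetical render helper that is not part of office-stamper. Note that the default branch first bakes the style name into the template, turning %%s into %s, and only then binds the paragraph text.

```java
import java.util.function.Function;

class StyleDecoratorDemo {
    // Copied from the article's snippet; static here so the sketch is self-contained.
    static Function<? super String, String> decorateWithStyle(String value) {
        return switch (value) {
            case "heading 1" -> "== %s\n"::formatted;
            case "heading 2" -> "=== %s\n"::formatted;
            case "caption" -> ".%s"::formatted;
            // "[%s] %%s".formatted("Normal") yields the template "[Normal] %s".
            default -> "[%s] %%s".formatted(value)::formatted;
        };
    }

    // Hypothetical helper: look up the decorator for a style and apply it to the text.
    static String render(String style, String text) {
        return decorateWithStyle(style).apply(text);
    }

    public static void main(String[] args) {
        String[][] paragraphs = {
                {"heading 1", "Simpson Family"},
                {"Normal", "Homer Simpson"},
                {"Normal", "Patriarch"},
        };
        for (String[] paragraph : paragraphs) {
            System.out.println(render(paragraph[0], paragraph[1]));
        }
    }
}
```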
Impact: Sped-up Development and Hidden Bug Discovery
Moving to large textual assertions in tests like ConditionalDisplayTest has
provided two major advantages, especially when paired with
our Declarative Testing
approach:
- Increased Velocity: When I refactored the internal configuration and registry system, I didn’t have to fix dozens of broken XML paths. By using makeResource to generate templates from text and Stringifier to verify the output, we’ve created a complete “text-to-text” pipeline. As long as the stringified output matched, I knew the feature was intact.
- Holistic Awareness: Because these tests watch the entire document, they catch regressions that targeted unit tests miss, like a paragraph accidentally losing its spacing or a footer being unintentionally stripped.
The Solo Maintainer’s Perspective: Managing Risk
Does this replace API testing? Absolutely not. Characterization tests are great for verifying behavior, but they won’t stop you from breaking your public API.
As a solo maintainer, I use a layered defense:
- core package isolation: The messy internals stay in core, which clients aren’t supposed to touch.
- api and preset packages: These are the stable extension points.
- CLI Client: I maintain a “vanilla” client in the cli module. If the CLI still works, the primary user path is safe.
Characterization tests sit alongside these, providing the confidence to refactor the “scary” parts of the engine without fear.
Pitfalls to Avoid
- Noise Pollution: If you include XML IDs or timestamps in your string representation, your tests will fail on every run. Apply the visibility heuristic ruthlessly.
- The “Over-Green” Trap: It is easy to just update the baseline when a test fails. Always review the diff to ensure the change in output is actually what you intended.
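To illustrate the noise problem, here is a small sketch that scrubs two common sources of churn from a textual representation: revision-save attributes (w:rsidR and its siblings) and ISO-style timestamps. It works on raw text purely for illustration; the class name and regular expressions are assumptions, and a stringifier that walks the object model would more likely just never emit these values in the first place.

```java
import java.util.regex.Pattern;

class NoiseFilterDemo {
    // Revision-save IDs (w:rsidR, w:rsidRDefault, ...) change between edits but are invisible to readers.
    private static final Pattern RSID = Pattern.compile("\\s+w:rsid\\w*=\"[^\"]*\"");
    // UTC timestamps such as 2024-03-01T10:15:30Z differ on every run.
    private static final Pattern TIMESTAMP =
            Pattern.compile("\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}Z");

    // Drops the IDs entirely and replaces timestamps with a stable placeholder.
    static String normalize(String raw) {
        String withoutIds = RSID.matcher(raw).replaceAll("");
        return TIMESTAMP.matcher(withoutIds).replaceAll("<timestamp>");
    }

    public static void main(String[] args) {
        System.out.println(normalize("<w:p w:rsidR=\"00A1B2C3\" w:rsidRDefault=\"00D4E5F6\">")); // <w:p>
        System.out.println(normalize("saved 2024-03-01T10:15:30Z")); // saved <timestamp>
    }
}
```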
Checklist — Implementing Characterization Tests
- Define Visibility: Decide which formatting elements (bold, headers, lists) actually matter to your users.
- Automate Stringification: Build a utility that converts your output format to a stable, text-based representation.
- Establish Baselines: Run your existing, trusted code to generate “Golden Masters.”
- Review Diffs: Treat test failures as a conversation. “Is this change in the document intentional?”
- Isolate the Core: Ensure your high-level tests aren’t coupled to the same interfaces you’re trying to refactor.
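The baseline and review steps could be sketched as follows, assuming baselines are stored as files on disk, which is not necessarily how office-stamper keeps them. The GoldenMaster class and its approve flag are hypothetical. The design choice worth noting is that an existing baseline is only overwritten when the caller asks for it explicitly, which keeps the “Over-Green” shortcut a deliberate act.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

class GoldenMaster {
    // Compares actual output with the stored baseline. A missing baseline is recorded
    // on the first run; an existing one is overwritten only when approve is true.
    static boolean matches(Path baseline, String actual, boolean approve) {
        try {
            if (approve || Files.notExists(baseline)) {
                Files.writeString(baseline, actual);
                return true;
            }
            return Files.readString(baseline).equals(actual);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Walks through the three situations: first run, unintended drift, deliberate approval.
    static String demo() {
        try {
            Path baseline = Files.createTempDirectory("golden").resolve("simpsons.txt");
            boolean recorded = matches(baseline, "[Normal] Homer Simpson", false);
            boolean drifted = matches(baseline, "[Normal] Homer J. Simpson", false);
            boolean approved = matches(baseline, "[Normal] Homer J. Simpson", true);
            return recorded + " " + drifted + " " + approved;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // true false true
    }
}
```

Recording a missing baseline silently is convenient for a solo project; a stricter variant could fail on the first run so that every new baseline gets reviewed before it is trusted.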
By focusing on what the user sees, we make our tests more resilient, our documentation more accurate, and our maintenance more sustainable.