In the world of “docs-as-code,” we often treat Word documents as a commodity. We assume that as long as we follow the WordprocessingML (OOXML) specification, our templates will be portable across any editor.
This month, a deep dive into Issue #539 reminded me that the “standard” is only as good as the tools we use to interpret it. What began as a bug report about a broken Table of Contents (TOC) evolved into a cleanup of our SdtRun processing and a lesson in the hidden ambiguities of library abstractions.
The Discovery: When a Text is not a Text
The root cause of the TOC corruption was a subtle blind spot in how office-stamper interacted with Docx4j. In OOXML, normal text is stored in w:t elements. However, the instructions of complex fields, such as a TOC or a page number, are stored in w:instrText, while the displayed result of the field still lives in ordinary w:t elements.
The “Aha!” moment came when I realized that Docx4j uses the same Java class (org.docx4j.wml.Text) to represent both. Microsoft Word is “smart” enough to guess the function based on the XML context. But because office-stamper trusted the class type alone to identify elements, it was treating instruction text as regular content. This led to TOC fields being partially overwritten or corrupted during the stamping process.
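To make the distinction concrete, here is a minimal sketch that reads a TOC-style complex field with the JDK's DOM parser instead of Docx4j. The class name TextKinds and the sample paragraph are invented for illustration and are not office-stamper code. It only shows that the element name, not the Java type of the text node, tells the instruction apart from the displayed text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TextKinds {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical sample: begin, instruction, separate, displayed result, end.
    static final String SAMPLE =
        "<w:p xmlns:w='" + W + "'>"
        + "<w:r><w:fldChar w:fldCharType='begin'/></w:r>"
        + "<w:r><w:instrText xml:space='preserve'> TOC \\o \"1-3\" </w:instrText></w:r>"
        + "<w:r><w:fldChar w:fldCharType='separate'/></w:r>"
        + "<w:r><w:t>Introduction</w:t></w:r>"
        + "<w:r><w:fldChar w:fldCharType='end'/></w:r>"
        + "</w:p>";

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed to match on namespace + local name
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("sample XML should be well-formed", e);
        }
    }

    // Collects the text of every element with the given local name (e.g. "t" or "instrText").
    static List<String> textOf(Document doc, String localName) {
        NodeList nodes = doc.getElementsByTagNameNS(W, localName);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getTextContent());
        }
        return out;
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        System.out.println("displayed:   " + textOf(doc, "t"));
        System.out.println("instruction: " + textOf(doc, "instrText"));
    }
}
```

Both lists hold plain strings, which is the trap: once the element name is dropped, nothing distinguishes the field code from the text a reader sees.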
What Changed in June
1. Context-Aware Text Handling
We no longer rely on the class type alone. The engine now distinguishes instrText from regular text and treats each accordingly. This ensures that field instructions (the “code” behind the TOC) are preserved exactly as they are, while the displayed values are updated correctly.
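The idea can be sketched as a transformation that only ever touches w:t elements. This is an illustration under assumptions, not the engine's actual code: the class ContextAwareReplace, its method names, and the sample paragraph are hypothetical, and the sketch uses the JDK's DOM API rather than Docx4j.

```java
import java.io.StringReader;
import java.util.function.UnaryOperator;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ContextAwareReplace {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical sample: one field instruction and one displayed result.
    static final String SAMPLE =
        "<w:p xmlns:w='" + W + "'>"
        + "<w:r><w:fldChar w:fldCharType='begin'/></w:r>"
        + "<w:r><w:instrText xml:space='preserve'> TOC \\o \"1-3\" </w:instrText></w:r>"
        + "<w:r><w:fldChar w:fldCharType='separate'/></w:r>"
        + "<w:r><w:t>Introduction</w:t></w:r>"
        + "<w:r><w:fldChar w:fldCharType='end'/></w:r>"
        + "</w:p>";

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("sample XML should be well-formed", e);
        }
    }

    // Applies the transform to w:t only; w:instrText is never visited.
    // Returns how many w:t elements actually changed.
    static int transformDisplayedText(Document doc, UnaryOperator<String> transform) {
        NodeList displayed = doc.getElementsByTagNameNS(W, "t");
        int changed = 0;
        for (int i = 0; i < displayed.getLength(); i++) {
            Node node = displayed.item(i);
            String before = node.getTextContent();
            String after = transform.apply(before);
            if (!after.equals(before)) {
                node.setTextContent(after);
                changed++;
            }
        }
        return changed;
    }

    // Concatenated text of every element with the given local name.
    static String textOf(Document doc, String localName) {
        NodeList nodes = doc.getElementsByTagNameNS(W, localName);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < nodes.getLength(); i++) {
            sb.append(nodes.item(i).getTextContent());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        transformDisplayedText(doc, String::toUpperCase);
        System.out.println("displayed:   " + textOf(doc, "t"));
        System.out.println("instruction: " + textOf(doc, "instrText"));
    }
}
```

Selecting by element name up front means the field code cannot be rewritten by accident, whatever the transform does to the visible text.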
2. Google Docs Compatibility (The SdtRun Cleanup)
While resolving the TOC issue, I discovered that documents generated by Google Docs use SdtRun elements (inline Structured Document Tags) much more extensively than those from Microsoft Word. These elements often wrap conditional blocks or repeated paragraphs in ways the engine didn’t fully account for.
I’ve refactored SdtRun processing to be more robust, including:
- Precise element removal: Enhanced WmlUtils to handle child removal specifically from SdtRun parents.
- Comment Deletion: Added support for deleting comments (used as metadata) from within SdtRun elements.
- Improved Search: Optimized DocumentUtil to navigate through SdtRun containers during depth-first searches.
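The traversal and removal ideas can be sketched on a toy tree. Everything here is a hypothetical stand-in: Node, SdtTraversal, and the kind strings are invented for illustration and are not Docx4j or office-stamper types. The point is only that a search has to descend into the SdtRun container, and that removal has to happen on whichever parent actually holds the child.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class SdtTraversal {
    // Hypothetical stand-in for a document node; not a Docx4j type.
    static class Node {
        final String kind;                       // "p", "sdtRun", or "r"
        final String text;                       // only meaningful for runs
        final List<Node> children = new ArrayList<>();

        Node(String kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    static Node run(String text) {
        return new Node("r", text);
    }

    static Node container(String kind, Node... kids) {
        Node node = new Node(kind, null);
        for (Node kid : kids) {
            node.children.add(kid);
        }
        return node;
    }

    // Depth-first collection of run texts in document order, descending into every container.
    static List<String> runTexts(Node root) {
        List<String> out = new ArrayList<>();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node node = stack.pop();
            if (node.kind.equals("r")) {
                out.add(node.text);
            }
            // Push children in reverse so they are popped in document order.
            for (int i = node.children.size() - 1; i >= 0; i--) {
                stack.push(node.children.get(i));
            }
        }
        return out;
    }

    // Removes target (matched by identity) from whichever parent holds it; true if found.
    static boolean remove(Node parent, Node target) {
        for (int i = 0; i < parent.children.size(); i++) {
            Node child = parent.children.get(i);
            if (child == target) {
                parent.children.remove(i);
                return true;
            }
            if (remove(child, target)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Node conditional = run("conditional");
        Node paragraph = container("p", run("before"), container("sdtRun", conditional), run("after"));
        System.out.println(runTexts(paragraph));
        remove(paragraph, conditional);
        System.out.println(runTexts(paragraph));
    }
}
```

A search that stopped at the paragraph's direct children would never see the wrapped run, and a removal that assumed the paragraph was the parent would silently do nothing.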
3. Stringifier Evolution: FldChar and Hyperlinks
Our Characterization Testing pipeline got an upgrade. The Stringifier utility now includes:
- FldChar: Exposing field boundaries in test assertions.
- Hyperlink: Representing links within the TOC.
This means that whenever I touch the core engine, the tests will immediately flag if a TOC link is broken or if a field character is misplaced—even if the document still “looks” fine in a basic viewer.
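As a rough illustration of why this helps, here is a sketch of a stringifier that renders field characters and hyperlinks as visible markers. The bracketed marker format, the class FieldStringifier, and the sample paragraph are assumptions made for this example. The real Stringifier output format is not shown in this post and may look quite different.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class FieldStringifier {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical TOC entry: a complex field whose result is a hyperlink to a bookmark.
    static final String SAMPLE =
        "<w:p xmlns:w='" + W + "'>"
        + "<w:r><w:fldChar w:fldCharType='begin'/></w:r>"
        + "<w:r><w:instrText xml:space='preserve'> TOC \\o \"1-3\" </w:instrText></w:r>"
        + "<w:r><w:fldChar w:fldCharType='separate'/></w:r>"
        + "<w:hyperlink w:anchor='_Toc1'><w:r><w:t>Introduction</w:t></w:r></w:hyperlink>"
        + "<w:r><w:fldChar w:fldCharType='end'/></w:r>"
        + "</w:p>";

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("sample XML should be well-formed", e);
        }
    }

    static String stringify(Node node) {
        StringBuilder sb = new StringBuilder();
        append(node, sb);
        return sb.toString();
    }

    private static void append(Node node, StringBuilder sb) {
        if (node.getNodeType() == Node.ELEMENT_NODE && W.equals(node.getNamespaceURI())) {
            Element e = (Element) node;
            switch (e.getLocalName()) {
                case "fldChar":   // field boundary: begin, separate, or end
                    sb.append("[fldchar ").append(e.getAttributeNS(W, "fldCharType")).append("]");
                    return;
                case "instrText": // field code, kept apart from visible text
                    sb.append("[instr ").append(e.getTextContent().trim()).append("]");
                    return;
                case "t":         // visible text
                    sb.append(e.getTextContent());
                    return;
                case "hyperlink": // wrap the link's content in open/close markers
                    sb.append("[link ").append(e.getAttributeNS(W, "anchor")).append("]");
                    appendChildren(e, sb);
                    sb.append("[/link]");
                    return;
                default:
                    break;        // other elements are transparent; fall through to children
            }
        }
        appendChildren(node, sb);
    }

    private static void appendChildren(Node node, StringBuilder sb) {
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            append(child, sb);
        }
    }

    public static void main(String[] args) {
        System.out.println(stringify(parse(SAMPLE)));
    }
}
```

Comparing this string against a stored expectation catches a missing end marker or a dropped link anchor, neither of which changes the visible text "Introduction".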
Agility and Craftsmanship Impact
As a solo maintainer, I cannot afford to manually test every template in both Word and Google Docs. These changes reflect a core principle of this project: Correctness must be built into the pipeline.
- Reduced Brittleness: By cleaning up SdtRun and acknowledging the instrText ambiguity, we’ve removed a class of bugs that felt like “magic” regressions.
- Confidence in Refactoring: The updated Stringifier provides a “Golden String” assertion that covers the entire document structure, not just the visible text.
Impact and Next Steps
The immediate impact is high-fidelity document generation that survives the round-trip between different editors. Whether your team uses Google Docs for collaboration or Word for final polish, the TOC and conditional logic will now remain intact.
Next Steps: I’m looking into further automating the verification of “field results” (the numbers in the TOC). While we now validate the structure, ensuring the page numbers are perfectly updated without a manual “Update Field” click in Word remains the next frontier.
July 1st 2025 — Commit Summary:
- Feat: Comprehensive TOC validation in regression tests.
- Feat: Handle FldChar and Hyperlink in stringification.
- Refactor: Clean up SdtRun parent/child handling for Google Docs compatibility.
- Chore: Update pitest to 1.20.0 and junit to 5.13.2.