In the world of “docs-as-code,” we often treat Word documents as a commodity. We assume that as long as we follow the WordprocessingML (OOXML) specification, our templates will be portable across any editor.
This month, a deep dive into Issue #539 reminded me that
the “standard” is only as good as the tools we use to interpret it. What began
as a bug report about a broken Table of Contents (TOC) evolved into a cleanup of
our SdtRun processing and a lesson in the hidden ambiguities of library
abstractions.
The Discovery: When a Text is not a Text
The root cause of the TOC corruption was a subtle blind spot in how
office-stamper interacted with Docx4j. In OOXML, normal text is stored in
w:t elements. However, complex fields like TOC instructions or page numbers
use w:instrText.
The “Aha!” moment came when I realized that Docx4j uses the **same Java class**
(org.docx4j.wml.Text) to represent both. Microsoft Word is “smart” enough
to guess the function based on the XML context. But because office-stamper
trusted the class type alone to identify elements, it was treating instruction
text as regular content. This led to TOC fields being partially overwritten or
corrupted during the stamping process.
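To make the distinction concrete, here is a minimal sketch that uses only the JDK's DOM parser, not Docx4j. The sample paragraph is a simplified, hand-written TOC field, and the `TextKinds` class and its `describe` helper are hypothetical names for illustration. It shows that the two kinds of text are distinguishable by element name even when the content in both cases is just a string.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class TextKinds {
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Simplified TOC field: begin, instruction, separate, displayed result, end.
    static final String SAMPLE =
        "<w:p xmlns:w=\"" + W_NS + "\">"
      + "<w:r><w:fldChar w:fldCharType=\"begin\"/></w:r>"
      + "<w:r><w:instrText xml:space=\"preserve\"> TOC \\o \"1-3\" \\h </w:instrText></w:r>"
      + "<w:r><w:fldChar w:fldCharType=\"separate\"/></w:r>"
      + "<w:r><w:t>Introduction</w:t></w:r>"
      + "<w:r><w:fldChar w:fldCharType=\"end\"/></w:r>"
      + "</w:p>";

    /** Lists every text-bearing element as "localName=content", in document order. */
    static List<String> describe(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed to resolve the w: prefix
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            List<String> out = new ArrayList<>();
            NodeList all = doc.getElementsByTagNameNS(W_NS, "*");
            for (int i = 0; i < all.getLength(); i++) {
                Element e = (Element) all.item(i);
                String name = e.getLocalName();
                // Both elements carry plain character data; only the name tells them apart.
                if (name.equals("t") || name.equals("instrText")) {
                    out.add(name + "=" + e.getTextContent().trim());
                }
            }
            return out;
        } catch (Exception ex) {
            throw new IllegalStateException("Could not parse sample XML", ex);
        }
    }
}
```

Calling `TextKinds.describe(TextKinds.SAMPLE)` yields one `instrText` entry for the field instruction and one `t` entry for the displayed heading. A check based on the Java type of the content alone would see two identical-looking text objects.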
What Changed in June
1. Context-Aware Text Handling
We no longer rely on the class type alone. The engine now distinguishes
instrText from regular text and treats each accordingly. This ensures that
field instructions (the “code” behind the TOC) are preserved exactly as they
should be, while the displayed values are updated correctly.
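The idea can be sketched as follows. This is not office-stamper's actual code: `NamedText`, `isDisplayText`, and `stamp` are hypothetical stand-ins, the sketch assumes the XML element name travels alongside the text value, and the `${...}` placeholder syntax and the instruction containing a placeholder are contrived for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import javax.xml.namespace.QName;

final class ContextAwareStamping {
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final QName T = new QName(W_NS, "t");
    static final QName INSTR_TEXT = new QName(W_NS, "instrText");

    /** Hypothetical stand-in: one value class for both kinds, told apart only by element name. */
    static final class NamedText {
        final QName name;
        final String value;

        NamedText(QName name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    /** True only for w:t; field instructions (w:instrText) are not display text. */
    static boolean isDisplayText(NamedText text) {
        return T.equals(text.name);
    }

    /** Replaces ${key} placeholders in display text and passes everything else through untouched. */
    static List<NamedText> stamp(List<NamedText> texts, Map<String, String> values) {
        List<NamedText> out = new ArrayList<>();
        for (NamedText text : texts) {
            if (!isDisplayText(text)) {
                out.add(text); // preserve field instructions exactly as they are
                continue;
            }
            String replaced = text.value;
            for (Map.Entry<String, String> entry : values.entrySet()) {
                replaced = replaced.replace("${" + entry.getKey() + "}", entry.getValue());
            }
            out.add(new NamedText(text.name, replaced));
        }
        return out;
    }
}
```

The design choice worth noting is that the decision hangs on the element name rather than the Java type, so an instruction such as ` REF ${title} \h ` comes out unchanged while `Hello ${title}` in a w:t element is resolved.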
2. Google Docs Compatibility (The SdtRun Cleanup)
While resolving the TOC issue, I discovered that documents generated by Google Docs
use SdtRun (Structured Document Tags) much more extensively than Microsoft
Word. These elements often wrap conditional blocks or repeated paragraphs in
ways the engine didn’t fully account for.
I’ve refactored SdtRun processing to be more robust, including:
- Precise element removal: Enhanced WmlUtils to handle child removal specifically from SdtRun parents.
- Comment Deletion: Added support for deleting comments (used as metadata) from within SdtRun elements.
- Improved Search: Optimized DocumentUtil to navigate through SdtRun containers during depth-first searches.
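The traversal and removal ideas above can be sketched with a tiny stand-in model. `Run`, `Container`, `findRuns`, and `remove` are hypothetical names invented for this illustration, and they do not mirror the real WmlUtils or DocumentUtil signatures. The point is only that a search or a removal must descend into wrapper containers instead of assuming runs sit directly under the paragraph.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SdtRunTraversal {
    /** Hypothetical stand-in for a leaf run carrying text. */
    static final class Run {
        final String text;

        Run(String text) {
            this.text = text;
        }
    }

    /** Hypothetical stand-in for a paragraph or SdtRun: holds runs and nested containers. */
    static final class Container {
        final String kind;
        final List<Object> content = new ArrayList<>();

        Container(String kind, Object... children) {
            this.kind = kind;
            this.content.addAll(Arrays.asList(children));
        }
    }

    /** Depth-first, document-order collection of runs, descending into nested containers. */
    static List<Run> findRuns(Container root) {
        List<Run> found = new ArrayList<>();
        for (Object child : root.content) {
            if (child instanceof Run) {
                found.add((Run) child);
            } else if (child instanceof Container) {
                found.addAll(findRuns((Container) child));
            }
        }
        return found;
    }

    /** Removes target (compared by identity) from whichever container directly holds it. */
    static boolean remove(Container root, Object target) {
        for (int i = 0; i < root.content.size(); i++) {
            Object child = root.content.get(i);
            if (child == target) {
                root.content.remove(i); // index-based removal from the direct parent
                return true;
            }
            if (child instanceof Container && remove((Container) child, target)) {
                return true;
            }
        }
        return false;
    }
}
```

With a paragraph built as `new Container("p", a, new Container("sdtRun", b), c)`, `findRuns` returns `a`, `b`, `c` in document order, and `remove` takes `b` out of the inner container instead of failing to find it at the paragraph level.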
3. Stringifier Evolution: FldChar and Hyperlinks
Our Characterization Testing pipeline got an upgrade. The Stringifier utility now includes:
- FldChar: Exposing field boundaries in test assertions.
- Hyperlink: Representing links within the TOC.
This means that whenever I touch the core engine, the tests will immediately flag if a TOC link is broken or if a field character is misplaced—even if the document still “looks” fine in a basic viewer.
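As a rough illustration of why this helps, here is a minimal sketch of a field-aware stringifier built on the JDK's DOM parser. The token format (`[fldchar begin]`, `[instrText ...]`, `[link anchor: text]`) and the `FieldAwareStringifier` class are assumptions made up for this example; the project's real Stringifier output may look quite different.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class FieldAwareStringifier {
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Simplified TOC field with one hyperlinked entry.
    static final String SAMPLE =
        "<w:p xmlns:w=\"" + W_NS + "\">"
      + "<w:r><w:fldChar w:fldCharType=\"begin\"/></w:r>"
      + "<w:r><w:instrText> TOC \\h </w:instrText></w:r>"
      + "<w:r><w:fldChar w:fldCharType=\"separate\"/></w:r>"
      + "<w:hyperlink w:anchor=\"_Toc1\"><w:r><w:t>Intro</w:t></w:r></w:hyperlink>"
      + "<w:r><w:fldChar w:fldCharType=\"end\"/></w:r>"
      + "</w:p>";

    /** Renders the structure, not just the visible text, as one comparable string. */
    static String stringify(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            StringBuilder sb = new StringBuilder();
            walk(doc.getDocumentElement(), sb);
            return sb.toString();
        } catch (Exception ex) {
            throw new IllegalStateException("Could not parse XML", ex);
        }
    }

    private static void walk(Element e, StringBuilder sb) {
        String name = e.getLocalName();
        if ("fldChar".equals(name)) {
            // Field boundaries become explicit tokens.
            sb.append("[fldchar ").append(e.getAttributeNS(W_NS, "fldCharType")).append("]");
            return;
        }
        if ("instrText".equals(name)) {
            sb.append("[instrText ").append(e.getTextContent().trim()).append("]");
            return;
        }
        if ("t".equals(name)) {
            sb.append(e.getTextContent());
            return;
        }
        boolean link = "hyperlink".equals(name);
        if (link) {
            sb.append("[link ").append(e.getAttributeNS(W_NS, "anchor")).append(": ");
        }
        for (Node child = e.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Element) {
                walk((Element) child, sb);
            }
        }
        if (link) {
            sb.append("]");
        }
    }
}
```

A regression test can then compare `FieldAwareStringifier.stringify(FieldAwareStringifier.SAMPLE)` against a stored expected string. If a refactoring turned the instruction into regular text or dropped the hyperlink wrapper, the visible words would stay the same but the string would change, and the assertion would fail.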
Agility and Craftsmanship Impact
As a solo maintainer, I cannot afford to manually test every template in both Word and Google Docs. These changes reflect a core principle of this project: Correctness must be built into the pipeline.
- Reduced Brittleness: By cleaning up SdtRun and acknowledging the instrText ambiguity, we’ve removed a class of bugs that felt like “magic” regressions.
- Confidence in Refactoring: The updated Stringifier provides a “Golden String” assertion that covers the entire document structure, not just the visible text.
Impact and Next Steps
The immediate impact is high-fidelity document generation that survives the round-trip between different editors. Whether your team uses Google Docs for collaboration or Word for final polish, the TOC and conditional logic will now remain intact.
Next Steps: I’m looking into further automating the verification of “field results” (the numbers in the TOC). While we now validate the structure, ensuring the page numbers are perfectly updated without a manual “Update Field” click in Word remains the next frontier.
July 1st 2025 — Commit Summary:
- Feat: Comprehensive TOC validation in regression tests.
- Feat: Handle FldChar and Hyperlink in stringification.
- Refactor: Clean up SdtRun parent/child handling for Google Docs compatibility.
- Chore: Update pitest to 1.20.0 and junit to 5.13.2.