The pipeline behind the grammar. Each step is reproducible from disk, and every result is traceable to a verbatim quote as evidence.
Anything that cannot be resolved goes on the UNRESOLVED list (currently 84 entries across all domains); it is never invented.
The full sequence: OCR, linguistic validation, semantic tokenisation, concept clustering, rule extraction, formal-grammar synthesis. More than 30 steps, each reproducible from disk.
Manual arbitration of the ambiguous tokens flagged by Layer A (linguistic) and Layer B (vision-LLM). 276 cases reviewed so far.
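How tokens reach the arbitration queue is not spelled out here. As a minimal sketch, assuming a token is queued whenever either layer flags it as ambiguous, the selection could look like this (the `TokenFlags` type, its field names, and the token ids are hypothetical, not the pipeline's actual schema):

```python
from dataclasses import dataclass


@dataclass
class TokenFlags:
    token_id: str        # hypothetical identifier, e.g. page and position
    flagged_by_a: bool   # Layer A (linguistic) marked the token ambiguous
    flagged_by_b: bool   # Layer B (vision-LLM) marked the token ambiguous


def arbitration_queue(tokens):
    """Return tokens flagged by at least one layer (assumed queueing rule)."""
    return [t for t in tokens if t.flagged_by_a or t.flagged_by_b]


# Illustrative data only.
tokens = [
    TokenFlags("p12-t3", flagged_by_a=True, flagged_by_b=False),
    TokenFlags("p12-t4", flagged_by_a=False, flagged_by_b=False),
    TokenFlags("p13-t1", flagged_by_a=True, flagged_by_b=True),
]
queue = arbitration_queue(tokens)
queued_ids = [t.token_id for t in queue]
```

Under this assumed rule, tokens flagged by both layers and tokens flagged by only one land in the same queue; a reviewer could still sort on the two flags to prioritise double-flagged cases.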
Page-by-page OCR review, with the scan and the extracted text side by side. Used to track residual quirks (visarga ambiguities, glued svaras, devanāgarī line tagging).
Raw counters across the pipeline: per page, per layer, per confidence bucket. Useful for diagnosing where data is dense or sparse.
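The three counter groupings above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's own code: the record layout `(page, layer, confidence)`, the bucket names, and the thresholds 0.9 and 0.6 are all hypothetical.

```python
from collections import Counter

# Hypothetical records: one (page, layer, confidence) tuple per token.
records = [
    (1, "A", 0.95),
    (1, "B", 0.40),
    (2, "A", 0.72),
    (2, "A", 0.91),
]


def bucket(confidence: float) -> str:
    """Map a confidence in [0, 1] to a coarse bucket (thresholds are assumed)."""
    if confidence >= 0.9:
        return "high"
    if confidence >= 0.6:
        return "medium"
    return "low"


def tally(rows):
    """Count records per page, per layer, and per confidence bucket."""
    per_page, per_layer, per_bucket = Counter(), Counter(), Counter()
    for page, layer, confidence in rows:
        per_page[page] += 1
        per_layer[layer] += 1
        per_bucket[bucket(confidence)] += 1
    return per_page, per_layer, per_bucket


per_page, per_layer, per_bucket = tally(records)
```

Pages or layers with low counts, or with most records in the low bucket, are where the data is sparse.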