Process

The pipeline behind the grammar. Each step is reproducible from disk, and every result traces back to a verbatim quote from the source.

Anti-fabrication first. No concept, rule, type, operation or constraint exists without at least one affirmation quoting the source. Plausible-but-unsourced values are never invented; they stay in an explicit UNRESOLVED list (currently 84 entries across all domains).
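As a rough illustration of the anti-fabrication rule, the sketch below accepts an entity only when at least one of its affirmations quotes the source verbatim, and routes everything else to an unresolved list. The `Entity` and `Affirmation` classes, the `partition` function and the sample text are all hypothetical names chosen for this example; the actual pipeline's data model may differ.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Affirmation:
    quote: str  # hypothetical field: text expected to appear verbatim in the source


@dataclass
class Entity:
    kind: str  # e.g. "concept", "rule", "type", "operation", "constraint"
    name: str
    affirmations: List[Affirmation] = field(default_factory=list)


def partition(entities: List[Entity], source_text: str) -> Tuple[List[Entity], List[Entity]]:
    """Split entities into (accepted, unresolved).

    An entity is accepted only if at least one affirmation carries a
    non-empty quote found verbatim in source_text (assumed check: plain
    substring match).
    """
    accepted, unresolved = [], []
    for entity in entities:
        backed = any(a.quote and a.quote in source_text for a in entity.affirmations)
        (accepted if backed else unresolved).append(entity)
    return accepted, unresolved


# Illustrative data only, not taken from the real corpus.
source = "the priest kindles the fire at dawn"
entities = [
    Entity("concept", "fire", [Affirmation("kindles the fire")]),
    Entity("rule", "dusk-offering", [Affirmation("at dusk")]),  # quote absent from source
    Entity("type", "vessel"),  # no affirmation at all
]
accepted, unresolved = partition(entities, source)
print([e.name for e in accepted])    # ['fire']
print([e.name for e in unresolved])  # ['dusk-offering', 'vessel']
```

A plain substring match is the strictest reading of "verbatim"; any normalisation (whitespace, Unicode form) would be a further assumption and is left out here.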

Orchestration

Pipeline steps

The full sequence: OCR, linguistic validation, semantic tokenisation, concept clustering, rule extraction, formal-grammar synthesis. More than 30 steps, each reproducible from disk.

View the full sequence →

Human-in-the-loop

OCR review queue

Manual arbitration on the ambiguous tokens flagged by Layer A (linguistic) and Layer B (vision-LLM). 276 cases reviewed so far.

276 reviewed →

Diagnostics

OCR audit

Page-by-page OCR review with the scan and the extracted text side by side. Used to track residual quirks (visarga ambiguities, glued svaras, devanāgarī line tagging).

274 pages audited →

Diagnostics

Stats

Raw counters across the pipeline: per page, per layer, per confidence bucket. Useful for diagnosing where data is dense or sparse.

Cross-step counters →
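The per-page, per-layer and per-confidence-bucket counters described under Stats could be tallied along these lines. This is a minimal sketch: the record fields (`page`, `layer`, `confidence`), the bucket names and the 0.9 / 0.6 thresholds are assumptions made for illustration, not values from the pipeline.

```python
from collections import Counter


def bucket(confidence: float) -> str:
    """Map a confidence score to a coarse bucket (thresholds are assumed)."""
    if confidence >= 0.9:
        return "high"
    if confidence >= 0.6:
        return "medium"
    return "low"


def tally(records):
    """Count records per page, per layer and per confidence bucket."""
    per_page, per_layer, per_bucket = Counter(), Counter(), Counter()
    for record in records:
        per_page[record["page"]] += 1
        per_layer[record["layer"]] += 1
        per_bucket[bucket(record["confidence"])] += 1
    return per_page, per_layer, per_bucket


# Illustrative records only; field names are hypothetical.
records = [
    {"page": 1, "layer": "A", "confidence": 0.95},
    {"page": 1, "layer": "B", "confidence": 0.40},
    {"page": 2, "layer": "A", "confidence": 0.70},
]
per_page, per_layer, per_bucket = tally(records)
print(dict(per_page))
print(dict(per_layer))
print(dict(per_bucket))
```

Pages with few records, or a large share in the low bucket, are the sparse spots such counters would surface.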