Document recognition as transcription: A new perspective
A collaboration with Swiss engineering specialist Bossard Group on a digital parts production platform revealed a critical bottleneck: the automated interpretation of engineering drawings. This industrial challenge sparked foundational research with significant implications for the broader fields of document recognition and the development of future document foundation models.
Many specialized documents, like engineering drawings or sheet music, follow a strict set of rules, a kind of "visual grammar", to convey precise information. Yet most modern AI systems treat them as simple pictures and frame document understanding as a computer vision task. For example, a common approach is to use object detection to find individual symbols. This is inherently incomplete, because it fails to capture the essential relationships that connect those symbols. As a result, these systems must rely on suboptimal, heuristic post-processing to reassemble the document's meaning, a process that fails for many complex document types.
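To make the limitation concrete, here is a minimal Python sketch. All names and values (Detection, Record, the dimension and edge labels, the box coordinates) are hypothetical and chosen only for illustration; they are not taken from the preprint. It shows that the same set of detected symbols is compatible with two different meanings, so detection output alone does not determine the document's content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected symbol: a class label plus a bounding box (hypothetical)."""
    label: str
    box: tuple  # (x_min, y_min, x_max, y_max)


@dataclass(frozen=True)
class Record:
    """Toy structured record: the symbols plus the links between them."""
    symbols: frozenset
    links: frozenset  # pairs of (dimension label, edge label)


# What an object detector would return: symbols only, no relationships.
detections = frozenset({
    Detection("dim:20mm", (10, 5, 30, 10)),
    Detection("dim:35mm", (10, 40, 30, 45)),
    Detection("edge:A", (0, 12, 50, 14)),
    Detection("edge:B", (0, 32, 50, 34)),
})

# Two different records that share exactly the same detections.
record_1 = Record(detections, frozenset({("dim:20mm", "edge:A"),
                                         ("dim:35mm", "edge:B")}))
record_2 = Record(detections, frozenset({("dim:20mm", "edge:B"),
                                         ("dim:35mm", "edge:A")}))

same_symbols = record_1.symbols == record_2.symbols
same_record = record_1 == record_2
```

In this toy case a nearest-neighbour heuristic on the boxes might pick the right links, but that is exactly the kind of post-processing the text describes as fragile for more complex layouts.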
Our research proposes a fundamental shift in perspective that frames document recognition as a transcription task. This involves converting a document's visual information into its single, underlying "record": the complete, structured data the document was designed to convey. Crucially, for document types that follow strict standards, this transcription is unambiguous. While a single record can be visually represented in many ways, any valid document can be read back into exactly one correct record, providing a stable target for an AI model to learn.
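The many-to-one relationship between documents and records can be sketched in a few lines of Python. This is a toy illustration for monophonic music under stated assumptions: the Symbol type, the transcribe function, and the note encoding are hypothetical and do not reflect the model or data format used in the preprint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Symbol:
    """A symbol as it appears on the page (hypothetical toy format)."""
    value: str  # musical content, e.g. "C4/quarter"
    x: float    # horizontal position: a layout detail, not part of the record


def transcribe(symbols):
    """Read symbols left to right and discard layout, yielding the record."""
    ordered = sorted(symbols, key=lambda s: s.x)
    return tuple(s.value for s in ordered)


# Two visually different renderings of the same melody:
# one tightly spaced, one widely spaced and stored in a different order.
tight = [
    Symbol("C4/quarter", 10.0),
    Symbol("E4/quarter", 20.0),
    Symbol("G4/half", 30.0),
]
wide = [
    Symbol("G4/half", 95.0),
    Symbol("C4/quarter", 15.0),
    Symbol("E4/quarter", 55.0),
]

record_from_tight = transcribe(tight)
record_from_wide = transcribe(wide)
```

Both renderings map to the same tuple of notes, which is what makes the record a stable learning target even though the visual input varies.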
This “document-to-record” perspective implies a natural grouping of documents based on the intrinsic structure of their transcription. To leverage this, we developed a method to embed these structures as inductive biases directly into a flexible base transformer architecture and its training process. We demonstrate the effectiveness of the resulting inductive biases in extensive experiments with progressively complex record structures, ranging from monophonic sheet music to shape drawings and simplified engineering drawings. By integrating an inductive bias for unrestricted graph structures, we train the first successful end-to-end model that transcribes engineering drawings into their inherently interlinked information. Our approach can inform the design of recognition systems for document types that are less well understood than those covered by standard OCR, OMR, etc., and serves as a guide to unify the design of future document foundation models.
Full preprint: https://lnkd.in/ekPvsqE7
For more information about our partner: Bossard Group