Docs2Synth: A Synthetic Data Trained Retriever Framework for Scanned Visually Rich Documents Understanding
Yihao Ding, Qiang Sun, Puzhen Wu, Sirui Li, Siwen Luo, Wei Liu

TL;DR
Docs2Synth is a framework that uses synthetic supervision to train a visual retriever, improving domain-specific document understanding by reducing hallucinations and enhancing grounding without manual annotations.
Contribution
Docs2Synth introduces a synthetic-supervision approach for training retrievers that improves domain-specific understanding in visually rich document understanding (VRDU) tasks.
Findings
Significantly improves grounding and domain generalization in VRDU benchmarks.
Reduces hallucination and increases response consistency in document understanding.
Does not require manual annotations for training.
Abstract
Visually rich document understanding (VRDU) in regulated domains is particularly challenging, since scanned documents often contain sensitive, evolving, and domain-specific knowledge. This leads to two major challenges: the lack of manual annotations for model adaptation and the difficulty for pretrained models to stay up-to-date with domain-specific facts. While Multimodal Large Language Models (MLLMs) show strong zero-shot abilities, they still suffer from hallucination and limited domain grounding. In contrast, discriminative Vision-Language Pre-trained Models (VLPMs) provide reliable grounding but require costly annotations to cover new domains. We introduce Docs2Synth, a synthetic-supervision framework that enables retrieval-guided inference for private and low-resource domains. Docs2Synth automatically processes raw document collections, generates and verifies diverse QA pairs via an…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
