One-Shot Doc Snippet Detection: Powering Search in Document Beyond Text
Abhinav Java, Shripad Deshmukh, Milan Aggarwal, Surgan Jandial, Mausoom Sarkar, Balaji Krishnamurthy

TL;DR
This paper introduces MONOMER, a novel one-shot snippet detection method that leverages visual, textual, and spatial cues to locate similar document snippets, outperforming existing baselines in structured document search.
Contribution
MONOMER is the first model to address one-shot document snippet detection by fusing multimodal information. Because real data is limited, it is trained on synthetic data.
Findings
MONOMER outperforms baselines like BHRL and LayoutLMv3 in snippet detection accuracy.
Synthetic data effectively trains MONOMER, validated by human studies.
Multimodal fusion improves detection in complex document layouts.
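To make the idea of matching a single query snippet against candidate regions more concrete, here is a minimal sketch of one-shot matching over fused features. It is not MONOMER's architecture: the summary above does not describe how the model fuses modalities, so the concatenation-based fusion, the cosine-similarity scoring, the `threshold` value, and all function names (`fuse`, `detect_similar`) are assumptions made purely for illustration. The feature vectors are hypothetical stand-ins for whatever visual, textual, and spatial encoders would produce.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so no single modality dominates."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(visual, textual, spatial):
    """Assumed fusion: concatenate per-modality normalized vectors.

    This is a simple late-fusion stand-in, not the paper's mechanism.
    """
    return l2_normalize(visual) + l2_normalize(textual) + l2_normalize(spatial)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def detect_similar(query, candidates, threshold=0.8):
    """Return (box, score) pairs whose fused features resemble the query.

    query:      (visual, textual, spatial) feature vectors of the one-shot snippet
    candidates: list of (box, (visual, textual, spatial)) for regions on the target page
    threshold:  hypothetical similarity cutoff chosen for illustration
    """
    q = fuse(*query)
    hits = []
    for box, feats in candidates:
        score = cosine(q, fuse(*feats))
        if score >= threshold:
            hits.append((box, score))
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Toy data: one candidate matches the query in every modality, one does not.
query = ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])
candidates = [
    ((0, 0, 10, 10), ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])),
    ((20, 20, 30, 30), ([0.0, 1.0], [1.0, 0.0], [1.0, -1.0])),
]
hits = detect_similar(query, candidates)
```

Normalizing each modality before concatenation gives the three cues equal weight in the similarity score, which is one simple way to keep a text-only match from outranking a region that also agrees in layout and appearance.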
Abstract
Active consumption of digital documents has yielded scope for research in various applications, including search. Traditionally, searching within a document has been cast as a text-matching problem, ignoring the rich layout and visual cues commonly present in structured documents, forms, etc. To that end, we ask a mostly unexplored question: "Can we search for other similar snippets present in a target document page given a single query instance of a document snippet?" We propose MONOMER to solve this as a one-shot snippet detection task. MONOMER fuses context from the visual, textual, and spatial modalities of snippets and documents to find the query snippet in target documents. We conduct extensive ablations and experiments showing that MONOMER outperforms several baselines from one-shot object detection (BHRL), template matching, and document understanding (LayoutLMv3). Due to the scarcity of…
Taxonomy
Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Advanced Image and Video Retrieval Techniques
