GlobalDoc: A Cross-Modal Vision-Language Framework for Real-World Document Image Retrieval and Classification
Souhail Bakkali, Sanket Biswas, Zuheng Ming, Mickaël Coustaty, Marçal Rusiñol, Oriol Ramos Terrades, Josep Lladós

TL;DR
GlobalDoc is a cross-modal transformer framework that improves real-world document image retrieval and classification by unifying language and visual representations through self-supervised pre-training. It is evaluated on newly proposed document-level tasks.
Contribution
It introduces a unified cross-modal architecture with three novel pretext tasks and proposes two new document-level evaluation benchmarks for industrial scenarios.
Findings
GlobalDoc outperforms existing models on new downstream tasks.
The framework demonstrates robustness and transferability in practical settings.
Self-supervised pre-training improves semantic understanding of documents.
Abstract
Visual document understanding (VDU) has rapidly advanced with the development of powerful multi-modal language models. However, these models typically require extensive document pre-training data to learn intermediate representations and often suffer a significant performance drop in real-world online industrial settings. A primary issue is their heavy reliance on OCR engines to extract local positional information within document pages, which limits the models' ability to capture global information and hinders their generalizability, flexibility, and robustness. In this paper, we introduce GlobalDoc, a cross-modal transformer-based architecture pre-trained in a self-supervised manner using three novel pretext objective tasks. GlobalDoc improves the learning of richer semantic concepts by unifying language and visual representations, resulting in more transferable models. For proper…
Taxonomy
Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Advanced Image and Video Retrieval Techniques
