Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents
Yiqi Lin, Alex Jinpeng Wang, Linjie Li, Zhengyuan Yang, Mike Zheng Shou

TL;DR
This paper introduces VC2L, a vision-centric contrastive learning framework that models complex web documents in pixel space, improving cross-modal retrieval and understanding without relying on OCR or explicit modality fusion.
Contribution
The paper proposes a unified, pixel-space vision transformer approach for contrastive learning on web documents, addressing complex interleaved text and images without explicit text processing.
Findings
VC2L achieves competitive or superior performance on multiple benchmarks.
The approach effectively models complex web documents without OCR or text tokenization.
New benchmarks demonstrate the method's ability to generalize to unseen data.
Abstract
Contrastive vision-language models such as CLIP have demonstrated strong performance across a wide range of multimodal tasks by learning from aligned image-text pairs. However, their ability to handle complex, real-world web documents remains limited, particularly in scenarios where text and images are interleaved, loosely aligned, or embedded in visual form. To address these challenges, we propose Vision-Centric Contrastive Learning (VC2L), a unified framework that models text, images, and their combinations using a single vision transformer. VC2L operates entirely in pixel space by rendering all inputs, whether textual, visual, or combined, as images, thus eliminating the need for OCR, text tokenization, or modality fusion strategy. To capture complex cross-modal relationships in multimodal web documents, VC2L employs a snippet-level contrastive learning objective that aligns…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
