Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents

Yiqi Lin; Alex Jinpeng Wang; Linjie Li; Zhengyuan Yang; Mike Zheng Shou

arXiv:2510.18703·cs.CV·October 22, 2025

TL;DR

This paper introduces VC2L, a vision-centric contrastive learning framework that models complex web documents in pixel space, improving cross-modal retrieval and understanding without relying on OCR or explicit modality fusion.

Contribution

The paper proposes a unified, pixel-space vision transformer approach for contrastive learning on web documents, addressing complex interleaved text and images without explicit text processing.

Findings

01. VC2L achieves competitive or superior performance on multiple benchmarks.

02. The approach effectively models complex web documents without OCR or text tokenization.

03. New benchmarks demonstrate the method's ability to generalize to unseen data.

Abstract

Contrastive vision-language models such as CLIP have demonstrated strong performance across a wide range of multimodal tasks by learning from aligned image-text pairs. However, their ability to handle complex, real-world web documents remains limited, particularly in scenarios where text and images are interleaved, loosely aligned, or embedded in visual form. To address these challenges, we propose Vision-Centric Contrastive Learning (VC2L), a unified framework that models text, images, and their combinations using a single vision transformer. VC2L operates entirely in pixel space by rendering all inputs, whether textual, visual, or combined, as images, thus eliminating the need for OCR, text tokenization, or modality fusion strategy. To capture complex cross-modal relationships in multimodal web documents, VC2L employs a snippet-level contrastive learning objective that aligns…
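
The contrastive objective mentioned in the abstract can be illustrated with a minimal sketch of a symmetric InfoNCE loss over paired embeddings. This is not the authors' implementation: the function names `l2_normalize` and `info_nce` are hypothetical, the embeddings stand in for the outputs of a single vision transformer applied to rendered snippets, the temperature of 0.07 is a common CLIP-style value rather than one taken from the paper, and how snippets are paired is an assumption because the abstract is truncated at that point.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    anchors[i] and positives[i] are assumed to be embeddings of two
    related snippets; every other pairing in the batch is a negative.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    n = len(a)

    # logits[i][j]: cosine similarity of anchor i and positive j, scaled.
    logits = [
        [sum(x * y for x, y in zip(a[i], p[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[i]
        return total / len(rows)

    # Transpose to score the positive-to-anchor direction as well.
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


if __name__ == "__main__":
    aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
    swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < swapped)
```

The loss is small when each anchor is most similar to its own positive and large when the pairs are swapped, which is the behaviour a contrastive objective is meant to reward.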

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning