ColMate: Contrastive Late Interaction and Masked Text for Multimodal Document Retrieval

Ahmed Masry; Megh Thakkar; Patrice Bechard; Sathwik Tejaswi Madhusudhan; Rabiul Awal; Shambhavi Mishra; Akshay Kalkunte Suresh; Srivatsava Daruru; Enamul Hoque; Spandana Gella; Torsten Scholak; Sai Rajeswar

arXiv:2511.00903 · cs.CL · November 4, 2025

TL;DR

ColMate is a novel multimodal document retrieval model that combines OCR-based pretraining, masked contrastive learning, and late interaction scoring, achieving significant improvements over existing methods, especially in out-of-domain scenarios.

Contribution

The paper introduces ColMate, a new multimodal retrieval model with specialized pretraining and scoring mechanisms tailored for multimodal documents, addressing limitations of text-only adapted methods.

Findings

01 Achieves a 3.61% improvement over existing retrieval models on the ViDoRe V2 benchmark.

02 Demonstrates stronger generalization to out-of-domain benchmarks.

03 Uses OCR-based pretraining, masked contrastive learning, and late interaction scoring.

Abstract

Retrieval-augmented generation has proven practical when models require specialized knowledge or access to the latest data. However, existing methods for multimodal document retrieval often replicate techniques developed for text-only retrieval, whether in how they encode documents, define training objectives, or compute similarity scores. To address these limitations, we present ColMate, a document retrieval model that bridges the gap between multimodal representation learning and document retrieval. ColMate utilizes a novel OCR-based pretraining objective, a self-supervised masked contrastive learning objective, and a late interaction scoring mechanism more relevant to multimodal document structures and visual characteristics. ColMate obtains a 3.61% improvement over existing retrieval models on the ViDoRe V2 benchmark, demonstrating stronger generalization to out-of-domain benchmarks.
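
For readers unfamiliar with late interaction scoring, the following is a minimal sketch of the standard ColBERT-style MaxSim formulation that late-interaction retrievers commonly build on. It is not ColMate's own scoring mechanism, which this summary does not specify. The function names (`l2_normalize`, `maxsim_score`, `rank_documents`) and the toy 2-dimensional embeddings are hypothetical, chosen only for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def maxsim_score(query_tokens, doc_tokens):
    """Standard MaxSim late interaction (assumed baseline, not ColMate's variant):
    for each query token embedding, take its maximum cosine similarity over all
    document token (or image patch) embeddings, then sum those maxima."""
    q = [l2_normalize(v) for v in query_tokens]
    d = [l2_normalize(v) for v in doc_tokens]
    total = 0.0
    for qv in q:
        total += max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
    return total

def rank_documents(query_tokens, docs):
    """Return document indices sorted by descending MaxSim score."""
    scores = [maxsim_score(query_tokens, d) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

# Toy 2-dimensional embeddings (hypothetical values, for illustration only).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0]]    # has a close match for both query tokens
doc_b = [[1.0, 0.0], [-1.0, 0.0]]   # has a close match for the first token only

print(rank_documents(query, [doc_b, doc_a]))
```

Because each query token is matched against its best document token independently, a document is rewarded for covering every part of the query rather than for a single pooled similarity, which is the property that distinguishes late interaction from single-vector retrieval.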

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Information Retrieval and Search Behavior