Llama Nemoretriever Colembed: Top-Performing Text-Image Retrieval Model

Mengyao Xu; Gabriel Moreira; Ronay Ak; Radek Osmulski; Yauhen Babakhin; Zhiding Yu; Benedikt Schifferer; Even Oldridge

arXiv:2507.05513·cs.CV·July 9, 2025

Open Access 5 Models

TL;DR

This paper introduces llama-nemoretriever-colembed, a unified text-image retrieval model that achieves state-of-the-art results on multiple benchmarks by combining a vision-language model with a ColBERT-style interaction and a two-stage training process.

Contribution

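The paper presents a unified multimodal retrieval model built on the NVIDIA Eagle2 VLM, with causal attention replaced by bidirectional attention, a ColBERT-style late interaction mechanism, and a two-stage training strategy. The 3B variant placed first on the ViDoRe V1 and ViDoRe V2 leaderboards as of June 27, 2025.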

Findings

01. The 3B model scores NDCG@5 of 91.0 on ViDoRe V1.

02. The 3B model scores NDCG@5 of 63.5 on ViDoRe V2.

03. The paper analyzes the storage and efficiency trade-offs introduced by the fine-grained (late interaction) retrieval mechanism.
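The headline numbers above are NDCG@5 scores. As a reminder of what that metric measures, here is a minimal sketch in plain Python. It assumes the linear-gain form of DCG and that the input list holds the relevance grade of every judged document for one query, in the order the system ranked them. The function names `dcg_at_k` and `ndcg_at_k` are illustrative and are not taken from the paper or from the ViDoRe evaluation code.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked relevance grades.

    Assumption: linear gain (rel / log2(rank + 1) with 1-based ranks),
    not the exponential 2**rel - 1 variant.
    """
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=5):
    """NDCG@k: DCG of the system ranking divided by DCG of the ideal ranking."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        # No relevant document was judged for this query.
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Toy query: the single relevant page is ranked second instead of first.
score = ndcg_at_k([0, 1, 0, 0, 0], k=5)
```

A benchmark score is typically the mean of this per-query value over all queries, reported on a 0 to 100 scale.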

Abstract

Motivated by the growing demand for retrieval systems that operate across modalities, we introduce llama-nemoretriever-colembed, a unified text-image retrieval model that delivers state-of-the-art performance across multiple benchmarks. We release two model variants, 1B and 3B. The 3B model scores NDCG@5 of 91.0 on ViDoRe V1 and 63.5 on ViDoRe V2, placing first on both leaderboards as of June 27, 2025. Our approach leverages the NVIDIA Eagle2 Vision-Language Model (VLM), modifies its architecture by replacing causal attention with bidirectional attention, and integrates a ColBERT-style late interaction mechanism to enable fine-grained multimodal retrieval in a shared embedding space. While this mechanism delivers superior retrieval accuracy, it introduces trade-offs in storage and efficiency. We provide a comprehensive analysis of these trade-offs.…
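To make the ColBERT-style late interaction and its storage cost concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes the standard MaxSim formulation: each query token embedding is matched to its most similar document token embedding, and the per-token maxima are summed. All names (`maxsim_score`, `rank_documents`, `embedding_bytes`), the toy 2-dimensional vectors, and the sizes in the storage comparison are hypothetical and chosen only for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_tokens, doc_tokens):
    """Late interaction score: for each query token embedding, take the
    highest dot product against any document token embedding, then sum."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def rank_documents(query_tokens, corpus):
    """Return document ids sorted by descending MaxSim score."""
    scores = {doc_id: maxsim_score(query_tokens, toks) for doc_id, toks in corpus.items()}
    return sorted(scores, key=scores.get, reverse=True)


def embedding_bytes(num_vectors, dim, bytes_per_value):
    """Storage for one document's embeddings (hypothetical sizes only)."""
    return num_vectors * dim * bytes_per_value


# Toy example: 2 query token vectors, 2 documents with multiple token vectors each.
query = [[1.0, 0.0], [0.0, 1.0]]
corpus = {
    "page_a": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    "page_b": [[0.6, 0.0], [0.0, 0.2]],
}
ranking = rank_documents(query, corpus)

# Hypothetical storage comparison: 1000 token vectors per page versus one
# pooled vector per page, both at 128 dimensions and 2 bytes per value.
multi_vector = embedding_bytes(1000, 128, 2)
single_vector = embedding_bytes(1, 128, 2)
```

The sketch shows where the trade-off comes from: a multi-vector index stores one embedding per token or image patch, so its footprint and scoring cost grow with the number of vectors per document, whereas a single-vector model stores and compares one vector per document.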

Taxonomy

Topics: Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques

Methods: ADaptive gradient method with the OPTimal convergence rate