Llama Nemoretriever Colembed: Top-Performing Text-Image Retrieval Model
Mengyao Xu, Gabriel Moreira, Ronay Ak, Radek Osmulski, Yauhen Babakhin, Zhiding Yu, Benedikt Schifferer, Even Oldridge

TL;DR
This paper introduces llama-nemoretriever-colembed, a unified text-image retrieval model that achieves state-of-the-art results on multiple benchmarks by combining a vision-language model with a ColBERT-style interaction and a two-stage training process.
Contribution
The paper presents a novel unified multimodal retrieval model with a modified architecture and a two-stage training strategy, achieving top performance on key benchmarks.
Findings
3B model scores NDCG@5 of 91.0 on ViDoRe V1
3B model scores NDCG@5 of 63.5 on ViDoRe V2
The paper analyzes the storage and efficiency trade-offs introduced by the fine-grained late-interaction retrieval mechanism
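For readers unfamiliar with the metric behind these numbers, the following is a minimal sketch of how NDCG@5 can be computed for a single query. It assumes linear gain and that the ranked list covers every judged document for the query; the function names `dcg_at_k` and `ndcg_at_k` are illustrative and not taken from the paper or the ViDoRe evaluation code. The sketch returns values in [0, 1], while the reported scores (91.0, 63.5) appear to be on a 0-100 scale.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results (linear gain)."""
    # The result at 0-based position `rank` is discounted by log2(rank + 2),
    # so the top result is divided by log2(2) = 1.
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=5):
    """NDCG@k for one query.

    `ranked_relevances` holds the relevance label of each retrieved item,
    in the order the system ranked them. Assumption: the list contains
    every judged item for the query, so sorting it gives the ideal order.
    """
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant items: define the score as 0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


if __name__ == "__main__":
    # The only relevant page is ranked second instead of first.
    print(ndcg_at_k([0, 1, 0, 0, 0], k=5))
```

A benchmark score is then typically the mean of this per-query value over all queries in each task.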
Abstract
Motivated by the growing demand for retrieval systems that operate across modalities, we introduce llama-nemoretriever-colembed, a unified text-image retrieval model that delivers state-of-the-art performance across multiple benchmarks. We release two model variants, 1B and 3B. The 3B model achieves state-of-the-art performance, scoring NDCG@5 91.0 on ViDoRe V1 and 63.5 on ViDoRe V2, placing first on both leaderboards as of June 27, 2025. Our approach leverages the NVIDIA Eagle2 Vision-Language model (VLM), modifies its architecture by replacing causal attention with bidirectional attention, and integrates a ColBERT-style late interaction mechanism to enable fine-grained multimodal retrieval in a shared embedding space. While this mechanism delivers superior retrieval accuracy, it introduces trade-offs in storage and efficiency. We provide a comprehensive analysis of these trade-offs.…
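To make the late interaction mechanism and its storage cost concrete, here is a minimal pure-Python sketch of ColBERT-style MaxSim scoring, where each query token embedding is matched to its most similar document token (or image patch) embedding and the maxima are summed. This is an illustration under stated assumptions, not the authors' implementation: the helper names (`maxsim_score`, `embedding_storage_bytes`), the cosine normalization, and all sizes in the usage example are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens, doc_tokens):
    """ColBERT-style late interaction score.

    For every query token embedding, take the highest cosine similarity
    against any document token embedding, then sum over query tokens.
    """
    q = [l2_normalize(v) for v in query_tokens]
    d = [l2_normalize(v) for v in doc_tokens]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)


def embedding_storage_bytes(num_docs, vectors_per_doc, dim, bytes_per_value=2):
    """Rough index size: one `dim`-sized vector per stored token or patch.

    Assumption: dense storage without compression; 2 bytes per value
    corresponds to 16-bit floats.
    """
    return num_docs * vectors_per_doc * dim * bytes_per_value


if __name__ == "__main__":
    query = [[1.0, 0.0], [0.0, 1.0]]            # two query token embeddings
    doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # covers both query tokens
    doc_b = [[1.0, 0.0]]                        # covers only the first
    print(maxsim_score(query, doc_a), maxsim_score(query, doc_b))

    # Hypothetical sizes: one vector per document vs. 100 vectors per document.
    print(embedding_storage_bytes(1000, 1, 128))
    print(embedding_storage_bytes(1000, 100, 128))
```

The storage helper shows where the trade-off comes from: a multi-vector index grows linearly with the number of vectors kept per document, whereas a single-vector index stores one vector per document regardless of its length.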
Code & Models
- 🤗 nvidia/llama-nemotron-embed-vl-1b-v2 · model · 42k dl · ♡ 50
- 🤗 nvidia/llama-nemoretriever-colembed-1b-v1 · model · 95 dl · ♡ 26
- 🤗 nvidia/llama-nemoretriever-colembed-3b-v1 · model · 322 dl · ♡ 74
- 🤗 Waheeddahar/llama-nemotron-embed-vl-1b-v2 · model · 12 dl
- 🤗 tomaarsen/llama-nemotron-embed-vl-1b-v2 · model · 27 dl
Taxonomy
Topics: Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques
Methods: ADaptive gradient method with the OPTimal convergence rate
