CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding

Jiahao Huo; Yu Huang; Yibo Yan; Ye Pan; Kening Zheng; Wei-Chieh Huang; Yi Cao; Mingdong Ou; Philip S. Yu; Xuming Hu

arXiv:2601.21262 · cs.CL · April 17, 2026

TL;DR

CausalEmbed generates multi-vector embeddings for visual documents auto-regressively, cutting the visual token count by 30-155x, and the storage overhead with it, while keeping retrieval performance competitive.

Contribution

The paper proposes an auto-regressive method for generating multi-vector embeddings, trained contrastively with an iterative margin loss, to make visual document retrieval scalable and efficient.

Findings

01 Achieves a 30-155x reduction in the number of visual tokens used to represent a page.

02 Maintains competitive retrieval performance across various backbones and benchmarks.

03 Demonstrates advantages in training efficiency and scalability.
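To make the first finding concrete, here is a minimal sketch of how the token count per page drives the raw size of a multi-vector index. Every number in it (page count, vector dimension, bytes per value, and the 1024 and 32 vectors per page) is a hypothetical value chosen for illustration and is not taken from the paper; the paper only states that baselines use thousands of visual tokens per page, that CausalEmbed uses dozens, and that the reduction is 30-155x.

```python
def index_bytes(num_pages, vectors_per_page, dim, bytes_per_value):
    """Raw size of a multi-vector index, ignoring metadata and compression."""
    return num_pages * vectors_per_page * dim * bytes_per_value


def reduction_factor(baseline_vectors, compact_vectors):
    """How many times fewer vectors per page the compact representation stores."""
    return baseline_vectors / compact_vectors


# Hypothetical settings, for illustration only (not from the paper).
NUM_PAGES = 1_000_000
DIM = 128              # assumed embedding dimension
BYTES_PER_VALUE = 2    # assumed 16-bit storage

BASELINE_VECTORS = 1024  # assumed "thousands of visual tokens" regime
COMPACT_VECTORS = 32     # assumed "dozens of visual tokens" regime

baseline_size = index_bytes(NUM_PAGES, BASELINE_VECTORS, DIM, BYTES_PER_VALUE)
compact_size = index_bytes(NUM_PAGES, COMPACT_VECTORS, DIM, BYTES_PER_VALUE)

print(f"baseline index: {baseline_size / 1e9:.1f} GB")
print(f"compact index:  {compact_size / 1e9:.1f} GB")
print(f"reduction:      {reduction_factor(BASELINE_VECTORS, COMPACT_VECTORS):.0f}x")
```

Because index size is linear in the number of vectors per page, the storage reduction under these assumptions equals the token reduction.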

Abstract

Although Multimodal Large Language Models (MLLMs) have shown remarkable potential in Visual Document Retrieval (VDR) through generating high-quality multi-vector embeddings, the substantial storage overhead caused by representing a page with thousands of visual tokens limits their practicality in real-world applications. To address this challenge, we propose an auto-regressive generation approach, CausalEmbed, for constructing multi-vector embeddings. By incorporating iterative margin loss during contrastive training, CausalEmbed encourages the embedding models to learn compact and well-structured representations. Our method enables efficient VDR tasks using only dozens of visual tokens, achieving a 30-155x reduction in token count while maintaining highly competitive performance across various backbones and benchmarks. Theoretical analysis and empirical results demonstrate the unique…
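The abstract does not say how a query is scored against a page's multi-vector embedding. A common choice for multi-vector retrieval is ColBERT-style late interaction (MaxSim), and the sketch below assumes that scoring rule; it is not confirmed as the method used by CausalEmbed. The function names `maxsim_score` and `rank_pages` are hypothetical, and the sketch covers only scoring, not the auto-regressive generation or the iterative margin loss.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score (assumed scoring rule, not from the paper):
    for each query vector take its best cosine match among the page's
    vectors, then sum those maxima over the query vectors."""
    q = [l2_normalize(v) for v in query_vecs]
    d = [l2_normalize(v) for v in doc_vecs]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)


def rank_pages(query_vecs, pages):
    """Return page ids sorted from best to worst MaxSim score.

    `pages` maps a page id to its list of embedding vectors."""
    scores = {pid: maxsim_score(query_vecs, vecs) for pid, vecs in pages.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: two query vectors and two pages with very few vectors each.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_a": [[2.0, 0.0], [0.0, 3.0]],  # matches both query vectors
    "page_b": [[1.0, 0.0]],              # matches only the first
}
ranking = rank_pages(query, pages)
print(ranking)
```

Under this scoring rule, the cost of scoring one page grows with the number of page vectors, so a page represented by dozens of vectors is cheaper to store and to score than one represented by thousands.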

Code & Models

Repositories

Z1zs/Causal-Embed (GitHub)
