CART: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling

Minghui Fang; Shengpeng Ji; Jialong Zuo; Hai Huang; Yan Xia; Jieming Zhu; Xize Cheng; Xiaoda Yang; Wenrui Liu; Gang Wang; Zhenhua Dong; Zhou Zhao

arXiv:2406.17507·cs.IR·December 2, 2025

TL;DR

CART introduces a generative cross-modal retrieval framework that employs coarse-to-fine semantic modeling with discretized multimodal data, enhancing retrieval accuracy and efficiency.

Contribution

The paper presents a novel generative retrieval framework using coarse-to-fine semantic modeling and discretization techniques, reducing training costs and inference latency.

Findings

01. Achieves superior retrieval performance compared to traditional methods.

02. Demonstrates improved efficiency in large-scale cross-modal retrieval.

03. Validates effectiveness through extensive experiments.

Abstract

Cross-modal retrieval aims to find instances that are semantically related to a query through the interaction of data from different modalities. Traditional solutions use a single-tower or dual-tower framework to explicitly compute a score between each query and candidate, which incurs high training cost and inference latency on large-scale data. Inspired by the remarkable performance and efficiency of generative models, we propose a generative cross-modal retrieval framework (CART) based on coarse-to-fine semantic modeling, which assigns an identifier to each candidate and treats generating that identifier as the retrieval target. Specifically, we explore an effective coarse-to-fine scheme that combines K-Means and RQ-VAE to discretize multimodal data into token sequences that support autoregressive generation. Further, considering the lack of explicit interaction between queries…
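To make the coarse-to-fine identifier idea in the abstract more concrete, here is a minimal sketch in Python. It is not the authors' implementation. It assumes candidate embeddings are already available as a NumPy array, uses plain K-Means for the coarse token, and swaps the trained RQ-VAE for a simple K-Means-based residual quantizer so the example stays self-contained. The choice to quantize the residual left after subtracting the coarse centroid, and the names `kmeans`, `coarse_to_fine_ids`, `n_coarse`, `n_levels`, and `codebook_size`, are all hypothetical.

```python
import numpy as np


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's K-Means. Returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points.
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every point to every centroid.
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # leave empty clusters where they are
                centroids[j] = members.mean(axis=0)
    # Final assignment against the updated centroids.
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids, dist.argmin(axis=1)


def coarse_to_fine_ids(emb, n_coarse=4, n_levels=2, codebook_size=8, seed=0):
    """Map each embedding to a token sequence [coarse, fine_1, ..., fine_L].

    Assumption: the fine tokens quantize the residual left after removing
    the coarse centroid. A trained RQ-VAE would learn its codebooks jointly
    with an encoder/decoder; here each codebook is just K-Means centroids.
    """
    centroids, coarse = kmeans(emb, n_coarse, seed=seed)
    residual = emb - centroids[coarse]
    tokens = [coarse]
    for level in range(n_levels):
        codebook, codes = kmeans(residual, codebook_size, seed=seed + level + 1)
        tokens.append(codes)
        residual = residual - codebook[codes]  # pass what is left to the next level
    return np.stack(tokens, axis=1), residual


# Usage on random stand-in embeddings (64 candidates, 16 dimensions).
rng = np.random.default_rng(0)
emb = rng.normal(size=(64, 16))
ids, residual = coarse_to_fine_ids(emb)
print(ids[:3])  # one row of tokens per candidate
```

Each row of `ids` is the kind of short discrete sequence an autoregressive model could be trained to generate for a query. The first token narrows the search to a coarse cluster and the later tokens refine it, which is the intuition behind the coarse-to-fine ordering.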

Taxonomy

Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Topic Modeling

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · ALIGN