CREM: Compression-Driven Representation Enhancement for Multimodal Retrieval and Comprehension
Lihao Liu, Yan Wang, Biao Yang, Da Li, Jiangxia Cao, Yuxiao Luo, Xiang Chen, Xiangyu Wu, Wei Yuan, Fan Yang, Guiguang Ding, Tingting Gao, Guorui Zhou

TL;DR
CREM is a unified model that enhances multimodal representations for retrieval tasks while maintaining generative capabilities, using compression-based prompts and training strategies to align cross-modal semantics.
Contribution
The paper introduces CREM, a novel framework that combines compression-driven prompts and training to improve retrieval and comprehension in multimodal large language models.
Findings
Achieves state-of-the-art retrieval performance on MMEB.
Maintains strong generative performance on multiple benchmarks.
Generative supervision enhances representation quality.
Abstract
Multimodal Large Language Models (MLLMs) have shown remarkable success in comprehension tasks such as visual description and visual question answering. However, their direct application to embedding-based tasks like retrieval remains challenging due to the discrepancy between output formats and optimization objectives. Previous approaches often employ contrastive fine-tuning to adapt MLLMs for retrieval, but at the cost of losing their generative capabilities. We argue that both generative and embedding tasks fundamentally rely on shared cognitive mechanisms, specifically cross-modal representation alignment and contextual comprehension. To this end, we propose CREM (Compression-driven Representation Enhanced Model), with a unified framework that enhances multimodal representations for retrieval while preserving generative ability. Specifically, we introduce a compression-based prompt…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
