Reconstructing Content via Collaborative Attention to Improve Multimodal Embedding Quality
Jiahan Chen, Da Li, Hengran Zhang, Yinqiong Cai, Lixin Su, Jiafeng Guo, Daiting Shi, Dawei Yin, Keping Bi

TL;DR
This paper introduces CoCoA, a content reconstruction pre-training paradigm that improves multimodal embedding quality by restructuring the attention flow and encouraging semantic compression, yielding more compact and informative representations.
Contribution
The paper proposes CoCoA, a new pre-training method based on collaborative attention and EOS reconstruction, improving the semantic quality of multimodal embeddings.
Findings
CoCoA significantly improves embedding quality on MMEB-V1.
Content reconstruction enhances the semantic compression of multimodal models.
The approach raises the performance ceiling of existing multimodal embedding models.
Abstract
Multimodal embedding models, rooted in multimodal large language models (MLLMs), have yielded significant performance improvements across diverse tasks such as retrieval and classification. However, most existing approaches rely heavily on large-scale contrastive learning, with limited exploration of how the architectural and training paradigms of MLLMs affect embedding quality. While effective for generation, the causal attention and next-token prediction paradigm of MLLMs does not explicitly encourage the formation of globally compact representations, limiting their effectiveness as multimodal embedding backbones. To address this, we propose CoCoA, a Content reconstruction pre-training paradigm based on Collaborative Attention for multimodal embedding optimization. Specifically, we restructure the attention flow and introduce an EOS-based reconstruction task, encouraging the model to…
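The abstract's observation about causal attention can be made concrete with a small sketch. The code below is not CoCoA's collaborative attention or its reconstruction task, neither of which is specified in this summary. It only illustrates the standard causal mask and last-token (EOS) pooling that the abstract takes as its starting point. All function names here are hypothetical and chosen for illustration.

```python
# Illustrative sketch only: standard causal masking and EOS (last-token) pooling.
# This is NOT the CoCoA method; helper names are hypothetical.

def causal_mask(n):
    """Return an n x n mask; mask[i][j] is True if position i may attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def visible_fraction(mask):
    """Fraction of the full sequence that each position is allowed to attend to."""
    n = len(mask)
    return [sum(row) / n for row in mask]


def last_token_pool(hidden_states):
    """Take the final (EOS) position's hidden state as the sequence embedding."""
    return hidden_states[-1]


mask = causal_mask(4)
# Only the final position sees the whole sequence under a causal mask.
print(visible_fraction(mask))  # [0.25, 0.5, 0.75, 1.0]
```

Under this mask only the final position attends to the entire input, which is why causal models are commonly pooled at the EOS token. Next-token prediction alone places no explicit pressure on that position to summarize the whole input, which is the gap the abstract says CoCoA targets.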
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Advanced Graph Neural Networks
