Reconstructing Content via Collaborative Attention to Improve Multimodal Embedding Quality

Jiahan Chen; Da Li; Hengran Zhang; Yinqiong Cai; Lixin Su; Jiafeng Guo; Daiting Shi; Dawei Yin; Keping Bi

arXiv:2603.01471 · cs.IR · March 3, 2026

Open Access

TL;DR

This paper introduces CoCoA, a novel content reconstruction pre-training paradigm that enhances multimodal embedding quality by restructuring attention and encouraging semantic compression, leading to more compact and informative representations.

Contribution

The paper proposes CoCoA, a new pre-training method based on collaborative attention and EOS reconstruction, improving the semantic quality of multimodal embeddings.

Findings

01. CoCoA significantly improves embedding quality on MMEB-V1.

02. Content reconstruction enhances the semantic compression of multimodal models.

03. The approach raises the performance ceiling of existing multimodal embedding models.

Abstract

Multimodal embedding models, rooted in multimodal large language models (MLLMs), have yielded significant performance improvements across diverse tasks such as retrieval and classification. However, most existing approaches rely heavily on large-scale contrastive learning, with limited exploration of how the architectural and training paradigms of MLLMs affect embedding quality. While effective for generation, the causal attention and next-token prediction paradigm of MLLMs does not explicitly encourage the formation of globally compact representations, limiting their effectiveness as multimodal embedding backbones. To address this, we propose CoCoA, a Content reconstruction pre-training paradigm based on Collaborative Attention for multimodal embedding optimization. Specifically, we restructure the attention flow and introduce an EOS-based reconstruction task, encouraging the model to…
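
The abstract's point about causal attention can be illustrated with a small sketch. It is not the CoCoA method, whose details are truncated above. It only shows the standard causal mask, under which the final position (a hypothetical EOS token here) is the only one that attends to the whole sequence, which is why it is a natural place to read out an embedding. The toy attention uses identity projections and random inputs, and all function names are assumptions made for illustration.

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_attention(x, mask):
    """Toy single-head self-attention with identity projections (illustrative only)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = np.where(mask, scores, -np.inf)               # block disallowed positions
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))   # 4 content tokens + 1 hypothetical EOS token
mask = causal_mask(len(tokens))
hidden, weights = masked_attention(tokens, mask)

visible = mask.sum(axis=1)         # number of positions each token can attend to
eos_embedding = hidden[-1]         # last-position readout, a common pooling choice
print(visible)
```

Under this mask, earlier positions never see later content, so any sequence-level summary has to be formed at the last position. The abstract's proposal to restructure the attention flow and add an EOS-based reconstruction task targets this setting.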

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Advanced Graph Neural Networks