Improving Multimodal Contrastive Learning of Sentence Embeddings with Object-Phrase Alignment

Kaiyan Zhao; Zhongtao Miao; Yoshimasa Tsuruoka

arXiv:2508.00332·cs.CL·August 4, 2025

TL;DR

This paper introduces MCSEO, a method that improves multimodal sentence embeddings by adding fine-grained object-phrase alignment to image-caption alignment, yielding better results on semantic textual similarity (STS) tasks.

Contribution

The paper proposes a new approach that incorporates object-phrase alignment into contrastive learning for multimodal sentence embeddings, addressing noise in image-caption pairs.

Findings

01. MCSEO outperforms strong baselines on STS tasks
02. Fine-grained object-phrase alignment improves embedding quality
03. The method enhances multimodal representation learning

Abstract

Multimodal sentence embedding models typically leverage image-caption pairs in addition to textual data during training. However, such pairs often contain noise, including redundant or irrelevant information on either the image or caption side. To mitigate this issue, we propose MCSEO, a method that enhances multimodal sentence embeddings by incorporating fine-grained object-phrase alignment alongside traditional image-caption alignment. Specifically, MCSEO utilizes existing segmentation and object detection models to extract accurate object-phrase pairs, which are then used to optimize a contrastive learning objective tailored to object-phrase correspondence. Experimental results on semantic textual similarity (STS) tasks across different backbone models demonstrate that MCSEO consistently outperforms strong baselines, highlighting the significance of precise object-phrase alignment in…
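
The abstract describes two alignment signals, image-caption and object-phrase, each trained with a contrastive objective. The sketch below illustrates that idea with a generic symmetric InfoNCE loss applied at both levels and summed. It is not the authors' implementation: the function names, the temperature of 0.05, the `weight` parameter, and the additive combination are all assumptions made for illustration, and the inputs are presumed to be precomputed embeddings where row i of each pair of matrices forms a positive pair.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def info_nce(a, b, temperature=0.05):
    """Symmetric InfoNCE loss: row i of `a` is the positive for row i of `b`.

    All other rows in the batch act as in-batch negatives.
    The temperature value is an assumed default, not taken from the paper.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature

    def diagonal_cross_entropy(z):
        # Numerically stable log-softmax over each row.
        z = z - z.max(axis=1, keepdims=True)
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        # The correct "class" for row i is column i.
        return -np.mean(np.diag(log_prob))

    # Average the a-to-b and b-to-a directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


def combined_loss(sentence_emb, image_emb, phrase_emb, object_emb, weight=1.0):
    """Hypothetical combination: coarse image-caption term plus a weighted
    fine-grained object-phrase term."""
    coarse = info_nce(sentence_emb, image_emb)
    fine = info_nce(phrase_emb, object_emb)
    return coarse + weight * fine


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    captions = rng.normal(size=(8, 16))   # stand-in caption embeddings
    images = rng.normal(size=(8, 16))     # stand-in image embeddings
    phrases = rng.normal(size=(12, 16))   # stand-in phrase embeddings
    objects = rng.normal(size=(12, 16))   # stand-in object-region embeddings
    print(combined_loss(captions, images, phrases, objects))
```

In this sketch the object-phrase batch can have a different size from the image-caption batch, since one image may yield several object-phrase pairs. How the pairs are extracted (the segmentation and object detection step) is outside the scope of the example.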

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Generative Adversarial Networks and Image Synthesis