Multimodal Representation Learning Conditioned on Semantic Relations

Yang Qiao; Yuntong Hu; Bowen Zhu; Hasibul Haque; Liang Zhao

arXiv:2508.17497·cs.LG·May 12, 2026

TL;DR

This paper introduces RCML, a framework for multimodal learning that conditions representations on semantic relations, enabling more context-aware embeddings for diverse tasks.

Contribution

It proposes a relation-conditioned approach that treats semantic relations, expressed as natural-language descriptions, as explicit conditions on multimodal embeddings, rather than reusing one relation-agnostic embedding per sample across all relations.

Findings

01. RCML outperforms baselines on retrieval and classification tasks.

02. It effectively models relation-dependent multimodal data.

03. The approach enhances zero-shot and out-of-domain generalization.

Abstract

Multimodal representation learning has been largely driven by contrastive models such as CLIP, which learn a shared embedding space by aligning paired image-text samples. While effective for general-purpose representation learning, such models typically produce a single embedding per sample that is reused across different semantic relations and contexts. However, in many real-world applications, relevance between samples is inherently relation-dependent, with different semantic relations emphasizing different aspects of multimodal data. In this work, we propose Relation-Conditioned Multimodal Learning (RCML), a framework that treats semantic relations as explicit conditions of multimodal representation learning. Rather than producing relation-agnostic embeddings, RCML learns representations conditioned on natural-language relation descriptions, allowing the same sample to be…
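To make the contrast between a single relation-agnostic embedding and a relation-conditioned one concrete, here is a minimal toy sketch. It is not the RCML architecture, which the abstract does not specify. The elementwise gating in `condition`, the hand-written 4-dimensional vectors, and the relation names are all assumptions made for illustration; in the paper the condition comes from a natural-language relation description rather than a fixed gate vector.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def condition(sample_emb, relation_gate):
    """Hypothetical conditioning step: gate each dimension by the relation.

    This elementwise product is an illustrative stand-in, not the
    conditioning mechanism used in the paper.
    """
    return [s * g for s, g in zip(sample_emb, relation_gate)]


# Toy embeddings (assumed): dims 0-1 encode color, dims 2-3 encode shape.
red_mug = [1.0, 0.0, 1.0, 0.0]
red_scarf = [1.0, 0.0, 0.0, 1.0]

# Hand-written gates standing in for encoded relation descriptions (assumed).
relation_gates = {
    "same color": [1.0, 1.0, 0.0, 0.0],
    "same shape": [0.0, 0.0, 1.0, 1.0],
}

# Relation-agnostic: one embedding per sample, one score for every relation.
agnostic_score = cosine(red_mug, red_scarf)

# Relation-conditioned: the same pair is scored separately under each relation.
conditioned_scores = {
    name: cosine(condition(red_mug, gate), condition(red_scarf, gate))
    for name, gate in relation_gates.items()
}

print("agnostic:", agnostic_score)
for name, score in conditioned_scores.items():
    print(name + ":", score)
```

In this toy setup the relation-agnostic score is a single compromise value for the pair, while the conditioned scores separate the relation under which the two items match from the one under which they do not.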
