Multimodal Representation Learning Conditioned on Semantic Relations
Yang Qiao, Yuntong Hu, Bowen Zhu, Hasibul Haque, Liang Zhao

TL;DR
This paper introduces Relation-Conditioned Multimodal Learning (RCML), a framework that conditions multimodal representations on natural-language descriptions of semantic relations instead of producing a single relation-agnostic embedding per sample.
Contribution
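The paper proposes a relation-conditioned approach that treats semantic relations as explicit conditions of multimodal representation learning, rather than reusing one embedding per sample across all relations and contexts, with the aim of improving flexibility and performance.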
Findings
RCML outperforms baselines on retrieval and classification tasks.
RCML effectively models relation-dependent multimodal data.
The approach enhances zero-shot and out-of-domain generalization.
Abstract
Multimodal representation learning has been largely driven by contrastive models such as CLIP, which learn a shared embedding space by aligning paired image-text samples. While effective for general-purpose representation learning, such models typically produce a single embedding per sample that is reused across different semantic relations and contexts. However, in many real-world applications, relevance between samples is inherently relation-dependent, with different semantic relations emphasizing different aspects of multimodal data. In this work, we propose Relation-Conditioned Multimodal Learning (RCML), a framework that treats semantic relations as explicit conditions of multimodal representation learning. Rather than producing relation-agnostic embeddings, RCML learns representations conditioned on natural-language relation descriptions, allowing the same sample to be…
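To make the contrast in the abstract concrete, here is a minimal sketch in plain Python. The function `clip_style_loss` illustrates the well-known symmetric contrastive objective used by CLIP-like models on paired image-text embeddings. The function `condition_on_relation` is a purely hypothetical stand-in (simple elementwise gating) for the idea of relation-conditioned embeddings; it is not RCML's architecture, which the truncated abstract does not specify. All names, the toy vectors, and the temperature value are assumptions for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Pair i is (image_embs[i], text_embs[i]); every other combination in the
    batch acts as a negative. The temperature value is an assumed default.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # Cosine similarities scaled by temperature: rows are images, columns texts.
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(logits_t))


def condition_on_relation(sample_emb, relation_emb):
    """HYPOTHETICAL conditioning: gate the sample embedding elementwise by a
    relation embedding, then renormalize. Not the method from the paper."""
    gated = [s * r for s, r in zip(sample_emb, relation_emb)]
    return l2_normalize(gated)


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    texts_aligned = [[1.0, 0.0], [0.0, 1.0]]
    texts_swapped = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned loss:", clip_style_loss(images, texts_aligned))
    print("swapped loss:", clip_style_loss(images, texts_swapped))

    # One sample, two toy relation vectors: the conditioned embeddings differ,
    # whereas a relation-agnostic model would reuse the same vector for both.
    sample = [1.0, 1.0]
    print(condition_on_relation(sample, [1.0, 0.0]))
    print(condition_on_relation(sample, [0.0, 1.0]))
```

In this toy setup, correctly paired embeddings give a much lower loss than swapped pairs, and the same sample maps to different vectors under the two relation vectors, which is the behavior a relation-agnostic embedding cannot express.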
