Learning Relative Representations for Fine-Grained Multimodal Alignment with Limited Data
Shiwon Kim, Yu Rang Park

TL;DR
This paper introduces a post-hoc multimodal alignment method that learns token-level relations using relative representations, improving performance on fine-grained cross-modal tasks when paired data are limited.
Contribution
The paper proposes a post-hoc alignment approach that models token-level cross-modal structure via learnable anchors. It is reported to outperform existing methods on zero-shot classification, cross-modal retrieval, and zero-shot segmentation.
Findings
Outperforms existing methods in zero-shot classification
Achieves superior cross-modal retrieval results
Enhances zero-shot segmentation accuracy
Abstract
Multimodal pre-training demonstrates strong generalization performance, but this paradigm is often impractical in domains where paired data are scarce. A promising alternative is post-hoc multimodal alignment, which aligns separately pre-trained unimodal encoders using a limited number of paired examples. However, existing methods focus primarily on aligning global representations, missing patch-token relations. This may hinder transfer to tasks that require fine-grained cross-modal matching beyond coarse sample-level semantics. To address this issue, we propose a post-hoc alignment method that learns token-level cross-modal structure using relative representations. Specifically, we represent images and texts through their token-level similarities to a set of learnable anchors in each modality space, which are trained to induce consistent cross-modal similarity patterns for matched…
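The anchor-based representation described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. All names here (relative_representation, token_similarity), the token and anchor dimensions, the anchor count of 64, and the use of cosine similarity are assumptions made for illustration. The anchors are random rather than learned, and the training objective is omitted because the abstract is truncated before it is specified. The sketch also assumes both modalities use the same number of anchors, so that image and text tokens end up in a shared anchor-indexed space.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length; eps guards against division by zero.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def relative_representation(tokens, anchors):
    # tokens: (n_tokens, d), anchors: (k, d), both in the same modality space.
    # Returns (n_tokens, k): cosine similarity of every token to every anchor.
    return l2_normalize(tokens) @ l2_normalize(anchors).T

def token_similarity(image_tokens, text_tokens, image_anchors, text_anchors):
    # Each modality is described relative to its own anchors. With k anchors
    # per modality (an assumption), both outputs live in a k-dimensional space
    # and can be compared even though the encoder dimensions differ.
    rel_img = relative_representation(image_tokens, image_anchors)
    rel_txt = relative_representation(text_tokens, text_anchors)
    # (n_patches, n_text_tokens) matrix of patch-to-token similarities.
    return l2_normalize(rel_img) @ l2_normalize(rel_txt).T

# Hypothetical shapes: 196 image patches (dim 768), 12 text tokens (dim 512).
rng = np.random.default_rng(0)
image_tokens = rng.normal(size=(196, 768))
text_tokens = rng.normal(size=(12, 512))
image_anchors = rng.normal(size=(64, 768))  # random stand-ins for learned anchors
text_anchors = rng.normal(size=(64, 512))

sim = token_similarity(image_tokens, text_tokens, image_anchors, text_anchors)
print(sim.shape)
```

In this sketch the two encoders never need a shared embedding dimension; only the anchor count has to match. In the method as described, the anchors would be trained so that matched image-text pairs produce consistent similarity patterns, a step this example does not attempt.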
