Learning Relative Representations for Fine-Grained Multimodal Alignment with Limited Data

Shiwon Kim; Yu Rang Park

arXiv:2605.16834·cs.CV·May 19, 2026

TL;DR

This paper introduces a post-hoc multimodal alignment method that learns token-level cross-modal relations through relative representations, improving fine-grained cross-modal tasks when paired data are limited.

Contribution

The method models token-level cross-modal structure via learnable anchors in each modality space, and is reported to outperform existing methods on zero-shot classification, cross-modal retrieval, and zero-shot segmentation.

Findings

01 Outperforms existing methods in zero-shot classification

02 Achieves superior cross-modal retrieval results

03 Enhances zero-shot segmentation accuracy

Abstract

Multimodal pre-training demonstrates strong generalization performance, but this paradigm is often impractical in domains where paired data are scarce. A promising alternative is post-hoc multimodal alignment, which aligns separately pre-trained unimodal encoders using a limited number of paired examples. However, existing methods focus primarily on aligning global representations, missing patch-token relations. This may hinder transfer to tasks that require fine-grained cross-modal matching beyond coarse sample-level semantics. To address this issue, we propose a post-hoc alignment method that learns token-level cross-modal structure using relative representations. Specifically, we represent images and texts through their token-level similarities to a set of learnable anchors in each modality space, which are trained to induce consistent cross-modal similarity patterns for matched…
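
The anchor-based representation described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the tensor shapes, the use of cosine similarity, and the randomly initialised anchors are all assumptions made for illustration, and the training objective that would actually learn the anchors is omitted because the abstract is truncated before describing it. The sketch only shows why relative representations make tokens from encoders with different embedding widths comparable: each token is re-expressed as a vector of similarities to K anchors in its own modality space, so both modalities end up in the same K-dimensional space.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to (approximately) unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def relative_representation(tokens, anchors):
    # tokens: (n_tokens, d), anchors: (n_anchors, d) in the SAME modality space.
    # Returns (n_tokens, n_anchors): cosine similarity of every token to every anchor.
    return l2_normalize(tokens) @ l2_normalize(anchors).T

rng = np.random.default_rng(0)

# Hypothetical sizes: 16 image patch tokens, 8 text tokens, 32 anchors per modality.
P, T, K = 16, 8, 32
d_img, d_txt = 64, 48  # the two frozen encoders need not share a width

image_tokens = rng.normal(size=(P, d_img))   # stand-in for frozen image encoder output
text_tokens = rng.normal(size=(T, d_txt))    # stand-in for frozen text encoder output

# Assumption: anchors would be learnable parameters; here they are random placeholders.
image_anchors = rng.normal(size=(K, d_img))
text_anchors = rng.normal(size=(K, d_txt))

rel_img = relative_representation(image_tokens, image_anchors)  # (P, K)
rel_txt = relative_representation(text_tokens, text_anchors)    # (T, K)

# Both modalities now live in the same K-dimensional space, so patch-to-word
# similarities can be computed directly.
token_sim = l2_normalize(rel_img) @ l2_normalize(rel_txt).T     # (P, T)
```

With random anchors the resulting `token_sim` matrix carries no meaningful alignment; in the setting the abstract describes, the anchors would be trained on the limited paired data so that matched image and text pairs produce consistent similarity patterns.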
