Anchors Aweigh! Sail for Optimal Unified Multi-Modal Representations
Minoh Jeong, Zae Myung Kim, Min Namgung, Dongyeop Kang, Yao-Yi Chiang, Alfred Hero

TL;DR
This paper analyzes fixed anchor binding methods in multi-modal learning, identifies their limitations, and proposes an adaptive centroid-based framework, CentroBind, that improves the quality of unified multi-modal representations.
Contribution
It introduces a novel adaptive anchor binding framework, CentroBind, which addresses limitations of fixed anchors by leveraging all modalities for better multi-modal representations.
Findings
CentroBind outperforms fixed anchor methods on synthetic and real datasets.
Theoretical analysis confirms CentroBind captures intra-modal, inter-modal, and alignment properties.
Adaptive anchors lead to more balanced and comprehensive multi-modal representations.
Abstract
A unified representation space in multi-modal learning is essential for effectively integrating diverse data sources, such as text, images, and audio, to enhance efficiency and performance across various downstream tasks. Recent binding methods, such as ImageBind, typically rely on a single, fixed anchor modality for aligning multi-modal data. We mathematically analyze these fixed anchor binding methods and uncover significant limitations: (1) over-reliance on the choice of the anchor modality, (2) inadequate capture of intra-modal information, and (3) failure to account for cross-modal correlation among non-anchored modalities. To address these issues, we propose the need for adaptive anchor binding methods, exemplified by our framework CentroBind. The proposed method uses adaptively adjustable centroid-based anchors generated from all available modalities, leading to a balanced and…
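The centroid-anchor idea described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes one anchor per sample, formed as the plain mean of that sample's embeddings across modalities, and a symmetric InfoNCE term between each modality and the anchor. The function names (`centroid_anchors`, `info_nce`, `centroid_binding_loss`), the temperature value, and the toy batch are all hypothetical.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (the zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def centroid_anchors(embeddings):
    """Average each sample's embeddings across modalities.

    embeddings: dict mapping modality name -> list of N vectors (same dimension).
    Returns N centroid vectors, one anchor per sample (an assumed reading of
    "centroid-based anchors generated from all available modalities").
    """
    modalities = list(embeddings.values())
    n, m = len(modalities[0]), len(modalities)
    return [
        [sum(vals) / m for vals in zip(*(mod[i] for mod in modalities))]
        for i in range(n)
    ]


def info_nce(queries, keys, temperature=0.1):
    """Mean InfoNCE loss; queries[i] and keys[i] are the positive pair."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denom - logits[i]
    return total / len(q)


def centroid_binding_loss(embeddings, temperature=0.1):
    """Sum symmetric InfoNCE terms between every modality and the centroid anchor.

    The symmetric form is an assumption made for this sketch.
    """
    anchors = centroid_anchors(embeddings)
    return sum(
        info_nce(anchors, z, temperature) + info_nce(z, anchors, temperature)
        for z in embeddings.values()
    )


# Toy batch: 3 samples, 3 modalities, 4-dimensional random embeddings.
random.seed(0)
batch = {
    name: [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]
    for name in ("text", "image", "audio")
}
loss = centroid_binding_loss(batch)
print(f"centroid binding loss on random embeddings: {loss:.4f}")
```

A fixed-anchor baseline in the same sketch would replace `anchors` with the embeddings of one chosen modality (for example `embeddings["image"]`), which makes the dependence on that choice explicit.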
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The paper is clearly written and motivated by a relevant problem in multi-modal learning: the bias and inefficiency of fixed-anchor alignment.
2. The proposed centroid-based adaptive anchor idea is simple, easy to implement, and potentially applicable to other multi-modal frameworks.
3. Experiments are conducted on both synthetic and real-world datasets, covering multiple modalities and tasks.
1. The theoretical novelty is relatively weak. (1) The key idea, constructing a centroid anchor from multiple modalities, is a minor variation of existing multi-modal alignment formulations. (2) Similar concepts of adaptive or learned anchors have been discussed in OmniBind [1] and UNIALIGN [2]. (3) The mathematical derivations in Section 3 mostly restate standard InfoNCE lower-bound properties without offering new theoretical insights or proofs that go beyond prior work. The formal results (T
- Clear and Well-Structured: The paper is well-organized, with detailed explanations of the preliminary, intuition, and methodology.
- Superiority in Alignment: The experimental results demonstrate that the proposed method achieves the best performance on the cross-modal retrieval and classification tasks compared to the baselines.
- The paper provides clear intuition but presents the preliminary and methodological sections in an overly complex manner. I suggest that the authors reorganize the presentation flow to enhance readability and logical coherence. From my perspective, it is not necessary to include too many theoretical derivations or formal statements in the main text; these could be moved to the appendix, while keeping the main body focused on the core ideas, motivation, and empirical insights.
- If some modalit
- Theoretical justification for the problem is solid: in standard multimodal contrastive learning, the choice of anchor modality imposes a fixed ceiling.
- The proposed CentroBind method is simple: compute a per-batch centroid and align to it, with no additional architecture, so in principle you can drop it into existing multimodal contrastive setups.
- Assumption that a single centroid per batch is a good proxy for the "true" shared semantics.
- The method is also batch-dependent: the quality and stability of the anchor will depend on what modalities are present and how balanced the batch is.
- Empirical evaluation is very limited, consisting only of results on a synthetic dataset and a limited set of tasks such as sarcasm and speaker classification and DreamBooth (image editing?). AudioSet is the only result comparable to prior baseline paper
- This paper studies a practical problem where the fixed anchor could be limited in some cases.
- A theoretical framework is proposed to support the method.
- The anchor generation strategy in CentroBind, which averages modality centroids, may not be robust when different modalities exhibit varying information densities. Modalities containing more or less discriminative information could disproportionately influence the centroid, potentially leading to biased or unbalanced representations.
- The proposed method relies on high independence of modalities, which is not true in the real world. When modalities are highly correlated or exhibit strong syne
Taxonomy
Topics: Machine Learning and Data Classification · Natural Language Processing Techniques · Neural Networks and Applications
Methods: ALIGN
