OmniBind: Teach to Build Unequal-Scale Modality Interaction for Omni-Bind of All
Yuanhuiyi Lyu, Xu Zheng, Dahun Kim, Lin Wang

TL;DR
OmniBind is a two-stage framework for flexible multi-modal learning. It aligns data-constrained modalities with a well-trained teacher modality, which enables fusion and recognition for any combination of modalities, even when their data scales are unequal.
Contribution
The paper presents OmniBind, a novel framework that allows any combination of modalities to be fused and learned. It addresses the unequal-scale problem and the mismatch between training-time and inference-time modality combinations through cross-modal alignment and adaptive fusion.
Findings
Achieves a 4.05% average performance gain over prior methods on arbitrary modality combinations.
Sets a new state of the art for single-modality recognition, e.g., a 4.34% improvement on touch.
Develops the first dataset combining teacher and student modalities for omni-bind evaluation.
Abstract
Research on multi-modal learning predominantly aligns the modalities in a unified space at training, and only a single one is taken for prediction at inference. However, for a real machine, e.g., a robot, sensors could be added or removed at any time. Thus, it is crucial to enable the machine to tackle the mismatch and unequal-scale problems of modality combinations between training and inference. In this paper, we tackle these problems from a new perspective: "Modalities Help Modalities". Intuitively, we present OmniBind, a novel two-stage learning framework that can achieve any modality combination and interaction. It involves teaching data-constrained (a.k.a. student) modalities to be aligned with the well-trained, data-abundant (a.k.a. teacher) modalities. This subtly enables the adaptive fusion of any modalities to build a unified representation space for any combination.…
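The two ideas named in the abstract, aligning student modalities to a teacher and then adaptively fusing whichever modalities are present, can be illustrated with a minimal sketch. This is not the authors' implementation: the InfoNCE-style contrastive loss, the softmax-weighted fusion, and all function and parameter names (`alignment_loss`, `adaptive_fuse`, `temperature`, `scores`) are assumptions chosen only to make the idea concrete.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def alignment_loss(student_batch, teacher_batch, temperature=0.07):
    """Assumed InfoNCE-style loss: the student embedding at index i should be
    most similar to the teacher embedding at the same index i."""
    s = [l2_normalize(v) for v in student_batch]
    t = [l2_normalize(v) for v in teacher_batch]
    total = 0.0
    for i, s_i in enumerate(s):
        logits = [dot(s_i, t_j) / temperature for t_j in t]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[i]  # cross-entropy with target index i
    return total / len(s)


def adaptive_fuse(embeddings, scores):
    """Assumed fusion rule: softmax the per-modality scores into weights and
    take the weighted sum of the embeddings of the modalities that are present."""
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(embeddings[0])
    fused = [sum(w * e[d] for w, e in zip(weights, embeddings)) for d in range(dim)]
    return fused, weights


# Hypothetical toy data: two samples, 2-D embeddings.
student = [[1.0, 0.0], [0.0, 1.0]]
teacher_matched = [[1.0, 0.0], [0.0, 1.0]]
teacher_shuffled = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = alignment_loss(student, teacher_matched)
loss_shuffled = alignment_loss(student, teacher_shuffled)

# Fuse two available modalities with equal (hypothetical) scores.
fused, weights = adaptive_fuse(student, [0.0, 0.0])
```

In this sketch the loss is lower when student and teacher embeddings are paired correctly than when the pairing is shuffled, and because `adaptive_fuse` accepts a list of any length, the same function covers any subset of modalities available at inference.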
Taxonomy
Topics: Interactive and Immersive Displays · Robotics and Automated Systems · Context-Aware Activity Recognition Systems
Methods: ALIGN
