OmniBind: Teach to Build Unequal-Scale Modality Interaction for Omni-Bind of All

Yuanhuiyi Lyu; Xu Zheng; Dahun Kim; Lin Wang

arXiv:2405.16108 · cs.CV · May 28, 2024 · 2 cites

TL;DR

OmniBind introduces a two-stage framework for flexible multi-modal learning: it aligns diverse modalities with a well-trained teacher, enabling effective fusion and recognition across any combination of modalities, even when their data scales are unequal.

Contribution

The paper presents OmniBind, a novel framework that allows any combination of modalities to be fused and learned, addressing scale and mismatch issues through cross-modal alignment and adaptive fusion.

Findings

01 Achieves 4.05% average performance gain over prior methods on arbitrary modality combinations.

02 Sets new state-of-the-art for single modality recognition, e.g., touch with 4.34% improvement.

03 Develops the first dataset combining teacher and student modalities for omni-bind evaluation.

Abstract

Research on multi-modal learning predominantly aligns the modalities in a unified space at training time, and only a single modality is used for prediction at inference. However, for a real machine, e.g., a robot, sensors could be added or removed at any time. Thus, it is crucial to enable the machine to tackle the mismatch and unequal-scale problems of modality combinations between training and inference. In this paper, we tackle these problems from a new perspective: "Modalities Help Modalities". We present OmniBind, a novel two-stage learning framework that supports any modality combination and interaction. It involves teaching data-constrained (a.k.a. student) modalities to be aligned with the well-trained, data-abundant (a.k.a. teacher) modalities. This enables the adaptive fusion of any modalities to build a unified representation space for any combination.…
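The two ideas named in the abstract, aligning a student modality with a teacher modality and adaptively fusing whichever modalities are present, can be illustrated with a small sketch. This is not the paper's method: the cosine-distance alignment loss, the softmax weighting, and the names `cosine_alignment_loss`, `adaptive_fuse`, and `scores` are all assumptions chosen for illustration, since the truncated abstract does not specify the actual objectives.

```python
import math


def cosine_alignment_loss(student, teacher):
    """Assumed alignment objective: 1 - cosine similarity of two embeddings."""
    dot = sum(s * t for s, t in zip(student, teacher))
    norm_s = math.sqrt(sum(s * s for s in student))
    norm_t = math.sqrt(sum(t * t for t in teacher))
    return 1.0 - dot / (norm_s * norm_t)


def adaptive_fuse(embeddings, scores):
    """Fuse the modalities that are present using softmax weights.

    embeddings: dict mapping modality name -> embedding (list of floats)
    scores: dict mapping modality name -> hypothetical relevance score
    """
    names = list(embeddings)
    if not names:
        raise ValueError("at least one modality is required")
    # Numerically stable softmax over the scores of the available modalities.
    top = max(scores[n] for n in names)
    exps = {n: math.exp(scores[n] - top) for n in names}
    total = sum(exps.values())
    dim = len(embeddings[names[0]])
    fused = [0.0] * dim
    for n in names:
        weight = exps[n] / total
        for i, value in enumerate(embeddings[n]):
            fused[i] += weight * value
    return fused


# A student (e.g., touch) embedding compared against a teacher (e.g., image) embedding.
loss = cosine_alignment_loss([1.0, 0.0], [0.0, 1.0])

# Fusion works for any subset of modalities, including a single one.
fused_pair = adaptive_fuse(
    {"touch": [1.0, 0.0], "image": [0.0, 1.0]},
    {"touch": 0.0, "image": 0.0},
)
fused_single = adaptive_fuse({"touch": [2.0, 4.0]}, {"touch": 3.0})
```

Because the weights are normalized over only the modalities passed in, the same function handles a sensor being added or removed at inference, which is the scenario the abstract motivates.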

Taxonomy

Topics: Interactive and Immersive Displays · Robotics and Automated Systems · Context-Aware Activity Recognition Systems

Methods: ALIGN