OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces
Zehan Wang, Ziang Zhang, Hang Zhang, Luping Liu, Rongjie Huang, Xize Cheng, Hengshuang Zhao, Zhou Zhao

TL;DR
OmniBind introduces a scalable, efficient multimodal representation model that integrates diverse modalities by remapping pre-trained models into a unified space, enabling versatile applications with minimal training data and time.
Contribution
The paper presents OmniBind, a large-scale multimodal model that leverages space remapping and dynamic routing to efficiently combine multiple pre-trained modality-specific models.
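To make the routing idea concrete, here is a minimal sketch of combining one input's embeddings from several pre-trained spaces with input-dependent weights. It is an illustration under stated assumptions, not the authors' implementation: it assumes every space has already been remapped to a shared dimension, and the linear router, the function name `route_and_bind`, and the parameter `router_weights` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def route_and_bind(space_embeddings, router_weights):
    """Combine one item's embeddings from K spaces into a single vector.

    space_embeddings: K vectors of equal dimension D (assumed to be
        already remapped into a shared dimension).
    router_weights: K vectors of dimension D. This is a hypothetical
        linear router: the score for space k is dot(router_weights[k],
        space_embeddings[k]).
    Returns (combined_unit_vector, per_space_weights).
    """
    scores = [
        sum(w * x for w, x in zip(w_k, e_k))
        for w_k, e_k in zip(router_weights, space_embeddings)
    ]
    weights = softmax(scores)  # weights depend on the input, not fixed
    dim = len(space_embeddings[0])
    combined = [
        sum(weights[k] * space_embeddings[k][d] for k in range(len(weights)))
        for d in range(dim)
    ]
    return l2_normalize(combined), weights
```

With an all-zero router every space receives the same weight, which reduces to the fixed-weight averaging that a learned router is meant to improve on.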
Findings
Supports 3D, audio, image, and language inputs.
Achieves high performance with only unpaired unimodal data.
Trains a 30B model in about 3 days on a single GPU cluster.
Abstract
Recently, human-computer interaction with various modalities has shown promising applications, like GPT-4o and Gemini. Given the foundational role of multimodal joint representation in understanding and generation pipelines, high-quality omni joint representations would be a step toward co-processing more diverse multimodal information. In this work, we present OmniBind, large-scale multimodal joint representation models ranging in scale from 7 billion to 30 billion parameters, which support 3D, audio, image, and language inputs. Due to the scarcity of data pairs across all modalities, instead of training large models from scratch, we propose remapping and binding the spaces of various pre-trained specialist models together. This approach enables "scaling up" by indirectly increasing the model parameters and the amount of seen data. To effectively integrate various spaces, we…
Peer Reviews
Decision: ICLR 2025 Poster
1. This paper effectively addresses the challenge of binding multiple pre-trained multimodal spaces through dynamic weight routing. This approach moves beyond traditional fixed-weight methods by introducing a learnable routing mechanism, enhancing cross-modal alignment and reducing knowledge interference. 2. OmniBind's performance is thoroughly evaluated across diverse benchmarks, covering a wide range of modality combinations. This strengthens the claims of improved generalization and versatili
1. While the model uses pseudo-paired data due to a lack of real-world multimodal data pairs, this raises questions about the validity of the results in real-world applications. This reliance on pseudo-pairs could potentially limit the model's robustness in unexpected scenarios. 2. Although the dynamic routing approach is innovative, it may introduce complexity, particularly in practical implementations or deployment, where computational costs and latency might become prohibitive with increasin
1. The manuscript presents a compelling approach to integrating various multimodal representation models into a unified framework, which is an advancement in the field of human-computer interaction. 2. The introduction of dynamic weight assignment and language representation decoupling may bring benefits. 3. The experimental validation provided in the study partially supports the authors’ claims regarding the versatility and effectiveness of the proposed model.
1. As for the proposed Language Representation Decoupling, statistics on the number of such cases can be added to support the need for such improvements. 2. Regarding the construction of the pseudo pair data for training, for audio, the authors use audio from AudioSet and then employ a state-of-the-art audio-text model to retrieve the most similar texts from AudioCaps and Clotho (they also retrieve texts from other datasets, but as the authors mention, there is a gap in the texts across differen
* Originality: This paper presents a novel approach for binding cross-modal pre-trained latent spaces into a different space. Although similar ideas have been published in several prior works, this paper makes some extensions that enable more efficient pseudo item-pair construction and scaling up of the target spaces. * Quality: The effectiveness of the proposed framework is validated by extensive experiments over multiple benchmarks, and the model has achieved even better performance
* Arguably the idea presented in this paper is a natural extension thus incremental to previous work, e.g. the main change from FreeBind would be the pseudo-item generation (from pseudo-embedding pair aggregation to pseudo-item pair retrieval) and learnable weights routing. * One missing ablation (correct me if I missed anything) would be a side-by-side comparison of pseudo item pair retrieval vs. pseudo embedding aggregation (e.g. FreeBind), as the authors claim the proposed approach is more
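The pseudo-pair retrieval that several reviews discuss can be sketched as a nearest-neighbour lookup: embed an unpaired item (for example an audio clip), then take the caption with the highest cosine similarity as its pseudo pair. This is a minimal sketch under assumptions, not the paper's pipeline. The function name `retrieve_pseudo_pairs` is hypothetical, and the embeddings are assumed to come from an existing audio-text model that places both modalities in one space.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def retrieve_pseudo_pairs(query_embs, candidate_embs, candidate_texts):
    """For each query embedding, return (best_text, similarity).

    query_embs: embeddings of unpaired items (e.g. audio clips).
    candidate_embs: embeddings of candidate captions, in the same space
        (assumed to be produced by a pre-trained audio-text model).
    candidate_texts: the captions, aligned with candidate_embs.
    """
    pairs = []
    for query in query_embs:
        sims = [cosine(query, cand) for cand in candidate_embs]
        best = max(range(len(sims)), key=sims.__getitem__)  # top-1 index
        pairs.append((candidate_texts[best], sims[best]))
    return pairs
```

The returned similarity can serve as a filter, for instance to discard pairs below a chosen threshold, although whether and how the authors filter is not stated in the text above.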
Taxonomy
Topics: Web Data Mining and Analysis · Video Analysis and Summarization · Speech and dialogue systems
