EBind: a practical approach to space binding
Jim Broadbent, Felix Cohen, Frederik Hvilshøj, Eric Landau, Eren Sasoglu

TL;DR
EBind presents a simple, data-centric, and efficient method for space binding that enables training high-performance multimodal models on a single GPU within hours, outperforming larger models.
Contribution
The paper introduces EBind, a practical approach that simplifies space binding by using a single encoder per modality and high-quality data, achieving state-of-the-art results efficiently.
Findings
A 1.8B-parameter model outperforms models 4 to 17 times its size.
Curated multimodal datasets significantly improve model performance.
Introduces a new high-quality zero-shot classification benchmark.
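For readers unfamiliar with zero-shot classification in a shared embedding space, the following is a minimal sketch of the general idea, not the paper's evaluation code. It assumes the query (for example an audio clip or point cloud) and the class names have already been embedded into the same space; the function names, the toy vectors, and the labels are all hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_classify(query_emb, class_embs):
    """Pick the class whose embedding is most similar to the query embedding.

    query_emb:  embedding of the item to classify (hypothetical toy vector).
    class_embs: dict mapping label -> embedding of that label's prompt.
    """
    q = l2_normalize(query_emb)
    scores = {}
    for label, emb in class_embs.items():
        c = l2_normalize(emb)
        scores[label] = sum(a * b for a, b in zip(q, c))
    best = max(scores, key=scores.get)
    return best, scores

# Toy 3-dimensional embeddings, invented purely for illustration.
class_embs = {
    "dog": [1.0, 0.1, 0.0],
    "piano": [0.0, 1.0, 0.2],
    "car": [0.1, 0.0, 1.0],
}
query = [0.1, 0.9, 0.3]

label, scores = zero_shot_classify(query, class_embs)
print(label)
```

No class-specific training happens here: classification reduces to a nearest-neighbour lookup over label embeddings, which is why the quality of the bound embedding space determines the benchmark score.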
Abstract
We simplify space binding by focusing on two core components, a single encoder per modality and high-quality data, which enables training state-of-the-art models on a single GPU in a few hours as opposed to multiple days. We present EBind, an Easy, data-centric, and parameter-efficient method to Bind the embedding spaces of multiple contrastive models. We demonstrate that a simple 1.8B-parameter image-text-video-audio-3D model can outperform models 4 to 17x its size. The key to achieving this is a carefully curated dataset of three complementary data sources: i) 6.7M fully-automated multimodal quintuples sourced via SOTA retrieval models, ii) 1M diverse, semi-automated triples annotated by humans as negative, partial, or positive matches, and iii) 3.4M pre-existing captioned data items. We use 13 different evaluations to demonstrate the value of each data source. Due to limitations with…
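The abstract excerpt does not state the training objective. As a rough illustration of how two embedding spaces can be pulled together with paired data, here is a sketch of a symmetric InfoNCE-style contrastive loss, a common choice for contrastive models. The choice of this loss, the temperature of 0.07, and the toy vectors are assumptions for illustration and are not taken from the paper.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax_row(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_info_nce(a_embs, b_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    a_embs[i] and b_embs[i] are assumed to describe the same item in two
    modalities; every other pairing in the batch is treated as a negative.
    The temperature value is an assumed default, not the paper's setting.
    """
    a = [l2_normalize(v) for v in a_embs]
    b = [l2_normalize(v) for v in b_embs]
    n = len(a)
    # Cosine-similarity logits between every a-item and every b-item.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Cross-entropy with the matching index as the target, in both directions.
    loss_a2b = -sum(log_softmax_row(logits[i])[i] for i in range(n)) / n
    logits_t = [list(col) for col in zip(*logits)]
    loss_b2a = -sum(log_softmax_row(logits_t[j])[j] for j in range(n)) / n
    return 0.5 * (loss_a2b + loss_b2a)

# Toy batch: 'aligned' pairs match the anchors, 'shuffled' pairs do not.
anchor = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
shuffled = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = symmetric_info_nce(anchor, aligned)
loss_shuffled = symmetric_info_nce(anchor, shuffled)
print(loss_aligned < loss_shuffled)
```

In a setup with frozen encoders, a loss of this kind would be minimised with respect to a small trainable projection applied to one side only, which is consistent with the small trainable-parameter count the reviewers mention.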
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Rather than merely scaling data and model size, the work explores a **parameter-efficient strategy** to achieve superior performance, which represents a more sustainable research direction.
2. EBind can be trained rapidly while attaining performance competitive with OmniBind, demonstrating **high efficiency without sacrificing accuracy**.
3. The paper also provides **practical applications** that showcase real-world utility.
1. The single-GPU and fast-training advantages are largely attributed to very few trainable parameters and extensive data preprocessing. Could PointBind or OmniBind achieve similar results under the same constraints?
2. The training uses **nearly 2M paired samples**. Is the performance primarily due to data scale rather than data quality? How does the dataset size compare to other methods?
3. Can the trained model be integrated with current generative models or LMMs? If so, please provide exampl
- The proposed model achieves its parameter efficiency through a simple architecture based on frozen, pre-trained encoders for all five modalities. This yields a lightweight architecture that obtains decent results on a single GPU within hours.
- The work introduces a consensus-annotated zero-shot classification benchmark, EShot, for the audio-Point Cloud modality pair, alongside the commitment to open-sourcing the code, model weights, and curated datasets for
1. The model demonstrates low performance in cross-modal Audio-Text retrieval tasks (such as AudioCaps and Clotho) compared to larger competitor models, a gap the authors hypothesize is due to their choice of using the ImageBind audio encoder, which was initially optimized against images rather than text. In that case, why did the binding strategy not alleviate such an issue? Why was this specific, weakly-performing encoder chosen over potentially stronger, text-aligned audio encoders like CLAP
The paper is innovative, designing a new model architecture and filling the gap in audio-point cloud data evaluation. The structure of the paper is clear, with intuitive figures and a detailed appendix, including annotation examples. The experimental quality of the paper is good, but some ablation experiments are missing. The significance of the paper is notable.
1. Since the visual-text model is frozen, the improvement in visual capabilities relative to other models depends entirely on the selection of a new base model. For a fair comparison, it would be best to add an ablation experiment using CLIP-L from ImageBind as the visual-text model. This would not diminish the superiority of EBind in the current setting.
2. As mentioned above, the comparison with Ex-MCR is unfair and incomplete. Ex-MCR provides a version based on CLIP-L and combines i
1. The proposed EBind is somewhat parameter efficient.
2. The three complementary data sources are a good contribution: 6.7M fully-automated multimodal quintuples, 1M diverse semi-automated triples annotated by humans, and 3.4M pre-existing captioned data items.
1. This paper reads more like a "Dataset & Benchmark Track" submission; the reviewer believes the technical contribution is limited.
2. In the experiment section, only a few Bind models are included; more should be compared and discussed: [A] Girdhar R, El-Nouby A, Liu Z, et al. ImageBind: One embedding space to bind them all. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023: 15180-15190. [B] Zhu B, Lin B, Ning M, et al. LanguageBind: Extending Video-La
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Multimodal Machine Learning Applications
