Vela: Scalable Embeddings with Voice Large Language Models for Multimodal Retrieval

Ruofan Hu; Yan Xia; Minjie Hong; Jieming Zhu; Bo Chen; Xiaoda Yang; Minghui Fang; Tao Jin

arXiv:2506.14445·cs.IR·June 18, 2025

Open Access

TL;DR

Vela is a framework that adapts multimodal large language models (MLLMs) to generate universal multimodal embeddings, improving text-audio retrieval and handling long texts and complex retrieval tasks more effectively than traditional CLAP models.

Contribution

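Vela introduces a single-modality training approach, in which the model is trained exclusively on text pairs, and combines it with prompt engineering to adapt MLLMs for multimodal embeddings, surpassing traditional CLAP models on standard text-audio retrieval tasks.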
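To make "trained exclusively on text pairs" concrete, here is a minimal sketch of one common way to train embeddings on paired texts: an in-batch contrastive (InfoNCE) loss. This summary does not state which loss Vela uses, so the loss, the `temperature` value, and the function names `cosine` and `info_nce` are all assumptions for illustration only, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.05):
    """Mean in-batch InfoNCE loss (an assumed objective, not confirmed by the source).

    anchors[i] and positives[i] are embeddings of the two texts in pair i.
    Every other positive in the batch serves as a negative for anchor i.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with the max subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)
```

The loss is small when each anchor is most similar to its own paired text and large when it is closer to another item in the batch, which is the property a text-pair objective of this kind relies on.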

Findings

01. Vela outperforms traditional CLAP models in text-audio retrieval.

02. Vela handles long texts and complex retrieval tasks more robustly than CLAP models.

03. New benchmarks reveal CLAP models' limitations in handling long texts and complex retrieval tasks.

Abstract

Multimodal large language models (MLLMs) have seen substantial progress in recent years. However, their ability to represent multimodal information in the acoustic domain remains underexplored. In this work, we introduce Vela, a novel framework designed to adapt MLLMs for the generation of universal multimodal embeddings. By leveraging MLLMs with specially crafted prompts and selected in-context learning examples, Vela effectively bridges the modality gap across various modalities. We then propose a single-modality training approach, where the model is trained exclusively on text pairs. Our experiments show that Vela outperforms traditional CLAP models in standard text-audio retrieval tasks. Furthermore, we introduce new benchmarks that expose CLAP models' limitations in handling long texts and complex retrieval tasks. In contrast, Vela, by harnessing the capabilities of MLLMs,…
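For readers unfamiliar with how text-audio retrieval is typically scored, the sketch below ranks audio embeddings by cosine similarity to a text query and computes Recall@k. The abstract does not name the metrics used, so Recall@k, the toy embeddings in the usage note, and the function names `rank_candidates` and `recall_at_k` are assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(query, candidates):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda j: scores[j], reverse=True)


def recall_at_k(queries, candidates, gold, k):
    """Fraction of queries whose gold candidate index appears in the top k."""
    hits = 0
    for query, gold_index in zip(queries, gold):
        if gold_index in rank_candidates(query, candidates)[:k]:
            hits += 1
    return hits / len(queries)
```

With hypothetical text embeddings as `queries`, audio embeddings as `candidates`, and `gold[i]` giving the index of the audio clip paired with text `i`, `recall_at_k(queries, candidates, gold, k=1)` reports how often the correct clip is ranked first.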

Taxonomy

Topics: Topic Modeling · Speech and dialogue systems · Natural Language Processing Techniques