Speech-Omni-Lite: Portable Speech Interfaces for Vision-Language Models
Dehua Tao, Xuan Luo, Daxin Tan, Kai Chen, Lanqing Hong, Jing Li, Ruifeng Xu, Xiao Chen

TL;DR
Speech-Omni-Lite is a lightweight, cost-efficient framework that adds speech understanding and generation to pre-trained vision-language models by training two small plug-and-play modules on a frozen backbone. It needs little speech training data, preserves the backbone's vision-language performance, and transfers across backbones.
Contribution
It introduces two plug-and-play modules (a speech projector and a speech token generator) and a low-cost data construction strategy (QTATS) that add speech functions to VL models while the backbone stays fully frozen.
Findings
Achieves high-quality spoken QA with only thousands of hours of speech data.
Maintains vision-language performance while adding speech capabilities.
Modules transfer effectively across different VL backbones.
Abstract
While large-scale omni-models have demonstrated impressive capabilities across various modalities, their strong performance heavily relies on massive multimodal data and incurs substantial computational costs. This work introduces Speech-Omni-Lite, a cost-efficient framework for extending pre-trained Visual-Language (VL) backbones with speech understanding and generation capabilities, while fully preserving the backbones' vision-language performance. Specifically, the VL backbone is kept fully frozen and equipped with two lightweight, trainable plug-and-play modules: a speech projector and a speech token generator. To mitigate the scarcity of spoken QA corpora, a low-cost data construction strategy is proposed to generate Question-Text Answer-Text-Speech (QTATS) data from existing ASR speech-text pairs, facilitating effective speech generation training.…
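The QTATS idea in the abstract can be pictured with a small sketch. The abstract is truncated and does not say how the question text is obtained, so the following is only an illustration under stated assumptions: it assumes each ASR transcript and its audio are reused as the answer text and answer speech, and that a separate text-only step writes a matching question. The names `AsrPair`, `QtatsExample`, `build_qtats`, and `placeholder_question` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class AsrPair:
    """One ASR example: an audio clip and its transcript."""
    audio_path: str
    transcript: str


@dataclass(frozen=True)
class QtatsExample:
    """Question text plus an answer available as both text and speech."""
    question_text: str
    answer_text: str
    answer_audio_path: str


def build_qtats(
    pairs: Iterable[AsrPair],
    write_question: Callable[[str], str],
) -> List[QtatsExample]:
    """Reuse each transcript/audio pair as the answer and attach a question.

    Assumption: the question is produced from the transcript alone, so no
    new speech has to be recorded or synthesized.
    """
    examples = []
    for pair in pairs:
        transcript = pair.transcript.strip()
        if not transcript:
            continue  # skip clips with no usable transcript
        examples.append(
            QtatsExample(
                question_text=write_question(transcript),
                answer_text=transcript,
                answer_audio_path=pair.audio_path,
            )
        )
    return examples


def placeholder_question(transcript: str) -> str:
    """Stand-in for a text-only question generator (hypothetical)."""
    return f"Please say the following sentence aloud: {transcript}"
```

In practice the placeholder would presumably be replaced by a stronger text generator; the sketch only shows that the answer speech comes for free from existing ASR data under this assumption.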
Taxonomy
Topics: Speech Recognition and Synthesis · Multimodal Machine Learning Applications · Topic Modeling
