AZeroS: Extending LLM to Speech with Self-Generated Instruction-Free Tuning
Yiwen Shao, Wei Liu, Jiahong Li, Tianzi Wang, Kun Wei, Meng Yu, Dong Yu

TL;DR
This paper introduces AZeroS, a speech-LLM trained with a novel Self-Generated Instruction-Free Tuning paradigm, enabling better generalization to unseen tasks without task-specific data collection.
Contribution
The paper proposes SIFT, a new training paradigm for speech-LLMs that eliminates the need for task-specific data, and introduces AZeroS, a model leveraging this paradigm with minimal training cost.
Findings
AZeroS achieves state-of-the-art results on multiple benchmarks.
SIFT improves generalization to unseen speech tasks.
AZeroS reaches this level of performance at minimal training cost.
Abstract
Extending large language models (LLMs) to the speech domain has recently gained significant attention. A typical approach connects a pretrained LLM with an audio encoder through a projection module and trains the resulting model on large-scale, task-specific instruction-tuning datasets. However, curating such instruction-tuning data for specific requirements is time-consuming, and models trained in this manner often generalize poorly to unseen tasks. In this work, we first formulate that the strongest generalization of a speech-LLM is achieved when it is trained with Self-Generated Instruction-Free Tuning (SIFT), in which supervision signals are generated by a frozen LLM using textual representations of speech as input. Our proposed SIFT paradigm eliminates the need for collecting task-specific question-answer pairs and yields the theoretically best generalization to unseen tasks.…
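The data-construction step that the abstract describes can be illustrated with a minimal sketch. It assumes the "textual representation" of an utterance is simply its transcript (the abstract does not specify this), and every name in it (`SiftExample`, `build_sift_examples`, `toy_frozen_llm`) is hypothetical, not taken from the authors' code. The point it shows is that each training target comes from the frozen LLM reading the text form of the speech, so no task-specific question-answer pairs are collected.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SiftExample:
    """One training pair (hypothetical structure, for illustration only)."""
    audio_path: str   # what the speech-LLM receives; no text instruction is attached
    target_text: str  # response the frozen LLM produced from the text form


def build_sift_examples(
    pairs: List[Tuple[str, str]],
    frozen_llm: Callable[[str], str],
) -> List[SiftExample]:
    """Generate supervision with a frozen LLM.

    Each input pair is (audio_path, text_repr). The frozen LLM sees only
    text_repr, and its output becomes the target for the audio input.
    """
    examples = []
    for audio_path, text_repr in pairs:
        target = frozen_llm(text_repr)
        examples.append(SiftExample(audio_path=audio_path, target_text=target))
    return examples


def toy_frozen_llm(text: str) -> str:
    # Stand-in for a real frozen LLM (assumption): it only echoes the input
    # so that the sketch runs without any model or network access.
    return f"You said: {text}"


if __name__ == "__main__":
    data = [("utt_001.wav", "what is the capital of France")]
    for example in build_sift_examples(data, toy_frozen_llm):
        print(example)
```

In a real setup the stand-in would be replaced by a call to the same pretrained LLM that the speech-LLM is built on, and the resulting pairs would be used to train the model to produce `target_text` from the audio alone.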
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
