GRASS: Unified Generation Model for Speech-to-Semantic Tasks
Aobo Xia, Shuyu Lei, Yushu Yang, Xiang Guo, Hua Chai

TL;DR
This paper introduces GRASS, a unified end-to-end model for speech-to-semantic tasks that leverages instruction fine-tuning and large-scale pre-training, achieving state-of-the-art results across multiple benchmarks.
Contribution
The paper presents a unified framework for speech-to-semantic tasks that combines instruction fine-tuning with pre-training on TTS-generated instruction-speech pairs, and reports competitive zero-shot and few-shot performance.
Findings
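Achieves SOTA results after fine-tuning on benchmarks covering speech named entity recognition, speech sentiment analysis, speech question answering, and more.
Performs competitively in zero-shot and few-shot scenarios.
Releases the instruction dataset and code to support future research.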
Abstract
This paper explores the instruction fine-tuning technique for speech-to-semantic tasks by introducing a unified end-to-end (E2E) framework that generates target text conditioned on a task-related prompt for audio data. We pre-train the model using large and diverse data, where instruction-speech pairs are constructed via a text-to-speech (TTS) system. Extensive experiments demonstrate that our proposed model achieves state-of-the-art (SOTA) results on many benchmarks covering speech named entity recognition, speech sentiment analysis, speech question answering, and more, after fine-tuning. Furthermore, the proposed model achieves competitive performance in zero-shot and few-shot scenarios. To facilitate future work on instruction fine-tuning for speech-to-semantic tasks, we release our instruction dataset and code.
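To make the "target text conditioned on a task-related prompt for audio data" idea concrete, here is a minimal sketch of how one instruction-style training example might be assembled. It is not the authors' released code: the prompt wordings, the `InstructionExample` fields, the `build_example` and `to_model_io` helpers, and the dictionary keys are all hypothetical, chosen only to illustrate the (audio, prompt) -> target text format described in the abstract.

```python
from dataclasses import dataclass

# Hypothetical prompt templates; the paper's actual instruction wording
# is not given in the abstract.
PROMPTS = {
    "ner": "Extract the named entities mentioned in the audio.",
    "sentiment": "Classify the sentiment expressed in the audio.",
    "qa": "Answer the question based on the audio: {question}",
}


@dataclass
class InstructionExample:
    audio_path: str  # recorded or TTS-synthesized speech (assumed to be a file path)
    prompt: str      # task-related instruction the model is conditioned on
    target: str      # text the model is trained to generate


def build_example(task: str, audio_path: str, target: str, **slots: str) -> InstructionExample:
    """Pair an audio clip with a task prompt and its target text."""
    if task not in PROMPTS:
        raise KeyError(f"unknown task: {task}")
    prompt = PROMPTS[task].format(**slots)
    return InstructionExample(audio_path=audio_path, prompt=prompt, target=target)


def to_model_io(example: InstructionExample) -> dict:
    """Lay out one example as encoder input, prompt, and generation labels.

    The key names are illustrative assumptions, not a real library's schema.
    """
    return {
        "audio": example.audio_path,
        "prompt": example.prompt,
        "labels": example.target,
    }


example = build_example(
    "qa", "clip_001.wav", "Paris", question="Where was the meeting held?"
)
model_io = to_model_io(example)
```

Under this layout, every task shares one generation interface and only the prompt changes between tasks, which is what lets a single model be evaluated on unseen prompts in the zero-shot setting.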
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques
