GRASS: Unified Generation Model for Speech-to-Semantic Tasks

Aobo Xia; Shuyu Lei; Yushu Yang; Xiang Guo; Hua Chai

arXiv:2309.02780 · cs.CL · September 12, 2023

TL;DR

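This paper introduces GRASS, a unified end-to-end model for speech-to-semantic tasks that generates target text from audio conditioned on a task-related prompt. Pre-trained on large, diverse data with instruction-speech pairs built via text-to-speech, it reaches state-of-the-art results on many benchmarks after fine-tuning.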

Contribution

The paper presents a unified generation framework for speech-to-semantic tasks based on instruction fine-tuning, pre-trained on instruction-speech pairs constructed with a TTS system, and reports competitive zero-shot and few-shot performance.

Findings

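01 Achieves SOTA results, after fine-tuning, on many benchmarks covering speech named entity recognition, speech sentiment analysis, speech question answering, and more.

02 Performs competitively in zero-shot and few-shot scenarios.

03 Releases the instruction dataset and code for future research.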

Abstract

This paper explores the instruction fine-tuning technique for speech-to-semantic tasks by introducing a unified end-to-end (E2E) framework that generates target text conditioned on a task-related prompt for audio data. We pre-train the model using large and diverse data, where instruction-speech pairs are constructed via a text-to-speech (TTS) system. Extensive experiments demonstrate that our proposed model achieves state-of-the-art (SOTA) results on many benchmarks covering speech named entity recognition, speech sentiment analysis, speech question answering, and more, after fine-tuning. Furthermore, the proposed model achieves competitive performance in zero-shot and few-shot scenarios. To facilitate future work on instruction fine-tuning for speech-to-semantic tasks, we release our instruction dataset and code.
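
The abstract describes one model that handles several tasks by conditioning generation on a task-related prompt. The sketch below illustrates how such training triples (prompt, audio, target text) might be assembled. It is not taken from the paper's released code: the prompt wording, the `PROMPTS` table, the `Example` type, the `build_example` helper, and the file names are all hypothetical, and the prompt is assumed to be plain text.

```python
from dataclasses import dataclass

# Hypothetical prompt templates; the paper's actual instruction wording
# is not given in the abstract.
PROMPTS = {
    "ner": "Recognize the named entities in the audio.",
    "sentiment": "Classify the sentiment of the audio.",
    "qa": "Answer the question based on the audio: {question}",
}


@dataclass(frozen=True)
class Example:
    prompt: str      # task-related instruction (assumed to be text)
    audio_path: str  # speech input; placeholder path, not a real file
    target: str      # text the model is trained to generate


def build_example(task: str, audio_path: str, target: str, **fields: str) -> Example:
    """Format one (prompt, audio, target) triple for the given task."""
    if task not in PROMPTS:
        raise KeyError(f"unknown task: {task}")
    # Templates without placeholders are returned unchanged by format().
    return Example(PROMPTS[task].format(**fields), audio_path, target)


# Different tasks share one example format, so a single generation model
# can be trained on all of them.
examples = [
    build_example("sentiment", "clip_001.wav", "positive"),
    build_example("qa", "clip_002.wav", "Paris", question="Which city is mentioned?"),
]
```

Because every task reduces to the same triple, adding a task in this sketch only means adding a template; the model interface stays the same.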

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques