UniverSLU: Universal Spoken Language Understanding for Diverse Tasks with Natural Language Instructions
Siddhant Arora, Hayato Futami, Jee-weon Jung, Yifan Peng, Roshan Sharma, Yosuke Kashiwagi, Emiru Tsunoo, Karen Livescu, Shinji Watanabe

TL;DR
UniverSLU is a unified speech understanding model that uses natural language instructions to perform multiple tasks across various datasets and languages, often surpassing task-specific models.
Contribution
The paper introduces UniverSLU, a multi-task speech understanding model that leverages instruction tuning and natural language prompts to handle diverse SLU tasks in a single framework.
Findings
Achieves competitive or superior performance on 12 speech tasks across 17 datasets.
Generalizes well to new datasets and languages in zero-shot settings.
Effectively uses natural language instructions for task specification.
Abstract
Recent studies leverage large language models with multi-tasking capabilities, using natural language prompts to guide the model's behavior and surpassing performance of task-specific models. Motivated by this, we ask: can we build a single model that jointly performs various spoken language understanding (SLU) tasks? We start by adapting a pre-trained automatic speech recognition model to additional tasks using single-token task specifiers. We enhance this approach through instruction tuning, i.e., finetuning by describing the task using natural language instructions followed by the list of label options. Our approach can generalize to new task descriptions for the seen tasks during inference, thereby enhancing its user-friendliness. We demonstrate the efficacy of our single multi-task learning model "UniverSLU" for 12 speech classification and sequence generation task types spanning…
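The two task-specification styles described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `<er>` token, the "Options:" separator, and the example wording are all hypothetical, chosen only to contrast a single-token specifier with a natural language instruction followed by the list of label options.

```python
from typing import Sequence


def single_token_prompt(task_token: str) -> str:
    """Specify the task with one special token (the token itself is hypothetical)."""
    return task_token


def instruction_prompt(description: str, label_options: Sequence[str]) -> str:
    """Describe the task in natural language, then append the label options.

    The "Options:" separator is an assumption made for illustration.
    """
    description = description.strip()
    if not label_options:
        # Sequence generation tasks may have no fixed label set.
        return description
    return f"{description} Options: {', '.join(label_options)}"


if __name__ == "__main__":
    # Hypothetical emotion recognition task, specified both ways.
    print(single_token_prompt("<er>"))
    print(
        instruction_prompt(
            "Classify the emotion of the speaker.",
            ["happy", "sad", "angry", "neutral"],
        )
    )
```

Because the instruction is free text, a reworded description of a seen task can be passed through the same function at inference time, which is the flexibility the abstract attributes to instruction tuning.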
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis · Speech and dialogue systems
