On the Evaluation of Speech Foundation Models for Spoken Language Understanding

Siddhant Arora; Ankita Pasad; Chung-Ming Chien; Jionghao Han; Roshan Sharma; Jee-weon Jung; Hira Dhamyal; William Chen; Suwon Shon; Hung-yi Lee; Karen Livescu; Shinji Watanabe

arXiv:2406.10083·cs.CL·June 17, 2024

TL;DR

This paper evaluates various speech foundation models on complex spoken language understanding tasks, finding that self-supervised models often perform as well as or better than supervised ones, and compares different strategies for integrating these models.

Contribution

It provides a comprehensive evaluation of multiple speech foundation models (SFMs) and integration methods on SLU tasks, and introduces the SLUE-PERB benchmark and toolkit.

Findings

1. Self-supervised SFMs perform comparably to or better than supervised SFMs.
2. Complex prediction heads generally yield better performance.
3. No single approach is best for all tasks; trade-offs exist.

Abstract

The Spoken Language Understanding Evaluation (SLUE) suite of benchmark tasks was recently introduced to address the need for open resources and benchmarking of complex spoken language understanding (SLU) tasks, including both classification and sequence generation tasks, on natural speech. The benchmark has demonstrated preliminary success in using pre-trained speech foundation models (SFMs) for these SLU tasks. However, the community still lacks a fine-grained understanding of the comparative utility of different SFMs. Inspired by this, we ask: which SFMs offer the most benefits for these complex SLU tasks, and what is the most effective approach for incorporating these SFMs? To answer this, we perform an extensive evaluation of multiple supervised and self-supervised SFMs using several evaluation protocols: (i) frozen SFMs with a lightweight prediction head, (ii) frozen SFMs with a…
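Protocol (i) above, a frozen SFM with a lightweight prediction head, can be illustrated with the minimal pure-Python sketch below. It is not the paper's or the SLUE-PERB toolkit's implementation. `frozen_sfm` is a hypothetical toy stand-in for a real pre-trained encoder, and the head design (a learnable softmax-weighted sum over encoder layers, mean pooling over frames, and a linear classifier trained with cross-entropy) is an assumed, commonly used lightweight head. The point it shows is that only the head's parameters receive gradient updates, while the encoder is never modified.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def frozen_sfm(waveform: List[float], num_layers: int = 3, frame_len: int = 160) -> List[Matrix]:
    """Hypothetical toy stand-in for a frozen SFM encoder (assumption, not a real model).

    Returns per-layer frame features with shape [num_layers][num_frames][2].
    It has no trainable parameters, so nothing here is ever updated.
    """
    frames = [waveform[i:i + frame_len] for i in range(0, len(waveform), frame_len)]
    return [
        [[(l + 1) * sum(f) / len(f), (max(f) - min(f)) / (l + 1)] for f in frames]
        for l in range(num_layers)
    ]


def softmax(z: List[float]) -> List[float]:
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


class LightweightHead:
    """Assumed lightweight head: softmax layer weights + mean pooling + linear classifier."""

    def __init__(self, num_layers: int, dim: int, num_classes: int, seed: int = 0):
        rng = random.Random(seed)
        self.z = [0.0] * num_layers  # layer-weight logits (uniform weights at init)
        self.W = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_classes)]
        self.b = [0.0] * num_classes

    @staticmethod
    def _layer_means(feats: List[Matrix]) -> Matrix:
        # Mean-pool each layer over frames: result shape [num_layers][dim].
        return [[sum(col) / len(layer) for col in zip(*layer)] for layer in feats]

    def forward(self, feats: List[Matrix]):
        h = self._layer_means(feats)
        w = softmax(self.z)
        # Weighted sum over layers of the pooled features.
        pooled = [sum(w[l] * h[l][d] for l in range(len(h))) for d in range(len(h[0]))]
        logits = [sum(a * x for a, x in zip(row, pooled)) + bc for row, bc in zip(self.W, self.b)]
        return h, w, pooled, softmax(logits)

    def sgd_step(self, feats: List[Matrix], label: int, lr: float = 0.1) -> float:
        """One cross-entropy SGD step on the head only; returns the loss before the step."""
        h, w, pooled, p = self.forward(feats)
        loss = -math.log(p[label])
        g_logits = [pc - (1.0 if c == label else 0.0) for c, pc in enumerate(p)]
        # Backpropagate into the pooled features, then into layer weights and their logits.
        g_pooled = [sum(g_logits[c] * self.W[c][d] for c in range(len(p))) for d in range(len(pooled))]
        g_w = [sum(g_pooled[d] * h[l][d] for d in range(len(pooled))) for l in range(len(h))]
        avg = sum(wl * gl for wl, gl in zip(w, g_w))
        g_z = [w[l] * (g_w[l] - avg) for l in range(len(w))]  # softmax Jacobian
        for c in range(len(p)):
            for d in range(len(pooled)):
                self.W[c][d] -= lr * g_logits[c] * pooled[d]
            self.b[c] -= lr * g_logits[c]
        self.z = [zl - lr * gz for zl, gz in zip(self.z, g_z)]
        return loss


wave = [math.sin(0.01 * t) for t in range(800)]
feats = frozen_sfm(wave)
head = LightweightHead(num_layers=len(feats), dim=len(feats[0][0]), num_classes=3)
losses = [head.sgd_step(feats, label=1) for _ in range(50)]
```

Because the encoder's features are computed once and never change, a setup like this is cheap to train; heavier heads (or unfreezing the encoder) trade extra compute for more capacity, which is the kind of trade-off the findings refer to.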

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems