Rethinking End-to-End Evaluation of Decomposable Tasks: A Case Study on Spoken Language Understanding

Siddhant Arora; Alissa Ostapenko; Vijay Viswanathan; Siddharth Dalmia; Florian Metze; Shinji Watanabe; Alan W Black

arXiv:2106.15065·cs.CL·June 30, 2021

TL;DR

This paper introduces a framework for creating robust, sub-task-specific test sets for decomposable tasks like spoken language understanding, enabling more accurate evaluation of end-to-end models by revealing performance differences hidden in traditional benchmarks.

Contribution

The authors propose a novel method to construct targeted test sets for each sub-task in decomposable tasks, improving the evaluation of end-to-end models and revealing performance gaps.

Findings

01. Generated new test splits for spoken language understanding datasets.

02. Identified up to 10% performance gaps between models.

03. Provided tools and datasets for community use.

Abstract

Decomposable tasks are complex and comprise a hierarchy of sub-tasks. Spoken intent prediction, for example, combines automatic speech recognition and natural language understanding. Existing benchmarks, however, typically hold out examples for only the surface-level sub-task. As a result, models with similar performance on these benchmarks may have unobserved performance differences on the other sub-tasks. To allow insightful comparisons between competitive end-to-end architectures, we propose a framework to construct robust test sets using coordinate ascent over sub-task specific utility functions. Given a dataset for a decomposable task, our method optimally creates a test set for each sub-task to individually assess sub-components of the end-to-end model. Using spoken language understanding as a case study, we generate new splits for the Fluent Speech Commands and Snips…
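
The abstract does not spell out the utility functions or the exact optimization procedure, so the following is only a minimal sketch of what coordinate ascent over a train/test assignment could look like. Everything named here is an assumption made for illustration: `coordinate_ascent_split`, `toy_utility`, the `UTTERANCES` data, and `TARGET_TEST_SIZE` are hypothetical and are not taken from the paper. The toy utility rewards test utterances that contain words unseen in training (a stand-in for an ASR-oriented sub-task) and penalizes deviation from a target test-set size.

```python
from typing import Callable, Dict, List

# True means the utterance is assigned to the test set.
Assignment = Dict[str, bool]

# Hypothetical toy data: utterance id -> transcript.
UTTERANCES: Dict[str, str] = {
    "u1": "turn on the lights",
    "u2": "turn off the lights",
    "u3": "play some jazz",
    "u4": "play some rock",
    "u5": "set the alarm",
}
TARGET_TEST_SIZE = 2  # assumed target, not from the paper


def toy_utility(assignment: Assignment) -> float:
    """Hypothetical utility: unseen test words minus a size penalty."""
    train_words = {
        word
        for utt, in_test in assignment.items()
        if not in_test
        for word in UTTERANCES[utt].split()
    }
    test_ids = [utt for utt, in_test in assignment.items() if in_test]
    unseen = sum(
        1
        for utt in test_ids
        for word in set(UTTERANCES[utt].split())
        if word not in train_words
    )
    size_penalty = abs(len(test_ids) - TARGET_TEST_SIZE)
    return unseen - 2.0 * size_penalty


def coordinate_ascent_split(
    items: List[str],
    utility: Callable[[Assignment], float],
    max_sweeps: int = 100,
) -> Assignment:
    """Treat each item's train/test flag as one coordinate.

    In every sweep, flip one coordinate at a time and keep the flip only
    if the utility strictly improves. Stop when a full sweep changes
    nothing, which means the assignment is a local optimum.
    """
    assignment: Assignment = {item: False for item in items}
    best = utility(assignment)
    for _ in range(max_sweeps):
        improved = False
        for item in items:
            assignment[item] = not assignment[item]
            score = utility(assignment)
            if score > best:
                best = score
                improved = True
            else:
                # Undo the flip because it did not help.
                assignment[item] = not assignment[item]
        if not improved:
            break
    return assignment


split = coordinate_ascent_split(list(UTTERANCES), toy_utility)
test_set = sorted(utt for utt, in_test in split.items() if in_test)
print("test utterances:", test_set)
```

In this sketch a separate utility function would be supplied for each sub-task, yielding one test set per sub-task. Coordinate ascent only guarantees a local optimum, so the result depends on the starting assignment and the order in which items are visited.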