Rethinking End-to-End Evaluation of Decomposable Tasks: A Case Study on Spoken Language Understanding
Siddhant Arora, Alissa Ostapenko, Vijay Viswanathan, Siddharth Dalmia, Florian Metze, Shinji Watanabe, Alan W Black

TL;DR
This paper introduces a framework for creating robust, sub-task-specific test sets for decomposable tasks like spoken language understanding, enabling more accurate evaluation of end-to-end models by revealing performance differences hidden in traditional benchmarks.
Contribution
The authors propose a novel method to construct targeted test sets for each sub-task in decomposable tasks, improving the evaluation of end-to-end models and revealing performance gaps.
Findings
Generated new test splits for spoken language understanding datasets.
Identified performance gaps of up to 10% between models that the original benchmarks did not reveal.
Provided tools and datasets for community use.
Abstract
Decomposable tasks are complex and comprise a hierarchy of sub-tasks. Spoken intent prediction, for example, combines automatic speech recognition and natural language understanding. Existing benchmarks, however, typically hold out examples for only the surface-level sub-task. As a result, models with similar performance on these benchmarks may have unobserved performance differences on the other sub-tasks. To allow insightful comparisons between competitive end-to-end architectures, we propose a framework to construct robust test sets using coordinate ascent over sub-task specific utility functions. Given a dataset for a decomposable task, our method optimally creates a test set for each sub-task to individually assess sub-components of the end-to-end model. Using spoken language understanding as a case study, we generate new splits for the Fluent Speech Commands and Snips…
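The idea of coordinate ascent over sub-task utilities can be illustrated with a minimal sketch. This is not the authors' implementation: the `(speaker, utterance)` data layout, the toy `utility` function, the `target` fraction, and every function name below are assumptions made for illustration only. The sketch alternates between two coordinates, the set of held-out speakers (probing the speech recognition side) and the set of held-out utterances (probing the language understanding side), improving one while the other stays fixed.

```python
from itertools import combinations


def split(examples, held_speakers, held_utts):
    """Partition (speaker, utterance) pairs into train and two sub-task test sets."""
    train, spk_test, utt_test = [], [], []
    for spk, utt in examples:
        if spk in held_speakers and utt not in held_utts:
            spk_test.append((spk, utt))   # unseen speaker, seen utterance
        elif utt in held_utts and spk not in held_speakers:
            utt_test.append((spk, utt))   # unseen utterance, seen speaker
        elif spk not in held_speakers and utt not in held_utts:
            train.append((spk, utt))
        # examples that are unseen on both axes are dropped in this toy setup
    return train, spk_test, utt_test


def utility(examples, held_speakers, held_utts, target=0.2):
    """Toy utility (an assumption): each test set should hold about `target` of the data."""
    _, spk_test, utt_test = split(examples, held_speakers, held_utts)
    n = len(examples)
    return -abs(len(spk_test) / n - target) - abs(len(utt_test) / n - target)


def coordinate_ascent(examples, k_spk=1, k_utt=1, max_rounds=10):
    """Alternately re-pick held-out speakers and held-out utterances."""
    speakers = sorted({s for s, _ in examples})
    utts = sorted({u for _, u in examples})
    held_spk = frozenset(speakers[:k_spk])
    held_utt = frozenset(utts[:k_utt])
    best = utility(examples, held_spk, held_utt)
    for _ in range(max_rounds):
        improved = False
        # Coordinate 1: vary held-out speakers, keep held-out utterances fixed.
        for cand in combinations(speakers, k_spk):
            score = utility(examples, frozenset(cand), held_utt)
            if score > best:
                best, held_spk, improved = score, frozenset(cand), True
        # Coordinate 2: vary held-out utterances, keep held-out speakers fixed.
        for cand in combinations(utts, k_utt):
            score = utility(examples, held_spk, frozenset(cand))
            if score > best:
                best, held_utt, improved = score, frozenset(cand), True
        if not improved:
            break  # no single-coordinate change helps: a local optimum
    return held_spk, held_utt, best
```

Because each update is accepted only when it strictly increases the utility, the loop terminates at a local optimum of the toy objective. A real utility would encode whatever makes a sub-task test set informative, which this sketch does not attempt to reproduce.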
