WESR: Scaling and Evaluating Word-level Event-Speech Recognition
Chenchen Yang, Kexin Huang, Liwei Fan, Qian Tu, Botian Jiang, Dong Zhang, Linqi Yin, Shimin Li, Zhaoye Fei, Qinyuan Cheng, and Xipeng Qiu

TL;DR
This paper introduces WESR, a comprehensive framework for localizing and evaluating non-verbal vocal events in speech, addressing previous limitations with a refined taxonomy, a new benchmark, and specialized models.
Contribution
The paper defines a refined taxonomy of 21 vocal events, builds WESR-Bench for evaluation, and trains models that outperform existing solutions at localizing events within speech.
Findings
WESR-Bench enables precise localization of vocal events.
Specialized models surpass open-source models and commercial APIs.
Refined taxonomy improves event categorization and detection.
Abstract
Speech conveys not only linguistic information but also rich non-verbal vocal events such as laughing and crying. While semantic transcription is well-studied, the precise localization of non-verbal events remains a critical yet under-explored challenge. Current methods suffer from insufficient task definitions with limited category coverage and ambiguous temporal granularity. They also lack standardized evaluation frameworks, hindering the development of downstream applications. To bridge this gap, we first develop a refined taxonomy of 21 vocal events, with a new categorization into discrete (standalone) versus continuous (mixed with speech) types. Based on the refined taxonomy, we introduce WESR-Bench, an expert-annotated evaluation set (900+ utterances) with a novel position-aware protocol that disentangles ASR errors from event detection, enabling precise localization measurement…
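To make the discrete versus continuous distinction and the idea of position-aware scoring concrete, here is a minimal sketch in Python. It is an illustration only, not the WESR-Bench protocol: the tag syntax (`[laugh]` for a standalone event, `<laughing>...</laughing>` for an event overlapping speech), the function names `parse` and `match_events`, and the word-index tolerance `tol` are all assumptions introduced here. The sketch separates the plain word sequence (which could be scored for ASR errors) from the events, and compares events by label and word position rather than by the surrounding word identities.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical tag format (an assumption, not taken from the paper):
#   [laugh]                    discrete event standing alone between words
#   <laughing>...</laughing>   continuous event overlapping the enclosed words
TOKEN = re.compile(r"\[(\w+)\]|<(/?)(\w+)>|([^\s\[\]<>]+)")

# (kind, label, start_word_index, end_word_index); end is exclusive
Event = Tuple[str, str, int, int]


def parse(tagged: str) -> Tuple[List[str], List[Event]]:
    """Split a tagged transcript into plain words and positioned events."""
    words: List[str] = []
    events: List[Event] = []
    open_spans: Dict[str, int] = {}  # label -> word index where the span opened
    for m in TOKEN.finditer(tagged):
        discrete, closing, span_label, word = m.groups()
        if discrete:
            # A discrete event sits between words: start == end.
            events.append(("discrete", discrete, len(words), len(words)))
        elif span_label:
            if closing:
                start = open_spans.pop(span_label, None)
                if start is not None:  # ignore a close tag with no open tag
                    events.append(("continuous", span_label, start, len(words)))
            else:
                open_spans[span_label] = len(words)
        elif word:
            words.append(word)
    return words, events


def match_events(ref: List[Event], hyp: List[Event], tol: int = 1) -> int:
    """Greedy one-to-one matching on (kind, label) and start position.

    A hypothesis event counts as a hit when its start word index is within
    `tol` words of the reference event. Word identities are never compared,
    so a misrecognized word does not by itself cause an event miss.
    """
    used = set()
    hits = 0
    for kind, label, start, _ in ref:
        for j, (h_kind, h_label, h_start, _) in enumerate(hyp):
            if j in used:
                continue
            if (h_kind, h_label) == (kind, label) and abs(h_start - start) <= tol:
                used.add(j)
                hits += 1
                break
    return hits


reference = "hello [laugh] that is <laughing>so funny</laughing>"
hypothesis = "hallo that [laugh] is <laughing>so funny</laughing>"
ref_words, ref_events = parse(reference)
hyp_words, hyp_events = parse(hypothesis)
hits = match_events(ref_events, hyp_events, tol=1)
```

In this toy case the hypothesis misrecognizes the first word and places the laugh one word late, yet both events still match under a one-word tolerance. Precision and recall would follow from `hits`, `len(hyp_events)`, and `len(ref_events)`. The actual matching rule and tolerance used by the benchmark may differ.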
Taxonomy
Topics: Music and Audio Processing · Emotion and Mood Recognition · Speech Recognition and Synthesis
