WESR: Scaling and Evaluating Word-level Event-Speech Recognition

Chenchen Yang; Kexin Huang; Liwei Fan; Qian Tu; Botian Jiang; Dong Zhang; Linqi Yin; Shimin Li; Zhaoye Fei; Qinyuan Cheng; and Xipeng Qiu

arXiv:2601.04508 · cs.CL · January 9, 2026

TL;DR

This paper introduces WESR, a comprehensive framework for localizing and evaluating non-verbal vocal events in speech, addressing previous limitations with a refined taxonomy, a new benchmark, and specialized models.

Contribution

It develops a detailed taxonomy of vocal events, creates WESR-Bench for evaluation, and trains models that outperform existing solutions in event localization within speech.

Findings

01. WESR-Bench enables precise measurement of vocal-event localization.

02. Specialized models surpass open-source models and commercial APIs.

03. The refined taxonomy improves event categorization and detection.

Abstract

Speech conveys not only linguistic information but also rich non-verbal vocal events such as laughing and crying. While semantic transcription is well-studied, the precise localization of non-verbal events remains a critical yet under-explored challenge. Current methods suffer from insufficient task definitions with limited category coverage and ambiguous temporal granularity. They also lack standardized evaluation frameworks, hindering the development of downstream applications. To bridge this gap, we first develop a refined taxonomy of 21 vocal events, with a new categorization into discrete (standalone) versus continuous (mixed with speech) types. Based on the refined taxonomy, we introduce WESR-Bench, an expert-annotated evaluation set (900+ utterances) with a novel position-aware protocol that disentangles ASR errors from event detection, enabling precise localization measurement…
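To make the discrete versus continuous distinction concrete, here is a minimal sketch of how word-level event positions could be separated from the transcribed words. It is not the paper's annotation format or evaluation protocol. The tag syntax (`[label]` for a standalone event, `<label>…</label>` for an event overlapping speech), the `Event` class, and the `parse` function are all assumptions made for illustration, and the excerpt does not say which of the 21 events fall into which type.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    label: str   # event name, e.g. "laughing" (illustrative)
    kind: str    # "discrete" (standalone) or "continuous" (mixed with speech)
    start: int   # index of the first word at or after the event
    end: int     # one past the last covered word (== start for discrete)


# Hypothetical tag syntax: [label] | <label> or </label> | plain word.
TOKEN = re.compile(r"\[(\w+)\]|<(/?)(\w+)>|([^\s<\[]+)")


def parse(tagged: str):
    """Split a tagged transcript into plain words and word-indexed events."""
    words, events, open_spans = [], [], {}
    for m in TOKEN.finditer(tagged):
        discrete, closing, span_label, word = m.groups()
        if discrete:
            # A standalone event sits between words, so it has zero width.
            events.append(Event(discrete, "discrete", len(words), len(words)))
        elif span_label:
            if closing:
                # Close the span: it covers words[start:len(words)].
                start = open_spans.pop(span_label)
                events.append(Event(span_label, "continuous", start, len(words)))
            else:
                open_spans[span_label] = len(words)
        elif word:
            words.append(word)
    return words, events


words, events = parse("well [laughing] <crying>that was sad</crying> right")
```

Here `words` is `["well", "that", "was", "sad", "right"]`, the discrete event sits at word index 1, and the continuous event covers words 1 to 3. Keeping words and event positions in separate structures is one way to score event placement without mixing in word-level transcription errors, which is the motivation the abstract gives for a position-aware protocol.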

Code & Models

Datasets

yfish/WESR-Bench

Taxonomy

Topics: Music and Audio Processing · Emotion and Mood Recognition · Speech Recognition and Synthesis