TELEVAL: A Dynamic Benchmark Designed for Spoken Language Models in Chinese Interactive Scenarios

Zehan Li; Hongjie Chen; Qing Wang; Yuxin Zhang; Jing Zhou; Hang Lv; Mengjie Du; Yaodong Song; Jie Lian; Jian Kang; Jie Li; Yongxiang Li; Xuelong Li

arXiv:2507.18061 · cs.CL · January 13, 2026

Open Access

TL;DR

TELEVAL is a new dynamic benchmark for evaluating Chinese spoken language models in realistic, user-centered interaction scenarios, focusing on content understanding and interactional appropriateness.

Contribution

The paper introduces TELEVAL, a benchmark that emphasizes the natural, interactional aspects of spoken language model behavior and addresses the limitations of existing task-focused evaluation methods.

Findings

01. Current SLMs excel in content comprehension but lack natural interaction skills.

02. Models struggle with producing socially appropriate and colloquial responses.

03. TELEVAL highlights the gap between semantic accuracy and natural conversational behavior.

Abstract

Spoken language models (SLMs) have advanced rapidly in recent years, accompanied by a growing number of evaluation benchmarks. However, most existing benchmarks emphasize task completion and capability scaling, while remaining poorly aligned with how users interact with SLMs in real-world spoken conversations. Effective spoken interaction requires not only accurate understanding of user intent and content, but also the ability to respond with appropriate interactional strategies. In this paper, we present TELEVAL, a dynamic, user-centered benchmark for evaluating SLMs in realistic Chinese spoken interaction scenarios. TELEVAL consolidates evaluation into two core aspects. Reliable Content Fulfillment assesses whether models can comprehend spoken inputs and produce semantically correct responses. Interactional Appropriateness evaluates whether models act as socially capable…

Taxonomy

Topics: Speech and dialogue systems · Natural Language Processing Techniques