On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation

Jeff Chan-Jan Sju; Liang-Hsuan Tseng; Yi-Cheng Lin; Yen-Chun Kuo; Ju-Chieh Chou; Kai-Wei Chang; Hung-yi Lee; Carlos Busso

arXiv:2601.06329·cs.CL·January 13, 2026

TL;DR

This paper critiques the use of global token perplexity for evaluating spoken language models and proposes alternative metrics that align more closely with human judgments and change how models compare with one another.

Contribution

The paper introduces likelihood- and generative-based evaluation methods tailored to speech, addressing the limitations of perplexity metrics carried over from text.

Findings

01. New metrics correlate better with human opinion scores

02. Revised evaluation reduces performance gap between models and humans

03. Traditional perplexity underestimates speech model capabilities

Abstract

Generative spoken language models pretrained on large-scale raw audio can continue a speech prompt with appropriate content while preserving attributes like speaker and emotion, serving as foundation models for spoken dialogue. In prior literature, these models are often evaluated using "global token perplexity", which directly applies the text perplexity formulation to speech tokens. However, this practice overlooks fundamental differences between speech and text modalities, possibly leading to an underestimation of the speech characteristics. In this work, we propose a variety of likelihood- and generative-based evaluation methods that serve in place of naive global token perplexity. We demonstrate that the proposed evaluations more faithfully reflect perceived generation quality, as evidenced by stronger correlations with human-rated mean opinion scores (MOS). When assessed under…
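For readers unfamiliar with the baseline being critiqued, the following is a minimal sketch of the standard text perplexity formulation applied to discrete speech tokens. It assumes that "global" means pooling every token position in the evaluation set into a single average; the paper's exact definition is not given in this excerpt. The function name and the input format (one list of natural-log token probabilities per utterance) are hypothetical and chosen only for illustration.

```python
import math


def global_token_perplexity(token_logprobs):
    """Perplexity over all tokens pooled across utterances.

    token_logprobs: list of lists; each inner list holds the natural-log
    probability a model assigned to each token of one utterance.
    (Hypothetical input format, not taken from the paper.)
    """
    total_logprob = 0.0
    total_tokens = 0
    for utterance in token_logprobs:
        total_logprob += sum(utterance)
        total_tokens += len(utterance)
    if total_tokens == 0:
        raise ValueError("cannot compute perplexity over zero tokens")
    # exp of the mean negative log-likelihood per token
    return math.exp(-total_logprob / total_tokens)
```

Under this formulation every token contributes equally to the average, whichever utterance it comes from and whatever it encodes. The sketch covers only the baseline, not the alternative evaluations the paper proposes.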

Taxonomy

Topics: Speech Recognition and Synthesis · Face recognition and analysis · Emotion and Mood Recognition