Probing neural audio codecs for distinctions among English nuclear tunes

Juan Pablo Vigneaux; Jennifer Cole

arXiv:2603.14035·cs.SD·March 17, 2026

Open Access

TL;DR

This paper investigates whether neural audio codecs used in spoken dialogue models encode pitch patterns of English nuclear tunes, revealing their limitations in capturing nuanced intonational distinctions.

Contribution

The paper demonstrates that neural audio codecs encode some pitch-related information but do not fully represent the complex intonational patterns of English nuclear tunes.

Findings

01

Linear probes achieve above-chance accuracy in distinguishing nuclear tune types.

02

Information about tunes is distributed across all codebooks, challenging previous distinctions.

03

Nonlinear probes improve accuracy but still fall short of human performance.
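
To make the idea of a linear probe and its comparison against chance concrete, here is a minimal sketch in Python. It is not the authors' code or data: the latents are synthetic Gaussian clusters standing in for codec representations, and the dimensionality (`DIM`), sample sizes, and training settings are assumptions chosen only for illustration. The probe itself is plain multinomial logistic regression, one common choice of linear probe.

```python
import numpy as np

N_CLASSES = 8   # eight nuclear tunes, as in the abstract
DIM = 16        # hypothetical latent dimensionality (assumption)

def sample(means, n_per_class, rng):
    """Draw noisy points around each class mean (synthetic stand-in for codec latents)."""
    labels = np.repeat(np.arange(len(means)), n_per_class)
    feats = means[labels] + rng.normal(size=(labels.size, means.shape[1]))
    return feats, labels

def train_linear_probe(x, y, n_classes, lr=0.1, steps=500):
    """Fit multinomial logistic regression by full-batch gradient descent."""
    w = np.zeros((x.shape[1], n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[y]
    for _ in range(steps):
        logits = x @ w + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / len(x)             # gradient of mean cross-entropy
        w -= lr * (x.T @ grad)
        b -= lr * grad.sum(axis=0)
    return w, b

def accuracy(w, b, x, y):
    """Fraction of examples whose highest-scoring class matches the label."""
    return float(np.mean(np.argmax(x @ w + b, axis=1) == y))

rng = np.random.default_rng(0)
means = rng.normal(size=(N_CLASSES, DIM))            # one synthetic cluster per tune
x_train, y_train = sample(means, 100, rng)
x_test, y_test = sample(means, 50, rng)

w, b = train_linear_probe(x_train, y_train, N_CLASSES)
test_acc = accuracy(w, b, x_test, y_test)
chance = 1.0 / N_CLASSES                             # 0.125 for eight balanced classes
print(f"test accuracy {test_acc:.2f} vs chance {chance:.3f}")
```

The synthetic clusters here are far easier to separate than real tune classes, so the accuracy this sketch prints says nothing about the paper's results. The point is only the procedure: freeze the representation, fit a linear map to the labels, and compare held-out accuracy with the chance rate.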

Abstract

State-of-the-art spoken dialogue models (Défossez et al. 2024; Schalkwyk et al. 2025) use neural audio codecs to "tokenize" audio signals into a lower-frequency stream of vectorial latent representations, each quantized using a hierarchy of vector codebooks. A transformer layer allows these representations to reflect some time- and context-dependent patterns. We train probes on labeled audio data from Cole et al. (2023) to test whether the pitch trajectories that characterize English phrase-final (nuclear) intonational tunes are among these patterns. Results: Linear probes trained on the unquantized latents or some of the associated codewords yield above-chance accuracy in distinguishing eight phonologically specified nuclear tunes with monotonal pitch accents (top average test accuracy (TATA): 0.31) and the five clusters of these tunes that are robust in human speech production and…
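
As a rough illustration of what quantizing a latent with "a hierarchy of vector codebooks" can look like, the sketch below implements residual vector quantization, one common scheme in neural audio codecs. Treating the hierarchy as residual quantization is an assumption here, not something the abstract states, and the codebooks are random rather than learned. The function names, codebook sizes, and latent dimension are all hypothetical.

```python
import numpy as np

def rvq_encode(latent, codebooks):
    """Quantize `latent` stage by stage; each codebook encodes the previous residual."""
    residual = latent.copy()
    indices = []
    for cb in codebooks:
        dists = np.sum((cb - residual) ** 2, axis=1)  # squared distance to each codeword
        idx = int(np.argmin(dists))
        indices.append(idx)
        residual = residual - cb[idx]                 # what is left for the next stage
    return indices, residual

def rvq_decode(indices, codebooks):
    """Reconstruct the quantized latent as the sum of the selected codewords."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

rng = np.random.default_rng(0)
# Three random codebooks of 32 codewords in 4 dimensions (illustrative sizes only).
codebooks = [rng.normal(scale=s, size=(32, 4)) for s in (1.0, 0.5, 0.25)]
latent = rng.normal(size=4)

indices, residual = rvq_encode(latent, codebooks)
quantized = rvq_decode(indices, codebooks)
print("codeword indices per codebook:", indices)
```

Under this scheme a probe can be trained on the unquantized latent, on the codeword chosen at a single stage, or on sums of codewords across stages, which is one way to ask where in the hierarchy a given kind of information sits.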

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Emotion and Mood Recognition