Probing neural audio codecs for distinctions among English nuclear tunes
Juan Pablo Vigneaux, Jennifer Cole

TL;DR
This paper investigates whether neural audio codecs used in spoken dialogue models encode pitch patterns of English nuclear tunes, revealing their limitations in capturing nuanced intonational distinctions.
Contribution
The paper shows that neural audio codecs carry some pitch-related information but do not fully represent the intonational patterns that distinguish English nuclear tunes.
Findings
Linear probes achieve above-chance accuracy in distinguishing nuclear tune types.
Information about tunes is distributed across all codebooks, which challenges distinctions previously drawn among the codebooks.
Nonlinear probes improve accuracy but still fall short of human performance.
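To make the notion of a linear probe and its chance baseline concrete, here is a minimal sketch in Python. It is not the authors' code: the data are synthetic stand-ins for codec latents, the 8-class setup only mirrors the number of tunes mentioned in the abstract, the chance level assumes balanced classes, and the names train_linear_probe and accuracy are hypothetical.

```python
import numpy as np

def train_linear_probe(X, y, n_classes, lr=0.1, epochs=500):
    """Fit a softmax (multinomial logistic) probe with plain gradient descent."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot targets
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        G = (P - Y) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b

def accuracy(W, b, X, y):
    """Fraction of rows whose highest-scoring class matches the label."""
    return float(np.mean(np.argmax(X @ W + b, axis=1) == y))

# Synthetic stand-in for pooled codec latents (assumption: NOT real data).
rng = np.random.default_rng(0)
n_classes, dim, n_per_class = 8, 16, 60
centers = rng.normal(size=(n_classes, dim))
y = np.repeat(np.arange(n_classes), n_per_class)
X = centers[y] + rng.normal(size=(len(y), dim))

# Shuffle, then hold out 20% of the items for testing.
perm = rng.permutation(len(y))
X, y = X[perm], y[perm]
split = int(0.8 * len(y))

W, b = train_linear_probe(X[:split], y[:split], n_classes)
test_acc = accuracy(W, b, X[split:], y[split:])
chance = 1.0 / n_classes  # 0.125 for 8 balanced classes

print(f"test accuracy: {test_acc:.2f} (chance: {chance:.3f})")
```

Under the same balanced-class assumption, chance for the eight-tune task is 1/8 = 0.125, so the reported top average test accuracy of 0.31 sits above that baseline.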
Abstract
State-of-the-art spoken dialogue models (Défossez et al. 2024; Schalkwyk et al. 2025) use neural audio codecs to "tokenize" audio signals into a lower-frequency stream of vectorial latent representations, each quantized using a hierarchy of vector codebooks. A transformer layer allows these representations to reflect some time- and context-dependent patterns. We train probes on labeled audio data from Cole et al. (2023) to test whether the pitch trajectories that characterize English phrase-final (nuclear) intonational tunes are among these patterns. Results: Linear probes trained on the unquantized latents or some of the associated codewords yield above-chance accuracy in distinguishing eight phonologically specified nuclear tunes with monotonal pitch accents (top average test accuracy (TATA): 0.31) and the five clusters of these tunes that are robust in human speech production and…
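One common way to quantize a latent with a hierarchy of codebooks is residual vector quantization, where each codebook encodes what the previous ones left over. The sketch below illustrates that idea only; whether the codecs studied here use exactly this scheme is an assumption, the codebooks are random rather than learned, and rvq_encode and rvq_decode are hypothetical names.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Quantize x with a stack of codebooks; each level codes the remaining residual."""
    residual = x.copy()
    indices = []
    for cb in codebooks:  # cb has shape (n_codewords, dim)
        dists = np.sum((cb - residual) ** 2, axis=1)
        k = int(np.argmin(dists))  # nearest codeword at this level
        indices.append(k)
        residual = residual - cb[k]
    return indices, residual

def rvq_decode(indices, codebooks):
    """Sum the selected codeword from every level to reconstruct the latent."""
    return sum(cb[k] for cb, k in zip(codebooks, indices))

# Random codebooks with shrinking scale per level (assumption: illustrative only).
rng = np.random.default_rng(0)
dim, n_levels, n_codewords = 8, 4, 32
codebooks = [
    rng.normal(scale=0.5 ** level, size=(n_codewords, dim))
    for level in range(n_levels)
]

latent = rng.normal(size=dim)  # stand-in for one unquantized latent frame
indices, residual = rvq_encode(latent, codebooks)
reconstruction = rvq_decode(indices, codebooks)

print("codeword indices per level:", indices)
print("reconstruction error:", float(np.linalg.norm(latent - reconstruction)))
```

In this setup a probe can be trained either on the unquantized latent or on the codewords selected at individual levels, which is the kind of comparison the abstract describes.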
Taxonomy
Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Emotion and Mood Recognition
