Segmentation-Variant Codebooks for Preservation of Paralinguistic and Prosodic Information

Nicholas Sanders; Yuanchao Li; Korin Richmond; Simon King

arXiv:2505.15667·eess.AS·May 22, 2025


Open Access

TL;DR

This paper introduces Segmentation-Variant Codebooks (SVCs) that quantize speech at different linguistic levels to better preserve prosodic and paralinguistic features in speech compression and synthesis tasks.

Contribution

The paper proposes Segmentation-Variant Codebooks (SVCs), which quantize speech at the frame, phone, word, and utterance levels and so factorize it into multiple streams of segment-specific discrete features, improving preservation of prosodic and paralinguistic information.

Findings

01. SVCs preserve prosodic and paralinguistic information more effectively than conventional codebooks across probing tasks.

02. Pooling before, rather than after, discretization better retains segment-level information.

03. Resynthesis with SVCs improves style realization and slightly improves quality while maintaining intelligibility.
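
The difference between pooling before and after discretization can be illustrated with a small sketch. Everything here is an assumption made for illustration, not the authors' implementation: this page does not state the pooling operator, so mean pooling is used before discretization and a majority vote over frame codes after it; the features, segment boundaries, and codebook are toy values; and the function names are hypothetical.

```python
import numpy as np

def quantize(features, codebook):
    """Return the index of the nearest codebook vector for each feature row."""
    # Squared Euclidean distance between every feature row and every code.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def pool_then_quantize(frames, boundaries, codebook):
    """Assumed variant: mean-pool the frames of each segment, then quantize."""
    pooled = np.stack([frames[start:end].mean(axis=0) for start, end in boundaries])
    return quantize(pooled, codebook)

def quantize_then_pool(frames, boundaries, codebook):
    """Assumed variant: quantize each frame, then keep the most frequent code."""
    codes = quantize(frames, codebook)
    return np.array([
        np.bincount(codes[start:end], minlength=len(codebook)).argmax()
        for start, end in boundaries
    ])

# Toy 2-D "frame features": the first segment changes value halfway through.
frames = np.array([[0, 0], [0, 0], [3, 3], [3, 3], [3, 3], [3, 3]], dtype=float)
boundaries = [(0, 4), (4, 6)]  # hypothetical segment spans, end-exclusive
codebook = np.array([[0, 0], [1.5, 1.5], [3, 3]], dtype=float)

print(pool_then_quantize(frames, boundaries, codebook))  # [1 2]
print(quantize_then_pool(frames, boundaries, codebook))  # [0 2]
```

In this toy case the first segment's mean lands on the middle code when pooling comes first, whereas the majority vote over frame codes falls back to one of the two frame-level codes and loses the fact that the segment spans both.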

Abstract

Quantization in SSL speech models (e.g., HuBERT) improves compression and performance in tasks like language modeling, resynthesis, and text-to-speech but often discards prosodic and paralinguistic information (e.g., emotion, prominence). While increasing codebook size mitigates some loss, it inefficiently raises bitrates. We propose Segmentation-Variant Codebooks (SVCs), which quantize speech at distinct linguistic units (frame, phone, word, utterance), factorizing it into multiple streams of segment-specific discrete features. Our results show that SVCs are significantly more effective at preserving prosodic and paralinguistic information across probing tasks. Additionally, we find that pooling before rather than after discretization better retains segment-level information. Resynthesis experiments further confirm improved style realization and slightly improved quality while…
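
The bitrate trade-off mentioned in the abstract follows from simple arithmetic: a stream of discrete tokens costs its token rate multiplied by log2 of its codebook size. The sketch below works through that formula. All numbers are illustrative assumptions rather than values from the paper: 50 frames per second corresponds to a 20 ms frame stride, while the phone, word, and utterance rates and all codebook sizes are made up for the example.

```python
import math

def stream_bitrate(tokens_per_second, codebook_size):
    """Bits per second for one stream of uniformly coded discrete tokens."""
    return tokens_per_second * math.log2(codebook_size)

# Single frame-level stream: growing the codebook from 256 to 4096 entries.
small = stream_bitrate(50, 256)    # 50 tokens/s * 8 bits  = 400 bits/s
large = stream_bitrate(50, 4096)   # 50 tokens/s * 12 bits = 600 bits/s

# Hypothetical multi-stream setup: (tokens per second, codebook size) per level.
streams = {
    "frame": (50, 256),
    "phone": (10, 256),        # assumed rate
    "word": (3, 256),          # assumed rate
    "utterance": (0.25, 256),  # assumed: one utterance every 4 seconds
}
total = sum(stream_bitrate(rate, size) for rate, size in streams.values())

print(small, large, total)  # 400.0 600.0 506.0
```

Under these assumed rates, the coarser streams add little to the total because they emit far fewer tokens per second than the frame-level stream.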


Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques