AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling

Jiacheng Shi; Hongfei Du; Xinyuan Song; Y. Alicia Hong; Yanfu Zhang; Ye Gao

arXiv:2605.11098 · cs.SD · May 13, 2026

TL;DR

A neural speech codec designed to preserve emotional cues during compression, enhancing expressiveness and naturalness in speech modeling.

Contribution

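Introduces an emotion-guided neural speech codec that retains emotional information at the representation level by combining emotion-semantic guided latent modulation, relation-preserving emotional-semantic distillation, and emotion-weighted semantic alignment.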

Findings

01 Improved emotion consistency in reconstructed speech.

02 Enhanced perceptual quality without losing content accuracy.

03 Effective preservation of emotional cues in downstream tasks.

Abstract

Neural speech codecs provide discrete representations for speech language models, but emotional cues are often degraded during quantization. Existing codecs mainly optimize acoustic reconstruction, leaving emotion expressiveness insufficiently modeled at the representation level. We propose an emotion-guided neural speech codec that explicitly preserves emotional information while maintaining semantic fidelity and prosodic naturalness. Our framework combines emotion-semantic guided latent modulation, relation-preserving emotional-semantic distillation, and emotion-weighted semantic alignment to retain emotionally salient cues under compression. Extensive evaluations across speech reconstruction, emotion recognition, and downstream text-to-speech generation demonstrate improved emotion consistency and perceptual quality without sacrificing content accuracy.
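
The abstract names relation-preserving distillation but does not give its formulation. As a rough illustration of the general idea only, and not the authors' actual loss, the sketch below assumes a common relational-distillation form: the pairwise cosine similarities among the codec's latent vectors (the "student") are pushed toward the pairwise similarities among a teacher's emotion embeddings. The function names `cosine`, `similarity_matrix`, and `relation_loss`, and the toy vectors, are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(embeddings):
    """All pairwise cosine similarities within one set of embeddings."""
    return [[cosine(u, v) for v in embeddings] for u in embeddings]

def relation_loss(student, teacher):
    """Mean squared difference between student and teacher similarity matrices.

    Assumed form, for illustration: only the relations between utterances are
    compared, so student and teacher may have different dimensionalities.
    """
    s = similarity_matrix(student)
    t = similarity_matrix(teacher)
    n = len(student)
    total = sum((s[i][j] - t[i][j]) ** 2 for i in range(n) for j in range(n))
    return total / (n * n)

# Toy teacher emotion embeddings for three utterances (2-D, made up).
teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Student latents (3-D) with the same pairwise geometry as the teacher.
student_preserving = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [2.0, 2.0, 0.0]]

# Student latents that collapsed to one direction, losing all relations.
student_collapsed = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

loss_preserving = relation_loss(student_preserving, teacher)
loss_collapsed = relation_loss(student_collapsed, teacher)
```

In this toy setup the relation-preserving student incurs a loss near zero, while the collapsed student is penalized even though no individual vector is compared directly with its teacher counterpart. That is the property that makes a relational objective usable across mismatched embedding spaces.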
