AlignCap: Aligning Speech Emotion Captioning to Human Preferences
Ziqi Liang, Haoxiang Shi, Hanhui Chen

TL;DR
AlignCap is a speech emotion captioning method built on a large language model (LLM). It applies knowledge distillation and preference optimization regularization to align captions with human preferences, reduce hallucinations, and improve zero-shot performance.
Contribution
The paper proposes AlignCap, a new Speech Emotion Captioning (SEC) approach that uses knowledge distillation and preference optimization to improve caption accuracy and reduce hallucinations, especially in zero-shot scenarios.
Findings
Outperforms state-of-the-art SEC methods in zero-shot tasks
Reduces hallucinations and improves factuality in speech emotion captions
Makes effective use of a large language model through knowledge distillation and preference optimization regularization
Abstract
Speech Emotion Captioning (SEC) has gradually become an active research task. The emotional content conveyed through human speech is often complex, and classifying it into fixed categories may not be enough to fully capture speech emotions. Describing speech emotions through natural language may be a more effective approach. However, existing SEC methods often produce hallucinations and lose generalization on unseen speech. To overcome these problems, we propose AlignCap, which aligns speech emotion captioning to human preferences based on a large language model (LLM) with two properties: 1) Speech-Text Alignment, which minimizes the divergence between the LLM's response prediction distributions for speech and text inputs using knowledge distillation (KD) regularization; 2) Human Preference Alignment, where we design Preference Optimization (PO) regularization to eliminate…
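The Speech-Text Alignment idea in the abstract, minimizing the divergence between the LLM's response distributions for speech and text inputs, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the choice of KL divergence, its direction (text as the reference distribution), and the per-token averaging are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def kd_regularizer(text_logits, speech_logits):
    """Hypothetical KD regularizer: mean per-token KL(text || speech).

    Each argument is a list of per-token logit vectors produced by the
    same LLM, once conditioned on the text input and once on the speech
    input. Treating the text-conditioned distribution as the reference
    is an assumption, not something stated in the abstract.
    """
    if not text_logits or len(text_logits) != len(speech_logits):
        raise ValueError("expected two non-empty sequences of equal length")
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(text_logits, speech_logits)
    ]
    return sum(per_token) / len(per_token)
```

When the speech-conditioned and text-conditioned distributions match at every token, the regularizer is zero; it grows as the two diverge, so adding it to the training loss pushes the model toward giving the same response for a spoken input as for its text counterpart.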
Taxonomy
Topics: Subtitles and Audiovisual Media · Multimodal Machine Learning Applications · Speech and dialogue systems
Methods: Knowledge Distillation
