AlignCap: Aligning Speech Emotion Captioning to Human Preferences

Ziqi Liang; Haoxiang Shi; Hanhui Chen

arXiv:2410.19134·cs.CL·October 28, 2024

TL;DR

AlignCap is an LLM-based speech emotion captioning method that uses knowledge distillation and preference optimization regularization to align captions with human preferences, reduce hallucinations, and improve zero-shot performance.

Contribution

The paper proposes AlignCap, a new speech emotion captioning (SEC) approach that uses knowledge distillation and preference optimization to improve caption accuracy and reduce hallucinations, especially in zero-shot scenarios.

Findings

01 Outperforms state-of-the-art SEC methods in zero-shot tasks

02 Reduces hallucinations and improves factuality in speech emotion captions

03 Effective use of large language models with regularization techniques

Abstract

Speech Emotion Captioning (SEC) has gradually become an active research task. The emotional content conveyed through human speech is often complex, and classifying it into fixed categories may not be enough to fully capture speech emotions. Describing speech emotions through natural language may be a more effective approach. However, existing SEC methods often produce hallucinations and lose generalization on unseen speech. To overcome these problems, we propose AlignCap, which aligns speech emotion captioning to human preferences using a large language model (LLM) and has two properties: 1) Speech-Text Alignment, which minimizes the divergence between the LLM's response prediction distributions for speech and text inputs using Knowledge Distillation (KD) Regularization; 2) Human Preference Alignment, where we design Preference Optimization (PO) Regularization to eliminate…
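
The Speech-Text Alignment idea in the abstract (minimizing the divergence between the LLM's response distributions for speech and text inputs) can be illustrated with a small sketch. This is not the authors' implementation. It assumes that the text-input distribution serves as the reference and the speech-input distribution as the one being pulled toward it, that the divergence is a forward KL averaged over response tokens, and that logits arrive as plain Python lists. The function names softmax, kl_divergence, and kd_regularizer are hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def kd_regularizer(text_logits, speech_logits):
    """Mean per-token KL(text || speech) over the response positions.

    Assumption: text-conditioned predictions act as the reference
    distribution and speech-conditioned predictions are compared to them.
    Both arguments are lists of per-token logit lists of equal shape.
    """
    if len(text_logits) != len(speech_logits):
        raise ValueError("text and speech responses must have the same length")
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(text_logits, speech_logits)
    ]
    return sum(per_token) / len(per_token)


if __name__ == "__main__":
    # Toy logits for a 2-token response over a 3-word vocabulary.
    text = [[2.0, 0.5, 0.1], [0.3, 1.7, 0.2]]
    speech = [[1.5, 0.9, 0.2], [0.1, 1.0, 0.8]]
    print(kd_regularizer(text, speech))
```

In this sketch the penalty is zero when the speech-conditioned and text-conditioned predictions agree at every response token and grows as they diverge, which is the behavior a regularization term of this kind is meant to encourage.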

Taxonomy

Topics: Subtitles and Audiovisual Media · Multimodal Machine Learning Applications · Speech and dialogue systems

Methods: Knowledge Distillation