Prompt Amplification and Zero-Shot Late Fusion in Audio-Language Models for Speech Emotion Recognition

Saurabh Kataria; Xiao Hu

arXiv:2603.23057 · eess.AS · March 25, 2026

TL;DR

This paper introduces ZS-Fuse, a late-fusion method that combines zero-shot emotion estimates from audio-language models with specialist foundation models and uses prompt amplification to improve speech emotion recognition performance.

Contribution

The paper proposes a late-fusion method that pairs zero-shot audio-language model estimates with specialist foundation models and adds prompt amplification, and it reports improved speech emotion recognition results over existing baselines.

Findings

01. ZS-Fuse outperforms SOTA baselines on multiple datasets.

02. Prompt amplification enhances zero-shot emotion detection.

03. Combining ALMs with specialist FMs yields better SER accuracy.

Abstract

Audio-Language Models (ALMs) are making strides in understanding speech and non-speech audio. However, domain-specialist Foundation Models (FMs) remain the best for closed-ended speech processing tasks such as Speech Emotion Recognition (SER). Using ALMs for zero-shot SER is a popular choice, but their potential to work with specialists to achieve state-of-the-art (SOTA) performance remains unexplored. We propose ZS-Fuse, a late-fusion method that combines zero-shot emotion estimates from a dual-encoder ALM with specialist FMs. To handle ambiguity in emotions and sensitivity to prompt choice, we 1) use a simple prompt ensemble and 2) propose a novel technique called prompt amplification, which repeats audio and text queries to discover stronger zero-shot capabilities. We demonstrate the efficacy of our technique by evaluating ZS-Fuse with three dual-encoder ALMs and two FMs, and report…
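The general idea of a prompt ensemble followed by late fusion can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy 2-D embeddings, the softmax temperature, and the fusion weight `alpha` are all assumptions made for illustration, and weighted averaging of posteriors is only one common late-fusion rule. Prompt amplification is left out, since the abstract does not give enough detail to reproduce it.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_posterior(audio_emb, prompt_embs_by_class, temperature=0.1):
    """Prompt ensemble: average the audio-text similarity over every
    prompt of a class, then normalize across classes.
    The temperature value is an assumption, not taken from the paper."""
    scores = []
    for prompts in prompt_embs_by_class:
        sims = [cosine(audio_emb, p) for p in prompts]
        scores.append(sum(sims) / len(sims))
    return softmax(scores, temperature)


def late_fuse(zero_shot_probs, specialist_probs, alpha=0.5):
    """Late fusion as a weighted average of two class posteriors.
    alpha is a hypothetical weight on the zero-shot branch."""
    fused = [alpha * z + (1.0 - alpha) * s
             for z, s in zip(zero_shot_probs, specialist_probs)]
    total = sum(fused)
    return [f / total for f in fused]


# Toy example: two classes ("happy", "sad"), two prompts per class.
# In practice the embeddings would come from a dual-encoder ALM.
audio = [1.0, 0.2]
prompts = [
    [[1.0, 0.0], [0.9, 0.1]],  # prompt embeddings for "happy"
    [[0.0, 1.0], [0.1, 0.9]],  # prompt embeddings for "sad"
]
specialist = [0.6, 0.4]  # posterior from a hypothetical specialist FM

zs = zero_shot_posterior(audio, prompts)
fused = late_fuse(zs, specialist, alpha=0.5)
```

Fusing at the posterior level keeps the two models independent: the specialist can be trained or swapped without touching the ALM, and only `alpha` would need tuning on held-out data.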

Taxonomy

Topics: Emotion and Mood Recognition · Music and Audio Processing · Speech Recognition and Synthesis