Edge-Cloud Collaborative Speech Emotion Captioning via Token-Level Speculative Decoding in Audio-Language Models

Xiangyuan Xue; Jiajun Lu; Yan Gao; Gongping Huang; Ting Dang; Hong Jia

arXiv:2603.11397·cs.SD·March 13, 2026

TL;DR

This paper introduces an edge-cloud collaborative approach to speech emotion captioning that balances computational efficiency, privacy, and caption quality through uncertainty-guided speculative decoding, reporting up to 62.7% BLEU improvement on MER2024 and 1.4x lower latency than edge-only models.

Contribution

The paper proposes Uncertainty-Guided Speculative Decoding (UGSD), a framework in which a lightweight edge model drafts captions on-device and only high-uncertainty token blocks are escalated to a stronger cloud verifier, improving both efficiency and caption quality.

Findings

01. Up to 62.7% BLEU score improvement

02. 1.4x lower latency compared to edge-only models

03. 8.5x higher token throughput

Abstract

Speech Emotion Captioning (SEC) leverages large audio-language models to generate rich, context-aware affective descriptions from speech. However, real-world deployment remains challenging due to the substantial computational demands on resource-constrained edge devices and the privacy risks of transmitting biometric audio. While smaller audio-language models enable efficient on-device SEC, their limited capacity often weakens subtle paralinguistic modeling and fine-grained affective grounding. We propose an edge-cloud collaborative framework based on Uncertainty-Guided Speculative Decoding (UGSD). A lightweight edge model drafts captions locally, and only high-uncertainty token blocks are selectively escalated to a stronger cloud verifier for validation. Experiments on the MER2024 benchmark demonstrate substantial BLEU improvements up to 62.7%. UGSD further achieves 1.4x lower latency…
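
The escalation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the excerpt does not say how uncertainty is measured or how blocks are formed, so the use of Shannon entropy, the fixed `block_size`, the `threshold` value, and the `cloud_verify` callback are all assumptions made for illustration. The sketch also simplifies by gating a fully drafted caption block by block; a real speculative decoder would re-draft the remaining tokens after the verifier changes a block.

```python
import math
from typing import Callable, List, Sequence

Dist = Sequence[float]  # one next-token probability distribution from the edge model


def token_entropy(probs: Dist) -> float:
    """Shannon entropy in nats; used here as an ASSUMED uncertainty score."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def flag_uncertain_blocks(dists: Sequence[Dist], block_size: int, threshold: float) -> List[int]:
    """Return the start index of every block whose mean entropy exceeds the threshold."""
    flagged = []
    for start in range(0, len(dists), block_size):
        block = dists[start:start + block_size]
        mean_entropy = sum(token_entropy(d) for d in block) / len(block)
        if mean_entropy > threshold:
            flagged.append(start)
    return flagged


def escalate_uncertain_blocks(
    draft_tokens: Sequence[str],
    draft_dists: Sequence[Dist],
    cloud_verify: Callable[[List[str], List[str]], List[str]],  # hypothetical verifier API
    block_size: int = 4,      # assumed value, not from the paper
    threshold: float = 1.0,   # assumed value, not from the paper
) -> List[str]:
    """Keep confident draft blocks locally; send only flagged blocks to the verifier."""
    if len(draft_tokens) != len(draft_dists):
        raise ValueError("each draft token needs one probability distribution")
    flagged = set(flag_uncertain_blocks(draft_dists, block_size, threshold))
    caption: List[str] = []
    for start in range(0, len(draft_tokens), block_size):
        block = list(draft_tokens[start:start + block_size])
        if start in flagged:
            # The verifier sees the accepted prefix and returns the block it endorses.
            block = cloud_verify(list(caption), block)
        caption.extend(block)
    return caption


# Toy usage with a stub standing in for the cloud model.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
draft = ["the", "speaker", "sounds", "happy"]
dists = [confident, confident, uncertain, uncertain]


def stub_verifier(prefix: List[str], block: List[str]) -> List[str]:
    return ["seems", "anxious"]


caption = escalate_uncertain_blocks(draft, dists, stub_verifier, block_size=2, threshold=1.0)
```

In this toy run only the second block crosses the threshold, so the first two tokens never leave the device. That is the property the abstract relies on for both latency and privacy.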

Taxonomy

Topics: Speech and Audio Processing · Emotion and Mood Recognition · Music and Audio Processing