VowelPrompt: Hearing Speech Emotions from Text via Vowel-level Prosodic Augmentation
Yancheng Wang, Osama Hanna, Ruiming Xie, Xianfeng Rui, Maohao Shen, Xuedong Zhang, Christian Fuegen, Jilong Wu, Debjyoti Paul, Arthur Guo, Zhihong Lei, Ozlem Kalinli, Qing He, Yingzhen Yang

TL;DR
VowelPrompt enhances speech emotion recognition by integrating vowel-level prosodic features into large language models, improving interpretability and performance across diverse datasets and conditions.
Contribution
The paper introduces VowelPrompt, a novel framework that incorporates vowel-level prosodic cues into LLMs for more accurate and interpretable speech emotion recognition.
Findings
Outperforms state-of-the-art methods in zero-shot and cross-domain settings.
Enables interpretable explanations grounded in prosody and semantics.
Shows robustness across multiple languages and speaker variations.
Abstract
Emotion recognition in speech presents a complex multimodal challenge, requiring comprehension of both linguistic content and vocal expressivity, particularly prosodic features such as fundamental frequency, intensity, and temporal dynamics. Although large language models (LLMs) have shown promise in reasoning over textual transcriptions for emotion recognition, they typically neglect fine-grained prosodic information, limiting their effectiveness and interpretability. In this work, we propose VowelPrompt, a linguistically grounded framework that augments LLM-based emotion recognition with interpretable, fine-grained vowel-level prosodic cues. Drawing on phonetic evidence that vowels serve as primary carriers of affective prosody, VowelPrompt extracts pitch-, energy-, and duration-based descriptors from time-aligned vowel segments, and converts these features into natural language…
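The abstract's step of turning vowel-level pitch, energy, and duration measurements into natural-language descriptors can be illustrated with a minimal sketch. This is not the authors' implementation: the `VowelSegment` fields, the z-score binning against utterance-level statistics, the 0.5 threshold, and the output wording are all assumptions made for illustration, and the feature values are taken as already extracted from time-aligned vowel segments.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class VowelSegment:
    """Hypothetical container for one time-aligned vowel (illustrative only)."""
    vowel: str          # phone label, e.g. "AE"
    word: str           # word the vowel belongs to
    f0_hz: float        # mean fundamental frequency over the segment
    energy_db: float    # mean intensity over the segment
    duration_ms: float  # segment length


def _level(value, mu, sigma, labels, threshold=0.5):
    """Bin a value into (low, mid, high) labels by its z-score.

    The threshold of 0.5 standard deviations is an arbitrary assumption.
    """
    if sigma == 0:
        return labels[1]
    z = (value - mu) / sigma
    if z > threshold:
        return labels[2]
    if z < -threshold:
        return labels[0]
    return labels[1]


def describe_vowels(segments):
    """Convert per-vowel features into one natural-language line per vowel."""
    # Normalise each feature against the statistics of this utterance.
    stats = {}
    for name in ("f0_hz", "energy_db", "duration_ms"):
        values = [getattr(s, name) for s in segments]
        stats[name] = (mean(values), pstdev(values))

    lines = []
    for s in segments:
        pitch = _level(s.f0_hz, *stats["f0_hz"], ("low", "mid", "high"))
        energy = _level(s.energy_db, *stats["energy_db"], ("low", "mid", "high"))
        length = _level(s.duration_ms, *stats["duration_ms"],
                        ("short", "medium", "long"))
        lines.append(
            f"{s.vowel} in '{s.word}': pitch {pitch}, "
            f"energy {energy}, duration {length}"
        )
    return lines


# Made-up feature values, for illustration only.
example = [
    VowelSegment("AE", "happy", 260.0, 72.0, 180.0),
    VowelSegment("IY", "happy", 200.0, 65.0, 90.0),
    VowelSegment("UW", "you", 140.0, 58.0, 60.0),
]
prompt_lines = describe_vowels(example)
```

The resulting lines could be appended to the transcript in an LLM prompt. Normalising within the utterance is only one possible choice; speaker-level or corpus-level statistics would give different bins.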
Peer Reviews
Decision · ICLR 2026 Poster
The proposed method has several strengths, which can be grouped into two categories: **Conceptual**: the method provides a privacy-oriented, interpretable formulation of SER by augmenting transcripts with symbolic, vowel-level prosody tokens (F0 level, intensity, duration), allowing emotion inference without raw audio at inference. This design is linguistically grounded (vowels are typically stable prosodic carriers), separates lexical content from paralinguistic cues, and supports closed-set c
Because the oracle used for RLVR traces (GPT-4o) is itself an LLM, there is a material risk that fine-tuned student LLMs learn spurious lexical or formatting heuristics unrelated to the intended prosodic mechanism. To establish that the model uses vowel-level prosodic tokens rather than incidental cues, please include controlled counterfactual ablations that preserve input statistics while breaking the hypothesized channel. Some examples to test: - Transcript shuffle control: randomly permute wo
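One way to build a control that preserves input statistics while breaking the hypothesized prosodic channel is sketched below. It is an assumed example, not one of the reviewer's listed controls (the list is truncated in this excerpt): prosody descriptors are permuted across utterances, so their overall distribution is unchanged while each one is decoupled from its transcript and label. The dictionary keys `transcript`, `prosody`, and `label` are hypothetical.

```python
import random


def shuffle_prosody_across_utterances(examples, seed=0):
    """Return copies of `examples` with prosody descriptors permuted.

    Each example is a dict with hypothetical keys 'transcript', 'prosody',
    and 'label'. Transcripts and labels stay in place; only the prosody
    field is reassigned, so its marginal distribution is preserved.
    """
    rng = random.Random(seed)  # fixed seed keeps the control reproducible
    prosody = [ex["prosody"] for ex in examples]
    rng.shuffle(prosody)
    return [{**ex, "prosody": p} for ex, p in zip(examples, prosody)]


# Made-up examples, for illustration only.
data = [
    {"transcript": "i am fine", "prosody": "pitch high", "label": "happy"},
    {"transcript": "leave me alone", "prosody": "pitch low", "label": "sad"},
    {"transcript": "what was that", "prosody": "pitch mid", "label": "fear"},
]
control = shuffle_prosody_across_utterances(data, seed=1)
```

If accuracy on such a control set stays close to accuracy on the intact inputs, that would suggest the model is not relying on the prosodic descriptors.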
1. Fine-grained modeling unit: The vowel-centric, phoneme-level prosodic prompting is a clear step beyond sentence-level prosody features, improving both recognition performance and interpretability. The focus on vowels is linguistically motivated and empirically supported.
2. Two-stage training design: The two-stage framework is logically structured, and the RL-based targeted optimization appears to contribute to robust cross-domain generalization.
3. Breadth of evaluation: The empirical study
1. The paper emphasizes the primacy of vowels for prosody, while citing evidence that consonants can convey complementary emotional cues (e.g., Bitouk et al., 2010). However, the method entirely excludes consonant segments. This design choice risks discarding potentially informative signals (e.g., frication intensity, voicing onsets, burst characteristics) that may be emotion-sensitive in certain languages and speaking styles. A controlled analysis is needed to justify the exclusion.
2. The base
The paper introduces pitch-, energy-, and duration-based descriptors extracted from time-aligned vowel segments and uses them to generate emotion-salient prompts that improve LLM-based emotion recognition. The process of generating the descriptors is well described, and the results demonstrate the promise of the proposed work. Evaluation on multiple datasets demonstrates that the findings generalize.
The approach introduces prosodic descriptors generated from standard benchmark datasets, but it is not clear whether the authors intend to share them with the community, which would help both in replicating the reported results and in fostering future research. In particular, it is not clear whether the setup used to generate the descriptors, or the descriptors obtained from the five datasets used in the paper, will be publicly shared. There
Taxonomy
Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Stuttering Research and Treatment
