Beyond Acoustic Emotion Recognition: Multimodal Pathos Analysis in Political Speech Using LLM-Based and Acoustic Emotion Models
Juergen Dietrich

TL;DR
This case study compares an acoustic speech emotion recognition model with LLM-based multimodal analysis on a single Bundestag speech. It finds that the LLM's Valence tracks TRUST-Pathos scores far more closely than the acoustic model's Valence does.
Contribution
It shows that, on one 51-segment speech, Valence from LLM-based multimodal analysis correlates more strongly with TRUST-Pathos scores than Valence from an acoustic model, which points to the importance of semantic context.
Findings
Gemini 2.5 Flash Valence correlates strongly with TRUST-Pathos scores (rho = +0.664, p < 0.001)
emotion2vec_plus_large Valence correlates only weakly with TRUST-Pathos scores (rho = +0.097)
Standard SER datasets have biases and limitations for political speech analysis
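The rho values above are Spearman rank correlations. As a minimal sketch of how such a coefficient can be computed over per-segment scores, the following uses only the Python standard library. The function names and the toy score lists are hypothetical placeholders, not the paper's code or data.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


# Toy per-segment scores (made up for illustration, NOT the paper's data).
llm_valence = [-0.4, 0.1, 0.6, -0.2, 0.3]
pathos_score = [0.2, 0.5, 0.9, 0.3, 0.4]
print(round(spearman_rho(llm_valence, pathos_score), 3))
```

Because rho depends only on ranks, it measures monotonic agreement between two score series and does not require them to share a scale, which suits comparing Valence values with Pathos scores.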
Abstract
We investigate whether acoustic emotion recognition models can serve as proxies for the Pathos dimension in political speech analysis, as operationalised by the TRUST multi-agent large language model (LLM) pipeline. Using a Bundestag plenary speech by Felix Banaszak (51 segments, 245 s) as a case study, we compare three analysis modalities: (1) emotion2vec_plus_large, an acoustic speech emotion recognition (SER) model whose continuous Arousal and Valence values are derived via post-hoc Russell Circumplex projection; (2) Gemini 2.5 Flash, an LLM analysing the full speech audio together with its transcript in an open-ended, context-aware fashion; and (3) TRUST-Pathos scores from a three-advocate LLM supervisor ensemble. Spearman rank correlations reveal that Gemini Valence correlates strongly with TRUST-Pathos (rho = +0.664, p < 0.001), whereas emotion2vec Valence does not (rho = +0.097,…
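The abstract states that continuous Arousal and Valence values are derived from the acoustic model via a post-hoc Russell Circumplex projection, but does not give the mapping. One common way to do this, sketched here under stated assumptions, is a probability-weighted average of fixed (valence, arousal) coordinates assigned to each categorical emotion. The label names and the coordinates in `CIRCUMPLEX_COORDS` are illustrative assumptions, not values taken from the paper.

```python
# Assumed (valence, arousal) positions in [-1, 1] for each emotion label.
# These coordinates are illustrative only; the paper's mapping may differ.
CIRCUMPLEX_COORDS = {
    "angry": (-0.6, 0.8),
    "disgusted": (-0.7, 0.3),
    "fearful": (-0.6, 0.7),
    "happy": (0.8, 0.5),
    "neutral": (0.0, 0.0),
    "sad": (-0.7, -0.5),
    "surprised": (0.3, 0.8),
}


def project_to_circumplex(probs):
    """Map categorical emotion probabilities to a (valence, arousal) pair.

    `probs` maps label -> probability. Labels without an assumed coordinate
    (for example a catch-all "other" class) are ignored, and the remaining
    mass is renormalised before averaging.
    """
    known = {label: p for label, p in probs.items() if label in CIRCUMPLEX_COORDS}
    total = sum(known.values())
    if total <= 0:
        raise ValueError("no probability mass on labels with known coordinates")
    valence = sum(p * CIRCUMPLEX_COORDS[label][0] for label, p in known.items()) / total
    arousal = sum(p * CIRCUMPLEX_COORDS[label][1] for label, p in known.items()) / total
    return valence, arousal


# Hypothetical model output for one speech segment.
segment_probs = {"angry": 0.5, "neutral": 0.3, "happy": 0.2}
valence, arousal = project_to_circumplex(segment_probs)
print(f"valence={valence:+.2f}, arousal={arousal:+.2f}")
```

A projection of this kind can only reflect what the categorical classifier detects in the signal, so it carries no information about what is being said, which is consistent with the weak Valence correlation reported for the acoustic model.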
