Multimodal Hidden Markov Models for Persistent Emotional State Tracking
Anamika Ragu, Aneesh Jonelagadda

TL;DR
This paper introduces a lightweight multimodal hidden Markov model framework for tracking persistent emotional states in conversations, enhancing interpretability and efficiency over existing utterance-level emotion recognition methods.
Contribution
It proposes a novel sticky factorial HDP-HMM approach for modeling emotional regimes using multimodal data, improving interpretability and computational efficiency.
Findings
The sticky HDP-HMM produces more interpretable emotional regime sequences than a baseline Gaussian HMM.
Regimes can be reliably recovered from multimodal valence-arousal trajectories.
Using emotional regimes improves LLM response quality in clinical dialogue contexts.
Abstract
Tracking an interpretable emotional arc of a conversation via the sentiment of individual utterances processed as a whole is central to both understanding and guiding communication in applied, especially clinical, conversational contexts. Existing approaches to emotion recognition operate at the utterance level, obscuring the persistent phases that characterize real conversational dynamics. We propose a lightweight framework that models conversational emotion as a sequence of latent emotional regimes using sticky factorial HDP-HMMs over multimodal valence-arousal representations derived from simultaneous video, audio and textual input. We evaluate the quality of regime prediction using LLM-as-a-Judge, geometric, and temporal consistency metrics, demonstrating that the sticky HDP-HMM produces more interpretable regime sequences than the baseline Gaussian HMM at a fraction of the…
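The idea of persistent regimes can be illustrated with a much simpler model than the one the abstract describes. The sketch below is not the authors' sticky factorial HDP-HMM: it assumes a fixed number of states, hand-picked valence-arousal means, an isotropic Gaussian emission, and a finite "sticky" transition matrix in which a hypothetical parameter `kappa` adds extra mass to self-transitions. All function names and numbers are illustrative assumptions. It only shows how a self-transition bias makes the decoded regime sequence ignore a single ambiguous utterance.

```python
import numpy as np

def sticky_transition_matrix(n_states, kappa, alpha=1.0):
    """Finite stand-in for a sticky prior: each row gets alpha/n_states
    everywhere plus kappa extra mass on the diagonal, then is normalised."""
    base = np.full((n_states, n_states), alpha / n_states)
    base += kappa * np.eye(n_states)
    return base / base.sum(axis=1, keepdims=True)

def gaussian_log_likelihood(obs, means, var):
    """Log-density of each (valence, arousal) point under each regime,
    assuming an isotropic Gaussian with shared variance `var`."""
    diff = obs[:, None, :] - means[None, :, :]   # shape (T, K, 2)
    sq_dist = (diff ** 2).sum(axis=2)            # shape (T, K)
    dim = obs.shape[1]
    return -0.5 * sq_dist / var - 0.5 * dim * np.log(2 * np.pi * var)

def viterbi(log_lik, trans, init):
    """Most likely state sequence for a standard HMM."""
    n_steps, n_states = log_lik.shape
    log_trans = np.log(trans)
    score = np.log(init) + log_lik[0]
    back = np.zeros((n_steps, n_states), dtype=int)
    for t in range(1, n_steps):
        cand = score[:, None] + log_trans        # cand[i, j]: state i -> j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_lik[t]
    path = np.empty(n_steps, dtype=int)
    path[-1] = score.argmax()
    for t in range(n_steps - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

# Hypothetical regimes: 0 = negative valence / high arousal,
#                       1 = positive valence / low arousal.
means = np.array([[-0.5, 0.5], [0.5, -0.5]])
obs = np.array([
    [-0.5, 0.5], [-0.4, 0.6],
    [0.1, -0.1],                 # one ambiguous utterance
    [-0.6, 0.4], [-0.5, 0.5],
    [0.5, -0.5], [0.5, -0.5], [0.5, -0.5],
])
log_lik = gaussian_log_likelihood(obs, means, var=0.25)
init = np.array([0.5, 0.5])

path_plain = viterbi(log_lik, sticky_transition_matrix(2, kappa=0.0), init)
path_sticky = viterbi(log_lik, sticky_transition_matrix(2, kappa=8.0), init)
print("no self-transition bias:", path_plain.tolist())
print("sticky (kappa=8):       ", path_sticky.tolist())
```

With `kappa=0` the decoder follows each utterance independently and flips regime on the ambiguous third point; with `kappa=8` the flip is suppressed, while the sustained shift at the end is still detected. A nonparametric model such as the one in the abstract would additionally infer the number of regimes rather than fixing it in advance.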
