Emotion-Conditioned Short-Horizon Human Pose Forecasting with a Lightweight Predictive World Model
Jingni Huang, Peter Bloodsworth

TL;DR
This paper explores using facial expression-derived emotion embeddings to improve short-term human pose prediction with a lightweight autoregressive model, showing that emotion signals can enhance prediction accuracy in emotion-driven sequences.
Contribution
The paper introduces a novel multimodal fusion approach that uses a learnable gating mechanism within a lightweight predictive world model for emotion-conditioned pose forecasting.
Findings
Normalized gating fusion improves prediction accuracy for emotion-driven sequences.
Emotion embeddings act as auxiliary signals influencing pose prediction.
Counterfactual experiments show trajectory sensitivity to emotion input changes.
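One way to read the counterfactual finding is as a measurement: keep the observed pose fixed, swap the emotion input, and compare the two predicted trajectories step by step. The sketch below illustrates that measurement only. The names `toy_forecaster`, `trajectory_divergence`, and `counterfactual_sensitivity` are hypothetical, the forecaster is a toy stand-in rather than the paper's model, and per-step Euclidean distance is an assumed metric that the summary does not specify.

```python
import math


def trajectory_divergence(traj_a, traj_b):
    """Per-step Euclidean distance between two predicted pose trajectories."""
    return [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(pose_a, pose_b)))
        for pose_a, pose_b in zip(traj_a, traj_b)
    ]


def toy_forecaster(last_pose, emotion, horizon=15):
    """Hypothetical stand-in forecaster (NOT the paper's model).

    Every coordinate drifts by a constant speed that is scaled by the
    mean activation of the emotion embedding.
    """
    speed = 0.01 * (1.0 + sum(emotion) / len(emotion))
    trajectory, pose = [], list(last_pose)
    for _ in range(horizon):
        pose = [p + speed for p in pose]
        trajectory.append(pose)
    return trajectory


def counterfactual_sensitivity(forecaster, last_pose, emo_factual,
                               emo_counterfactual, horizon=15):
    """Forecast twice from the same pose, changing only the emotion input."""
    factual = forecaster(last_pose, emo_factual, horizon)
    counterfactual = forecaster(last_pose, emo_counterfactual, horizon)
    return trajectory_divergence(factual, counterfactual)


if __name__ == "__main__":
    divergence = counterfactual_sensitivity(
        toy_forecaster, [0.0, 0.0], [1.0, 1.0], [0.0, 0.0]
    )
    print(len(divergence), divergence[-1] > divergence[0])
```

A divergence that stays at zero would indicate that the forecaster ignores the emotion input, while a divergence that grows over the horizon suggests that the emotion signal compounds through the recursive prediction.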
Abstract
Short-term human pose prediction plays a crucial role in interactive systems, assistive robots, and emotion-aware human-computer interaction [1-3]. While current trajectory prediction models primarily rely on geometric motion cues, they often overlook the underlying emotional signals influencing human motion dynamics [4-5]. This paper investigates whether emotion embeddings derived from facial expressions can provide auxiliary conditioning signals for short-term pose prediction. To further evaluate multimodal conditioning in a recursive prediction setting, we propose a lightweight autoregressive predictive world model that performs 15-step rolling pose prediction. This framework fuses pose keypoints with emotion embeddings through a learnable gating mechanism and performs autoregressive rollout using a recurrent sequence model based on a two-layer LSTM architecture. Experiments…
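The pipeline described in the abstract (gated fusion of pose and emotion, then recursive 15-step prediction) can be illustrated with a minimal sketch. Several things here are assumptions rather than details from the paper: the class name `GatedRollout`, the convex-combination form of the gate, the residual pose update, the random untrained weights, and above all the single tanh recurrent cell, which stands in for the two-layer LSTM to keep the example dependency-free.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class GatedRollout:
    """Hypothetical sketch: gated pose/emotion fusion plus autoregressive rollout.

    Assumption: a single tanh recurrent cell replaces the paper's two-layer
    LSTM, and all weights are random (untrained).
    """

    def __init__(self, pose_dim, emo_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        self.W_emo = rand_matrix(pose_dim, emo_dim, rng)              # emotion -> pose space
        self.W_gate = rand_matrix(pose_dim, pose_dim + emo_dim, rng)  # gate from both inputs
        self.W_in = rand_matrix(hidden_dim, pose_dim, rng)
        self.W_rec = rand_matrix(hidden_dim, hidden_dim, rng)
        self.W_out = rand_matrix(pose_dim, hidden_dim, rng)

    def fuse(self, pose, emo):
        # Assumed gate form: per-dimension convex mix of pose and projected emotion.
        gate = [sigmoid(z) for z in matvec(self.W_gate, pose + emo)]
        emo_proj = matvec(self.W_emo, emo)
        return [g * p + (1.0 - g) * e for g, p, e in zip(gate, pose, emo_proj)]

    def step(self, pose, emo, hidden):
        fused = self.fuse(pose, emo)
        pre = [a + b for a, b in zip(matvec(self.W_in, fused),
                                     matvec(self.W_rec, hidden))]
        hidden = [math.tanh(v) for v in pre]
        delta = matvec(self.W_out, hidden)
        # Assumed residual update: predict a displacement from the current pose.
        return [p + d for p, d in zip(pose, delta)], hidden

    def rollout(self, history, emo, horizon=15):
        """Warm up on observed poses, then feed each prediction back as input."""
        hidden = [0.0] * self.hidden_dim
        for pose in history:
            pred, hidden = self.step(pose, emo, hidden)
        predictions = []
        for _ in range(horizon):
            predictions.append(pred)
            pred, hidden = self.step(pred, emo, hidden)
        return predictions


if __name__ == "__main__":
    model = GatedRollout(pose_dim=4, emo_dim=3, hidden_dim=8)
    history = [[0.1 * t] * 4 for t in range(5)]
    future = model.rollout(history, emo=[0.2, -0.1, 0.5])
    print(len(future), len(future[0]))
```

Because each predicted pose becomes the next input, any error or emotion-induced shift at one step propagates to all later steps, which is the property the recursive setting is meant to expose.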
