Learning Triadic Belief Dynamics in Nonverbal Communication from Videos
Lifeng Fan, Shuwen Qiu, Zilong Zheng, Tao Gao, Song-Chun Zhu, Yixin Zhu

TL;DR
This paper introduces a novel model that captures nonverbal cues and belief dynamics among agents in videos, enabling better understanding of social interactions and improved video summarization.
Contribution
It presents a new hierarchical energy-based model that infers agents' beliefs and true states, forming a 'common mind' from nonverbal cues, advancing scene understanding in social contexts.
Findings
Improved video summarization on social interaction videos
Effective modeling of belief dynamics and nonverbal cues
Outperforms state-of-the-art keyframe methods
Abstract
Humans possess a unique social cognition capability; nonverbal communication can convey rich social information among agents. In contrast, such crucial social characteristics are mostly missing in the existing scene understanding literature. In this paper, we incorporate different nonverbal communication cues (e.g., gaze, human poses, and gestures) to represent, model, learn, and infer agents' mental states from pure visual inputs. Crucially, such a mental representation takes the agent's belief into account so that it represents what the true world state is and infers the beliefs in each agent's mental state, which may differ from the true world states. By aggregating different beliefs and true world states, our model essentially forms "five minds" during the interactions between two agents. This "five minds" model differs from prior works that infer beliefs in an infinite recursion;…
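The "five minds" representation described above can be pictured as a small data structure. The sketch below is illustrative only and is not the authors' implementation: the field names (m1, m2, m12, m21, mc), the split into two self beliefs, two estimates of the other agent's belief, and one shared mind, the dictionary encoding of a belief, and the helper false_beliefs are all assumptions introduced here. It shows only the core idea that each mind is tracked separately from the true world state and may disagree with it.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Hypothetical encoding: a belief maps an object name to its believed location.
Belief = Dict[str, str]


@dataclass
class FiveMinds:
    """Illustrative container for two interacting agents (names are assumptions)."""
    m1: Belief = field(default_factory=dict)   # agent 1's own belief
    m2: Belief = field(default_factory=dict)   # agent 2's own belief
    m12: Belief = field(default_factory=dict)  # agent 1's estimate of agent 2's belief
    m21: Belief = field(default_factory=dict)  # agent 2's estimate of agent 1's belief
    mc: Belief = field(default_factory=dict)   # belief shared by both agents


def false_beliefs(belief: Belief, world: Belief) -> Dict[str, Tuple[str, str]]:
    """Return {object: (believed location, true location)} where the two differ."""
    return {
        obj: (belief[obj], true_loc)
        for obj, true_loc in world.items()
        if obj in belief and belief[obj] != true_loc
    }


# Toy scenario: the cup was moved to the table while agent 2 was not looking.
world = {"cup": "table"}
minds = FiveMinds(
    m1={"cup": "table"},   # agent 1 saw the move
    m2={"cup": "shelf"},   # agent 2 still holds the outdated belief
    m12={"cup": "shelf"},  # agent 1 knows agent 2 missed the move
)
```

In this toy scenario, false_beliefs(minds.m2, world) flags the cup for agent 2, while false_beliefs(minds.m1, world) is empty. A real system would have to infer these beliefs from gaze, pose, and gesture cues rather than set them by hand.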
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Human Pose and Action Recognition
