BQA: Body Language Question Answering Dataset for Video Large Language Models
Shintaro Ozaki, Kazuki Hayashi, Miyu Oba, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe

TL;DR
This paper introduces BQA, a new dataset for evaluating Video Large Language Models' ability to interpret emotions from body language in videos, highlighting current challenges and biases in understanding nonverbal cues.
Contribution
The paper presents BQA, a novel dataset for body language question answering, and analyzes the performance and biases of existing VideoLLMs on this dataset.
Findings
Understanding body language remains challenging for VideoLLMs.
Certain models show bias based on age and ethnicity.
The dataset reveals gaps in current VideoLLMs' nonverbal understanding.
Abstract
A large part of human communication relies on nonverbal cues such as facial expressions, eye contact, and body language. Unlike spoken or sign language, such nonverbal communication lacks formal rules, so interpreting it requires complex reasoning grounded in commonsense understanding. Enabling current Video Large Language Models (VideoLLMs) to accurately interpret body language is a crucial challenge, as unconscious human actions can easily cause a model to misinterpret a person's intent. To address this, we propose BQA, a body language question answering dataset, to validate whether a model can correctly interpret emotions from short video clips of body language covering 26 emotion labels. We evaluated various VideoLLMs on BQA and found that understanding body language is challenging, and our analyses of the wrong answers by VideoLLMs show that certain VideoLLMs made…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
