BQA: Body Language Question Answering Dataset for Video Large Language Models

Shintaro Ozaki; Kazuki Hayashi; Miyu Oba; Yusuke Sakai; Hidetaka Kamigaito; Taro Watanabe

arXiv:2410.13206·cs.CL·August 20, 2025

TL;DR

This paper introduces BQA, a new dataset for evaluating Video Large Language Models' ability to interpret emotions from body language in videos, highlighting current challenges and biases in understanding nonverbal cues.

Contribution

The paper presents BQA, a novel dataset for body language question answering, and analyzes the performance and biases of existing VideoLLMs on this dataset.

Findings

01 Understanding body language remains challenging for VideoLLMs.

02 Certain models show bias based on age and ethnicity.

03 The dataset reveals gaps in current VideoLLMs' nonverbal understanding.

Abstract

A large part of human communication relies on nonverbal cues such as facial expressions, eye contact, and body language. Unlike spoken or sign language, such nonverbal communication lacks formal rules, so interpreting it requires complex reasoning grounded in commonsense understanding. Enabling current Video Large Language Models (VideoLLMs) to accurately interpret body language is a crucial challenge, as unconscious human actions can easily cause a model to misinterpret a person's intent. To address this, we propose BQA, a body language question answering dataset, to validate whether a model can correctly interpret emotions from short video clips of body language covering 26 emotion labels. We evaluated various VideoLLMs on BQA and revealed that understanding body language is challenging, and our analyses of the wrong answers by VideoLLMs show that certain VideoLLMs made…
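Scoring a model on an emotion-label benchmark of this kind can be sketched as below. This is a minimal illustration, not the authors' evaluation code: the function name `score_predictions`, the (gold, predicted) pair format, and the example labels are all assumptions made for the sketch, and the actual BQA record format may differ.

```python
from collections import Counter


def score_predictions(pairs):
    """Return (overall accuracy, per-label accuracy) for (gold, predicted) pairs.

    Assumption: both entries are plain label strings; they are compared
    after trimming whitespace and lowercasing.
    """
    total = Counter()    # number of examples per gold label
    correct = Counter()  # number of correct predictions per gold label
    for gold, pred in pairs:
        gold_norm = gold.strip().lower()
        pred_norm = pred.strip().lower()
        total[gold_norm] += 1
        if gold_norm == pred_norm:
            correct[gold_norm] += 1
    n = sum(total.values())
    overall = sum(correct.values()) / n if n else 0.0
    per_label = {label: correct[label] / total[label] for label in total}
    return overall, per_label


# Hypothetical labels and predictions, for illustration only.
example_pairs = [
    ("happiness", "happiness"),
    ("fear", "surprise"),
    ("fear", "fear"),
    ("anger", "Anger "),
]
overall, per_label = score_predictions(example_pairs)
print(overall, per_label)
```

Reporting per-label accuracy alongside the overall figure makes it easier to see which emotions a model confuses, which a single aggregate number would hide.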

Code & Models

Datasets

naist-nlp/BQA
dataset · 27 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization