MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents

Kangsan Kim; Yanlai Yang; Suji Kim; Woongyeong Yeo; Youngwan Lee; Mengye Ren; Sung Ju Hwang

arXiv:2603.09827 · cs.CV · March 12, 2026


TL;DR

This paper introduces MA-EgoQA, a benchmark for question answering over multiple egocentric videos collected simultaneously from embodied agents. It highlights where current models fall short and proposes a baseline model for multi-agent video understanding.

Contribution

The paper formally defines the new problem of understanding multiple long-horizon egocentric videos, builds a benchmark dataset of 1.7k questions across five categories, and proposes a baseline model for evaluating system-level comprehension.

Findings

01

Current models struggle with multi-stream egocentric video understanding.

02

MA-EgoQA contains 1.7k questions across five categories.

03

Baseline models show significant room for improvement.

Abstract

As embodied models become powerful, humans will collaborate with multiple embodied AI agents at their workplace or home in the future. To ensure better communication between human users and the multi-agent system, it is crucial to interpret incoming information from agents in parallel and refer to the appropriate context for each query. Existing challenges include effectively compressing and communicating high volumes of individual sensory inputs in the form of video and correctly aggregating multiple egocentric videos to construct system-level memory. In this work, we first formally define a novel problem of understanding multiple long-horizon egocentric videos simultaneously collected from embodied agents. To facilitate research in this direction, we introduce MultiAgent-EgoQA (MA-EgoQA), a benchmark designed to systemically evaluate existing models in our scenario. MA-EgoQA provides…
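The aggregation step the abstract describes (merging several egocentric streams into one system-level memory, then pulling the right context for each query) can be illustrated with a small sketch. This is not the paper's baseline model. It assumes each video has already been reduced to time-stamped text captions on a shared clock, and it uses plain word overlap for retrieval. The names `Event`, `build_memory`, and `retrieve`, the agent identifiers, and the sample captions are all hypothetical.

```python
from dataclasses import dataclass
import heapq


@dataclass(frozen=True)
class Event:
    agent: str      # hypothetical agent identifier
    start_s: float  # clip start time in seconds on a shared clock
    caption: str    # text summary standing in for the raw video clip


def build_memory(streams):
    """Merge per-agent event lists (each already sorted by time) into one timeline."""
    return list(heapq.merge(*streams.values(), key=lambda e: e.start_s))


def retrieve(memory, query, k=2):
    """Return up to k events ranked by word overlap with the query.

    Events with no overlap are dropped; ties keep timeline order
    because Python's sort is stable.
    """
    q = set(query.lower().split())
    scored = [(len(q & set(e.caption.lower().split())), e) for e in memory]
    hits = [(s, e) for s, e in scored if s > 0]
    hits.sort(key=lambda pair: -pair[0])
    return [e for _, e in hits[:k]]


# Made-up captions for two hypothetical agents (not taken from the dataset).
streams = {
    "agent_a": [
        Event("agent_a", 0.0, "picks up the red mug in the kitchen"),
        Event("agent_a", 40.0, "places the red mug on the desk"),
    ],
    "agent_b": [
        Event("agent_b", 20.0, "opens the office door"),
    ],
}

memory = build_memory(streams)
top = retrieve(memory, "where is the red mug")
for e in top:
    print(e.agent, e.start_s, e.caption)
```

A real system would likely replace the word-overlap score with learned embeddings and would need to compress hours of video per agent, which is the difficulty the abstract points to.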


Code & Models

Datasets

KangsanKim71/MA-EgoQA
dataset · 279 downloads


Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Action Observation and Synchronization