MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents
Kangsan Kim, Yanlai Yang, Suji Kim, Woongyeong Yeo, Youngwan Lee, Mengye Ren, Sung Ju Hwang

TL;DR
This paper introduces MA-EgoQA, a benchmark for understanding multiple egocentric videos from embodied agents, highlighting current challenges and proposing a baseline model to evaluate multi-agent video understanding capabilities.
Contribution
The paper formally defines the problem of multi-egocentric video understanding, builds a benchmark of 1.7k questions across five categories, and proposes a baseline model for evaluating system-level comprehension.
Findings
Current models struggle with multi-stream egocentric video understanding.
MA-EgoQA contains 1.7k questions across five categories.
Baseline models show significant room for improvement.
Abstract
As embodied models become powerful, humans will collaborate with multiple embodied AI agents at their workplace or home in the future. To ensure better communication between human users and the multi-agent system, it is crucial to interpret incoming information from agents in parallel and refer to the appropriate context for each query. Existing challenges include effectively compressing and communicating high volumes of individual sensory inputs in the form of video and correctly aggregating multiple egocentric videos to construct system-level memory. In this work, we first formally define a novel problem of understanding multiple long-horizon egocentric videos simultaneously collected from embodied agents. To facilitate research in this direction, we introduce MultiAgent-EgoQA (MA-EgoQA), a benchmark designed to systemically evaluate existing models in our scenario. MA-EgoQA provides…
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Action Observation and Synchronization
