The N-Body Problem: Parallel Execution from Single-Person Egocentric Video
Zhifan Zhu, Yifei Huang, Yoichi Sato, Dima Damen

TL;DR
This paper formalizes the N-Body Problem in egocentric videos, proposing a structured prompting method for Vision-Language Models to generate feasible parallel activity plans, significantly improving action coverage and reducing conflicts.
Contribution
It introduces the N-Body Problem framework, new evaluation metrics, and a structured prompting strategy for VLMs to reason about multi-agent task execution from a single egocentric video.
Findings
Action coverage increased by 45% for N=2
Collision rates reduced by 55%
Object and causal conflicts reduced by 45-55%
Abstract
Humans can intuitively parallelise complex activities, but can a model learn this from observing a single person? Given one egocentric video, we introduce the N-Body Problem: how N individuals can hypothetically perform the same set of tasks observed in this video. The goal is to maximise speed-up, but naive assignment of video segments to individuals often violates real-world constraints, leading to physically impossible scenarios like two people using the same object or occupying the same space. To address this, we formalise the N-Body Problem and propose a suite of metrics to evaluate both performance (speed-up, task coverage) and feasibility (spatial collisions, object conflicts and causal constraints). We then introduce a structured prompting strategy that guides a Vision-Language Model (VLM) to reason about the 3D environment, object usage, and temporal dependencies to produce a…
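To make the performance and feasibility metrics named in the abstract more concrete, here is a minimal Python sketch of three of them (speed-up, task coverage, object conflicts) on a toy plan. The definitions are assumptions made for illustration, not the paper's formulas: speed-up is taken as total original duration divided by the plan's makespan, coverage as the fraction of source segments that appear in the plan, and an object conflict as two different agents using a shared object at overlapping times. The `Segment` and `Scheduled` classes, the function names, and the kitchen example are all hypothetical. Spatial collisions and causal constraints are left out because they need 3D positions and dependency information that this summary does not describe.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Segment:
    """One action segment from the source video (hypothetical schema)."""
    name: str
    duration: float                    # seconds in the original video
    objects: frozenset = frozenset()   # objects the action touches


@dataclass(frozen=True)
class Scheduled:
    """A segment assigned to an agent at a start time in the parallel plan."""
    segment: Segment
    agent: int
    start: float

    @property
    def end(self) -> float:
        return self.start + self.segment.duration


def speed_up(segments, plan):
    """Assumed definition: original total duration / makespan of the plan."""
    original = sum(s.duration for s in segments)
    makespan = max((p.end for p in plan), default=0.0)
    return original / makespan if makespan > 0 else 0.0


def coverage(segments, plan):
    """Assumed definition: fraction of source segments present in the plan."""
    if not segments:
        return 0.0
    planned = {p.segment.name for p in plan}
    return sum(1 for s in segments if s.name in planned) / len(segments)


def object_conflicts(plan):
    """Count pairs where different agents use a shared object at the same time."""
    count = 0
    for a, b in combinations(plan, 2):
        overlaps = a.start < b.end and b.start < a.end
        shared = a.segment.objects & b.segment.objects
        if a.agent != b.agent and overlaps and shared:
            count += 1
    return count


# Toy example (invented data): three kitchen actions, two agents.
chop = Segment("chop", 60.0, frozenset({"knife", "board"}))
boil = Segment("boil", 120.0, frozenset({"pot"}))
slice_ = Segment("slice", 40.0, frozenset({"knife"}))
segments = [chop, boil, slice_]

# Naive plan: both agents reach for the knife at time 0.
naive = [Scheduled(chop, 0, 0.0), Scheduled(slice_, 1, 0.0), Scheduled(boil, 1, 40.0)]

# Feasible plan: the two knife actions run back to back on one agent.
feasible = [Scheduled(chop, 0, 0.0), Scheduled(boil, 1, 0.0), Scheduled(slice_, 0, 60.0)]

for label, plan in (("naive", naive), ("feasible", feasible)):
    print(label, round(speed_up(segments, plan), 2),
          coverage(segments, plan), object_conflicts(plan))
```

On this toy input the naive plan has one object conflict (the shared knife), while the feasible plan covers every segment with no conflicts. That mirrors the abstract's point that a plan has to be judged on feasibility as well as speed.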
Taxonomy
Topics: Human Pose and Action Recognition · Robot Manipulation and Learning · Social Robot Interaction and HRI
