Perceiving and Acting in First-Person: A Dataset and Benchmark for Egocentric Human-Object-Human Interactions

Liang Xu; Chengqun Yang; Zili Lin; Fei Xu; Yifan Liu; Congsheng Xu; Yiyi Zhang; Jie Qin; Xingdong Sheng; Yunhui Liu; Xin Jin; Yichao Yan; Wenjun Zeng; Xiaokang Yang

arXiv:2508.04681·cs.CV·August 7, 2025

TL;DR

This paper introduces InterVLA, a large-scale egocentric human-object-human interaction dataset with multimodal data, and establishes benchmarks for motion estimation, interaction synthesis, and prediction to advance AI assistants in physical environments.

Contribution

The paper presents InterVLA, which the authors describe as the first large-scale human-object-human interaction dataset, captured from an egocentric perspective and integrating vision, language, and action modalities. It also develops benchmarks for key tasks on this data.

Findings

01

InterVLA contains 11.4 hours and 1.2 million frames of multimodal data.

02

Benchmarks for egocentric motion estimation, interaction synthesis, and prediction are established.

03

Analysis demonstrates the dataset's potential to improve AI agent capabilities.

Abstract

Learning action models from real-world human-centric interaction datasets is an important step towards building general-purpose intelligent assistants efficiently. However, most existing datasets cover only specialist interaction categories and ignore the fact that AI assistants perceive and act based on first-person observations. We argue that both generalist interaction knowledge and the egocentric modality are indispensable. In this paper, we embed the manual-assisted task into a vision-language-action framework, where the assistant provides services to the instructor following egocentric vision and commands. With our hybrid RGB-MoCap system, pairs of assistants and instructors engage with multiple objects and the scene following GPT-generated scripts. Under this setting, we build InterVLA, the first large-scale human-object-human interaction dataset with 11.4 hours and 1.2M frames of multimodal…
