Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding

Yue Fan; Xiaojian Ma; Rongpeng Su; Jun Guo; Rujie Wu; Xi Chen; Qing Li

arXiv:2501.00358·cs.CV·January 10, 2025

TL;DR

This paper introduces Embodied VideoAgent, an LLM-based system that constructs persistent scene memory from egocentric videos and sensory data, enabling improved understanding and reasoning in dynamic 3D environments for robotics and embodied AI.

Contribution

The paper proposes Embodied VideoAgent, an LLM-based agent that builds scene memory from egocentric video and embodied sensory inputs (e.g. depth and pose sensing) and uses a VLM-based mechanism to update that memory when actions on objects are perceived.

Findings

1. Achieved a 4.9% improvement on Ego4D-VQ3D

2. Achieved a 5.8% improvement on OpenEQA

3. Achieved an 11.7% improvement on EnvQA

Abstract

This paper investigates the problem of understanding dynamic 3D scenes from egocentric observations, a key challenge in robotics and embodied AI. Unlike prior studies that explored this as long-form video understanding and utilized egocentric video only, we instead propose an LLM-based agent, Embodied VideoAgent, which constructs scene memory from both egocentric video and embodied sensory inputs (e.g. depth and pose sensing). We further introduce a VLM-based approach to automatically update the memory when actions or activities over objects are perceived. Embodied VideoAgent attains significant advantages over counterparts in challenging reasoning and planning tasks in 3D scenes, achieving gains of 4.9% on Ego4D-VQ3D, 5.8% on OpenEQA, and 11.7% on EnvQA. We have also demonstrated its potential in various embodied AI tasks including generating embodied interactions and perception for…
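
To make the idea of a scene memory built from video plus depth and pose more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a pinhole camera model, a camera-to-world pose given as a rotation matrix and translation, and that object detections and perceived state changes are supplied by some upstream model. The names `ObjectEntry`, `SceneMemory`, `backproject`, `observe`, and `apply_action` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class ObjectEntry:
    """One memory entry. Hypothetical schema, not the paper's."""
    name: str
    position: tuple        # 3D location in the world frame
    state: str = "unknown"
    last_seen: int = 0     # index of the last frame that touched this entry


def backproject(u, v, depth, intrinsics, pose):
    """Lift pixel (u, v) with metric depth into world coordinates.

    intrinsics: (fx, fy, cx, cy) of an assumed pinhole camera.
    pose: (R, t), camera-to-world rotation (3x3 nested list) and translation.
    """
    fx, fy, cx, cy = intrinsics
    # Point in the camera frame under the pinhole model.
    cam = ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)
    R, t = pose
    # World point = R @ cam + t, written out without external libraries.
    return tuple(sum(R[i][j] * cam[j] for j in range(3)) + t[i] for i in range(3))


class SceneMemory:
    """Persistent per-object memory keyed by object name (an assumption)."""

    def __init__(self):
        self.objects = {}

    def observe(self, name, u, v, depth, intrinsics, pose, frame):
        """Insert or refresh an object using a detection plus depth and pose."""
        position = backproject(u, v, depth, intrinsics, pose)
        entry = self.objects.get(name)
        if entry is None:
            self.objects[name] = ObjectEntry(name, position, last_seen=frame)
        else:
            entry.position = position
            entry.last_seen = frame

    def apply_action(self, name, new_state, frame):
        """Record a perceived state change.

        Stand-in for the VLM-based update: here the caller supplies the
        new state directly instead of a model inferring it from video.
        Returns False if the object is not in memory.
        """
        entry = self.objects.get(name)
        if entry is None:
            return False
        entry.state = new_state
        entry.last_seen = frame
        return True
```

In this sketch, calling `observe` for each detection and `apply_action` whenever an interaction is perceived keeps a single entry per object current across frames, which is the sense in which the memory is persistent. How the paper actually associates detections with entries and decides on updates is not specified in the abstract.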

Taxonomy

Topics: Reinforcement Learning in Robotics