ECHO: Ego-Centric modeling of Human-Object interactions
Ilya A. Petrov, Vladimir Guzov, Riccardo Marin, Emre Aksan, Xu Chen, Daniel Cremers, Thabo Beeler, Gerard Pons-Moll

TL;DR
ECHO is a unified egocentric modeling framework that jointly recovers human pose, object motion, and contact dynamics from head and wrist tracking, using a novel diffusion process for flexible, robust interaction modeling.
Contribution
ECHO introduces a tri-variate diffusion process that models human-object interactions from sparse egocentric data (head and wrist tracking). The formulation handles flexible input configurations and allows training on mixed datasets.
Findings
Achieves state-of-the-art performance in human-object interaction modeling.
Robust to intermittent tracking and partial observations.
Capable of generating temporally consistent long sequences.
Abstract
Modeling human-object interactions (HOI) from an egocentric perspective is a critical yet challenging task, particularly when relying on sparse signals from wearable devices like smart glasses and watches. We present ECHO, the first unified framework to jointly recover human pose, object motion, and contact dynamics solely from head and wrist tracking. To tackle the underconstrained nature of this problem, we introduce a novel tri-variate diffusion process with independent noise schedules that models the mutual dependencies between the human, object, and interaction modalities. This formulation allows ECHO to operate with flexible input configurations, making it robust to intermittent tracking and capable of leveraging partial observations. Crucially, it enables training on a combination of large-scale human motion datasets and smaller HOI collections, learning strong priors while…
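The idea of independent noise schedules per modality can be illustrated with a small forward-noising sketch. This is not the authors' implementation: the cosine schedule, the feature dimensions, and the names `cosine_alpha_bar` and `noise_modalities` are assumptions chosen for illustration. It shows only the general mechanism in which each of the three modalities receives its own timestep, so an observed modality can be held clean (timestep 0) while the others are noised.

```python
import numpy as np

def cosine_alpha_bar(t, T):
    """Cumulative signal level at timestep t in [0, T]; equals 1 at t = 0.
    The cosine schedule is an assumption, not taken from the paper."""
    s = 0.008
    f = lambda u: np.cos((u / T + s) / (1 + s) * np.pi / 2) ** 2
    return f(t) / f(0)

def noise_modalities(x, t, T, rng):
    """Noise each modality with its own timestep.

    x: dict mapping modality name -> clean array
    t: dict mapping modality name -> integer timestep in [0, T]
    """
    out = {}
    for name, clean in x.items():
        a = cosine_alpha_bar(t[name], T)
        eps = rng.standard_normal(clean.shape)
        out[name] = np.sqrt(a) * clean + np.sqrt(1.0 - a) * eps
    return out

rng = np.random.default_rng(0)
T = 1000

# Hypothetical feature sizes for a 60-frame window.
x = {
    "human": rng.standard_normal((60, 66)),
    "object": rng.standard_normal((60, 9)),
    "interaction": rng.standard_normal((60, 2)),
}

# Training-style sample: one independent timestep per modality.
t_train = {name: int(rng.integers(0, T + 1)) for name in x}
noisy = noise_modalities(x, t_train, T, rng)

# Conditioning-style sample: the object is treated as observed (t = 0),
# while the human and interaction modalities are fully noised.
t_cond = {"human": T, "object": 0, "interaction": T}
cond = noise_modalities(x, t_cond, T, rng)
```

Under these assumptions, a denoiser that receives all three timesteps as input could be trained once and then queried with different timestep combinations, which is one plausible way a single model might cover both partial observations and data that lacks the object modality.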
Taxonomy
Topics: Human Pose and Action Recognition · Gaze Tracking and Assistive Technology · Human Motion and Animation
