# ECHO: Ego-Centric modeling of Human-Object interactions

**Authors:** Ilya A. Petrov, Vladimir Guzov, Riccardo Marin, Emre Aksan, Xu Chen, Daniel Cremers, Thabo Beeler, Gerard Pons-Moll

arXiv: 2508.21556 · 2026-03-17

## TL;DR

ECHO is a unified egocentric modeling framework that jointly recovers human pose, object motion, and contact dynamics from head and wrist tracking, using a novel diffusion process for flexible, robust interaction modeling.

## Contribution

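ECHO introduces a tri-variate diffusion process, with independent noise schedules for the human, object, and interaction modalities, to model human-object interactions from sparse egocentric data. This formulation supports flexible input configurations and allows training on a mix of large-scale human motion datasets and smaller HOI collections.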
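To make the idea of independent noise schedules concrete, here is a minimal NumPy sketch of a forward noising step in which each modality carries its own timestep. It is an illustration under stated assumptions, not the authors' implementation: the cosine schedule, the function names (`cosine_alpha_bar`, `noise_modalities`), and the feature dimensions are all hypothetical choices made for this example.

```python
import numpy as np

def cosine_alpha_bar(t, T):
    """Cumulative signal level at step t under a cosine schedule (assumed choice)."""
    s = 0.008
    f = lambda u: np.cos((u / T + s) / (1 + s) * np.pi / 2) ** 2
    return f(t) / f(0)

def noise_modalities(x, t, T, rng):
    """Noise every modality with its own timestep.

    x: dict mapping modality name -> clean array of shape (frames, features)
    t: dict mapping modality name -> integer timestep in [0, T]
    """
    noisy = {}
    for name, x0 in x.items():
        a = cosine_alpha_bar(t[name], T)
        eps = rng.standard_normal(x0.shape)
        # Standard Gaussian forward process, applied per modality.
        noisy[name] = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return noisy

rng = np.random.default_rng(0)
T = 1000
# Hypothetical feature sizes, chosen only for illustration.
x = {
    "human": rng.standard_normal((60, 132)),
    "object": rng.standard_normal((60, 9)),
    "contact": rng.standard_normal((60, 2)),
}
# t = 0 keeps a modality clean (treated as observed); t = T replaces it with noise.
t = {"human": 0, "object": T, "contact": T}
noisy = noise_modalities(x, t, T, rng)
```

In this sketch, choosing which modalities sit at timestep 0 is what stands in for a flexible input configuration: an observed modality is left clean and acts as conditioning, while the others start from noise. Sampling the three timesteps independently during training would also let a motion-only dataset contribute by keeping the missing modalities at full noise, though how ECHO handles this in detail is not specified in the summary above.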

## Key findings

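- Achieves state-of-the-art performance in egocentric human-object interaction modeling, significantly outperforming existing methods that lack flexible input handling.
- Remains robust to intermittent tracking and can exploit partial observations.
- Generates temporally consistent interactions for arbitrarily long sequences through smooth inpainting at inference time.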

## Abstract

Modeling human-object interactions (HOI) from an egocentric perspective is a critical yet challenging task, particularly when relying on sparse signals from wearable devices like smart glasses and watches. We present ECHO, the first unified framework to jointly recover human pose, object motion, and contact dynamics solely from head and wrist tracking. To tackle the underconstrained nature of this problem, we introduce a novel tri-variate diffusion process with independent noise schedules that models the mutual dependencies between the human, object, and interaction modalities. This formulation allows ECHO to operate with flexible input configurations, making it robust to intermittent tracking and capable of leveraging partial observations. Crucially, it enables training on a combination of large-scale human motion datasets and smaller HOI collections, learning strong priors while capturing interaction nuances. Furthermore, we employ a smooth inpainting inference mechanism that enables the generation of temporally consistent interactions for arbitrarily long sequences. Extensive evaluations demonstrate that ECHO achieves state-of-the-art performance, significantly outperforming existing methods lacking such flexibility.
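The abstract does not spell out the smooth inpainting mechanism, so the following sketch only illustrates the general idea of extending a sequence window by window with a soft transition across overlapping frames. It swaps in a plain linear cross-fade on finished windows, which is a simplification and not the paper's method; `soft_mask`, `stitch`, and the window sizes are hypothetical.

```python
import numpy as np

def soft_mask(overlap):
    """Weights falling from 1 toward 0 across the overlap (endpoints excluded)."""
    return np.linspace(1.0, 0.0, overlap + 2)[1:-1]

def stitch(windows, overlap):
    """Join fixed-length windows, cross-fading the frames they share.

    windows: list of arrays of shape (frames, features), each longer than `overlap`
    """
    out = windows[0].copy()
    w = soft_mask(overlap)[:, None]  # broadcast over the feature axis
    for win in windows[1:]:
        # Earlier frames dominate at the start of the overlap, new ones at the end.
        blended = w * out[-overlap:] + (1.0 - w) * win[:overlap]
        out = np.concatenate([out[:-overlap], blended, win[overlap:]], axis=0)
    return out

# Three constant windows make the transition easy to inspect.
windows = [np.full((10, 3), v) for v in (0.0, 1.0, 2.0)]
result = stitch(windows, overlap=4)
print(result.shape)
print(result[5:11, 0])
```

In a diffusion setting the same soft weights would more plausibly be applied inside the denoising loop, pulling the overlapping frames of a new window toward the previously generated ones at each step, rather than blending finished outputs as done here.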

---
Source: https://tomesphere.com/paper/2508.21556