TL;DR
LOME is an egocentric world model that generates realistic human-object interaction videos conditioned on an input image, a text prompt, and per-frame human actions (body poses and hand gestures), with the aim of better generalization and physical realism in manipulation tasks.
Contribution
LOME introduces a training method that jointly estimates human actions and environment context, improving action-following accuracy and physical realism in generated videos.
Findings
LOME outperforms state-of-the-art methods in temporal consistency and motion control.
The model generalizes well to unseen scenarios.
LOME produces realistic physical effects like liquid flow during pouring.
Abstract
Learning human-object manipulation presents significant challenges due to the fine-grained and contact-rich nature of the motions involved. Traditional physics-based animation requires extensive modeling and manual setup, and more importantly, it neither generalizes well across diverse object morphologies nor scales effectively to real-world environments. To address these limitations, we introduce LOME, an egocentric world model that can generate realistic human-object interactions as videos conditioned on an input image, a text prompt, and per-frame human actions, including both body poses and hand gestures. LOME injects strong and precise action guidance into object manipulation by jointly estimating spatial human actions and the environment context during training. After finetuning a pretrained video generative model on videos of diverse egocentric human-object interactions, LOME…
