Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos
Mengmeng Ge, Takashi Isobe, Xu Jia, Yanan Sun, Zetong Yang, Weinong Wang, Dong Zhou, Dong Li, Huchuan Lu, Emad Barsoum

TL;DR
This paper introduces EgoIn, a framework for generating intermediate frames that depict object transformations in egocentric videos. It targets two challenges in visual state transitions: understanding the initial and target scenes, and keeping the generated transition consistent with them.
Contribution
The paper combines three components, TransitionVLM, transition conditioning, and object-aware supervision, to improve egocentric object state transition generation.
Findings
EgoIn outperforms existing models in generating coherent transformation sequences.
The method effectively preserves object appearance during transitions.
Experiments show improvements in both semantic and visual quality.
Abstract
Understanding physical transformation processes is crucial for both human cognition and artificial intelligence systems, particularly from an egocentric perspective, which serves as a key bridge between humans and machines in action modeling. We define this modeling process as Egocentric Instructed Visual State Transition (EIVST), which involves generating intermediate frames that depict object transformations between initial and target states under a brief action instruction. EIVST poses two challenges for current generative models: (1) understanding the visual scenes of the initial and target states and reasoning about transformation steps from an egocentric view, and (2) generating a consistent intermediate transition that follows the given instruction while preserving object appearance across the two visual states. To address these challenges, we propose the EgoIn framework. It…
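The input/output shape of the EIVST task described above can be sketched as follows. This is a minimal illustration, not the EgoIn method: the names `Frame`, `EIVSTSample`, and `crossfade_baseline` are hypothetical, frames are simplified to small grayscale grids, and the baseline is a plain pixel-wise blend that ignores the instruction. It only shows what goes in (initial state, target state, brief instruction) and what comes out (an ordered list of intermediate frames).

```python
from dataclasses import dataclass
from typing import List

# Hypothetical stand-in for an image: rows of grayscale pixel values.
Frame = List[List[float]]


@dataclass
class EIVSTSample:
    """One task instance: two visual states plus a brief action instruction."""
    initial: Frame
    target: Frame
    instruction: str


def crossfade_baseline(sample: EIVSTSample, num_frames: int) -> List[Frame]:
    """Naive pixel-wise blend between the initial and target states.

    NOT the paper's approach: it ignores the instruction and does no
    reasoning about the transformation. It only illustrates the interface.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    frames: List[Frame] = []
    for k in range(1, num_frames + 1):
        t = k / (num_frames + 1)  # blend weight strictly between 0 and 1
        frames.append([
            [(1.0 - t) * a + t * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(sample.initial, sample.target)
        ])
    return frames


# Usage with toy 1x2 "images" and a made-up instruction.
sample = EIVSTSample(initial=[[0.0, 0.0]], target=[[4.0, 8.0]],
                     instruction="cut the apple")
frames = crossfade_baseline(sample, num_frames=3)
```

A blend like this tends to produce ghosting rather than a plausible transformation, which is one way to see why the task calls for scene understanding and appearance-preserving generation rather than interpolation alone.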
