Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos

Mengmeng Ge; Takashi Isobe; Xu Jia; Yanan Sun; Zetong Yang; Weinong Wang; Dong Zhou; Dong Li; Huchuan Lu; Emad Barsoum

arXiv:2604.17749·cs.CV·April 21, 2026


TL;DR

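This paper introduces EgoIn, a framework that generates the intermediate frames depicting an object's transformation between an initial and a target state in egocentric video, guided by a brief action instruction. It targets two challenges: understanding both visual states well enough to reason about the transformation steps, and keeping the generated transition consistent with the instruction and with the object's appearance.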

Contribution

The paper proposes an approach to egocentric object state transition generation that combines three components: TransitionVLM, transition conditioning, and object-aware supervision.

Findings

01 EgoIn outperforms existing models in generating coherent transformation sequences.

02 The method effectively preserves object appearance during transitions.

03 Experiments demonstrate semantic and visual quality improvements.

Abstract

Understanding physical transformation processes is crucial for both human cognition and artificial intelligence systems, particularly from an egocentric perspective, which serves as a key bridge between humans and machines in action modeling. We define this modeling process as Egocentric Instructed Visual State Transition (EIVST), which involves generating intermediate frames that depict object transformations between initial and target states under a brief action instruction. EIVST poses two challenges for current generative models: (1) understanding the visual scenes of the initial and target states and reasoning about transformation steps from an egocentric view, and (2) generating a consistent intermediate transition that follows the given instruction while preserving object appearance across the two visual states. To address these challenges, we propose the EgoIn framework. It…
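To make the EIVST task definition above concrete, the following is a minimal sketch of its input/output contract only: an initial frame, a target frame, and a brief instruction go in, and a sequence of intermediate frames comes out. Everything named here (`EIVSTQuery`, `crossfade_baseline`, the toy grayscale `Frame` type) is hypothetical and not taken from the paper. The generator is a naive linear cross-fade, used purely as a stand-in; it is not the EgoIn method and ignores the instruction.

```python
from dataclasses import dataclass
from typing import List

# Toy grayscale frame: a list of rows, each a list of pixel intensities.
Frame = List[List[float]]


@dataclass
class EIVSTQuery:
    """Hypothetical container for one EIVST input (names are illustrative)."""
    initial: Frame     # frame showing the object's initial state
    target: Frame      # frame showing the object's target state
    instruction: str   # brief action instruction, e.g. "cut the apple"


def crossfade_baseline(query: EIVSTQuery, num_intermediate: int) -> List[Frame]:
    """Placeholder generator: linear pixel blend from initial to target.

    This is NOT the paper's method. It ignores query.instruction and does no
    scene understanding; it only illustrates the shape of the task's output.
    """
    if num_intermediate < 1:
        raise ValueError("num_intermediate must be at least 1")
    frames: List[Frame] = []
    for k in range(1, num_intermediate + 1):
        # Blend weight strictly between 0 (initial) and 1 (target).
        t = k / (num_intermediate + 1)
        frames.append([
            [(1 - t) * a + t * b for a, b in zip(row_init, row_tgt)]
            for row_init, row_tgt in zip(query.initial, query.target)
        ])
    return frames


query = EIVSTQuery(initial=[[0.0, 0.0]], target=[[1.0, 1.0]],
                   instruction="cut the apple")
frames = crossfade_baseline(query, num_intermediate=3)
```

A blend like this produces ghosted overlays rather than a plausible transformation, which is one way to see why the abstract lists scene understanding and appearance-preserving, instruction-following generation as the two challenges.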
