Do-Undo Bench: Reversibility for Action Understanding in Image Generation
Shweta Mahajan, Shreya Kadambi, Hoang Le, Rajeev Yasarla, Apratim Bhattacharyya, Munawar Hayat, Fatih Porikli

TL;DR
The paper introduces the Do-Undo benchmark to evaluate vision-language models' ability to understand and reverse real-world scene transformations caused by actions, emphasizing cause-and-effect reasoning.
Contribution
It presents a new task and benchmark for reversible action understanding in image generation, highlighting current models' limitations in this area.
Findings
Current models struggle with action reversibility.
The benchmark enables evaluation of cause-and-effect understanding.
It highlights the gap in action-aware multimodal reasoning.
Abstract
We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating plausible scene transformations driven by real-world actions. Unlike prior work that relies on prompt-based image generation and editing to perform action-conditioned image manipulation, Do-Undo requires models to simulate the outcome of a real-world action and then reverse it to recover the original state. This forward-reverse requirement tests genuine cause-and-effect understanding rather than stylistic or semantic edits. We curate a high-quality benchmark of reversible actions from real-world scenarios to enable robust action grounding. Our experiments reveal that current models struggle with action reversibility, highlighting the need to evaluate action understanding. Do-Undo provides an intuitive testbed for evaluating and advancing action-aware…
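The forward-reverse requirement described in the abstract can be illustrated with a small round-trip check. This is only a sketch under stated assumptions: the summary does not specify the paper's metrics or model interface, so the `edit` callable, the prompts, the toy editors, and the mean-absolute-difference score are all hypothetical stand-ins rather than the authors' evaluation protocol.

```python
from typing import Callable, List

# Illustrative stand-in: a grayscale image as rows of pixel values.
Image = List[List[float]]
# Hypothetical model interface: (image, action prompt) -> edited image.
Editor = Callable[[Image, str], Image]


def mean_abs_diff(a: Image, b: Image) -> float:
    """Mean absolute per-pixel difference between two same-sized images."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    return total / count


def do_undo_error(image: Image, do_prompt: str, undo_prompt: str, edit: Editor) -> float:
    """Apply the action, apply its reverse, and compare with the original."""
    after_action = edit(image, do_prompt)        # "do" step
    restored = edit(after_action, undo_prompt)   # "undo" step
    return mean_abs_diff(image, restored)


def reversible_editor(image: Image, prompt: str) -> Image:
    """Toy editor that brightens or darkens the scene depending on the prompt."""
    delta = {"turn on the lamp": 0.5, "turn off the lamp": -0.5}.get(prompt, 0.0)
    return [[p + delta for p in row] for row in image]


def forgetful_editor(image: Image, prompt: str) -> Image:
    """Toy editor that can perform the action but ignores the reverse prompt."""
    delta = 0.5 if prompt == "turn on the lamp" else 0.0
    return [[p + delta for p in row] for row in image]


scene = [[0.0, 0.25], [0.5, 0.25]]
good = do_undo_error(scene, "turn on the lamp", "turn off the lamp", reversible_editor)
bad = do_undo_error(scene, "turn on the lamp", "turn off the lamp", forgetful_editor)
```

Here the reversible editor recovers the original scene, while the forgetful one leaves a residual error. An editor that does nothing at all would also score zero on this round trip, so a check of this kind only makes sense alongside a separate check that the forward step actually performed the action.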
