Do-Undo Bench: Reversibility for Action Understanding in Image Generation

Shweta Mahajan; Shreya Kadambi; Hoang Le; Rajeev Yasarla; Apratim Bhattacharyya; Munawar Hayat; Fatih Porikli

arXiv:2512.13609 · cs.CV · May 15, 2026

TL;DR

The paper introduces the Do-Undo benchmark to evaluate vision-language models' ability to understand and reverse real-world scene transformations caused by actions, emphasizing cause-and-effect reasoning.

Contribution

It presents a new task and benchmark for reversible action understanding in image generation, highlighting current models' limitations in this area.

Findings

01. Current models struggle with action reversibility.

02. The benchmark enables evaluation of cause-and-effect understanding.

03. It highlights the gap in action-aware multimodal reasoning.

Abstract

We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating plausible scene transformations driven by real-world actions. Unlike prior work that relies on prompt-based image generation and editing to perform action-conditioned image manipulation, our training hypothesis requires models to simulate the outcome of a real-world action and then reverse it to the original state. This forward-reverse requirement tests genuine cause-and-effect understanding rather than stylistic or semantic edits. We curate a high-quality benchmark of reversible actions from real-world scenarios to enable robust action grounding. Our experiments reveal that current models struggle with action reversibility, highlighting the need to evaluate action understanding. Do-Undo provides an intuitive testbed for evaluating and advancing action-aware…
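
The forward-reverse requirement described in the abstract can be illustrated with a small round-trip check. This is only a sketch under stated assumptions, not the paper's evaluation protocol: the `Editor` interface, the prompts, the toy models, and the use of mean absolute pixel error as a consistency score are all hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, Dict, List

# Toy stand-in for an image: flattened grayscale pixels in [0, 1].
Image = List[float]
# Hypothetical model interface (an assumption): (image, action prompt) -> edited image.
Editor = Callable[[Image, str], Image]


def mean_abs_error(a: Image, b: Image) -> float:
    """Mean absolute pixel difference between two same-sized images."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def do_undo_round_trip(model: Editor, image: Image,
                       do_prompt: str, undo_prompt: str) -> Dict[str, float]:
    """Apply the forward action, then the reverse action, and compare to the start.

    The score used here is an assumed proxy, not the benchmark's metric.
    """
    after_do = model(image, do_prompt)
    after_undo = model(after_do, undo_prompt)
    return {
        # How much the forward action changed the scene.
        "forward_change": mean_abs_error(image, after_do),
        # How far the undone image is from the original (0.0 means fully reversed).
        "round_trip_error": mean_abs_error(image, after_undo),
    }


def reversible_toy_model(image: Image, prompt: str) -> Image:
    """Toy editor: 'open' sets the first pixel to 1.0, 'close' sets it to 0.0."""
    edited = list(image)
    if prompt.startswith("open"):
        edited[0] = 1.0
    elif prompt.startswith("close"):
        edited[0] = 0.0
    return edited


def forward_only_toy_model(image: Image, prompt: str) -> Image:
    """Toy editor that performs 'open' but ignores the reverse action."""
    edited = list(image)
    if prompt.startswith("open"):
        edited[0] = 1.0
    return edited


start = [0.0, 0.5, 0.5, 0.5]
good = do_undo_round_trip(reversible_toy_model, start, "open the drawer", "close the drawer")
bad = do_undo_round_trip(forward_only_toy_model, start, "open the drawer", "close the drawer")
print(good)
print(bad)
```

A model that edits the scene but cannot reverse the action shows a nonzero `forward_change` together with a nonzero `round_trip_error`, which is the kind of failure the abstract attributes to current models. A real evaluation would need a metric that also checks whether the forward edit is correct, since a model that changes nothing would trivially score a zero round-trip error.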
