Imagine How To Change: Explicit Procedure Modeling for Change Captioning
Jiayang Sun, Zixin Guo, Min Cao, Guibo Zhu, Jorma Laaksonen

TL;DR
ProCap introduces a dynamic procedure modeling framework for change captioning, capturing temporal change processes from keyframes to generate more accurate descriptions of visual differences.
Contribution
It reformulates change captioning from static image comparison to dynamic procedure modeling with a novel two-stage framework and learnable procedure queries.
Findings
Outperforms existing methods on three datasets.
Effectively captures change procedures with temporal coherence.
Reduces sensitivity to visual noise in change descriptions.
Abstract
Change captioning generates descriptions that explicitly describe the differences between two visually similar images. Existing methods operate on static image pairs and thus ignore the rich temporal dynamics of the change procedure, which is key to understanding not only what has changed but also how it occurs. We introduce ProCap, a novel framework that reformulates change modeling from static image comparison to dynamic procedure modeling. ProCap features a two-stage design. The first stage trains a procedure encoder to learn the change procedure from a sparse set of keyframes. These keyframes are obtained by automatically generating intermediate frames, which makes the implicit procedural dynamics explicit, and then sampling them to mitigate redundancy. The encoder then learns to capture the latent dynamics of these keyframes via a caption-conditioned, masked reconstruction task. The…
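The first-stage data preparation described in the abstract (sample a sparse set of keyframes from generated intermediate frames, then hide some of them for a reconstruction task) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not state the sampling rule or the masking scheme, so evenly spaced sampling and random masking are stand-ins, and `sample_keyframes`, `mask_keyframes`, and `mask_ratio` are hypothetical names rather than ProCap's API.

```python
import random


def sample_keyframes(num_frames, num_keyframes):
    """Return evenly spaced frame indices that include the first and last frame.

    Assumption: uniform spacing is a stand-in for whatever sampling rule
    the paper actually uses to reduce redundancy.
    """
    if num_keyframes < 2 or num_keyframes > num_frames:
        raise ValueError("need 2 <= num_keyframes <= num_frames")
    step = (num_frames - 1) / (num_keyframes - 1)
    return [round(i * step) for i in range(num_keyframes)]


def mask_keyframes(keyframe_indices, mask_ratio, rng):
    """Split keyframes into visible ones and hidden ones to be reconstructed.

    Assumption: random masking with a fixed ratio; at least one keyframe
    is always hidden so the reconstruction task is never empty.
    """
    num_masked = max(1, int(len(keyframe_indices) * mask_ratio))
    masked_positions = set(rng.sample(range(len(keyframe_indices)), num_masked))
    visible = [f for p, f in enumerate(keyframe_indices) if p not in masked_positions]
    hidden = [f for p, f in enumerate(keyframe_indices) if p in masked_positions]
    return visible, hidden


# 17 frames in total: the "before" image, 15 generated frames, the "after" image.
keyframes = sample_keyframes(num_frames=17, num_keyframes=5)
visible, hidden = mask_keyframes(keyframes, mask_ratio=0.4, rng=random.Random(0))
```

In a setup like this, an encoder would see the visible keyframes together with the caption and be trained to reconstruct the hidden ones; that training loop is outside the scope of this sketch.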
Peer Reviews
Decision: ICLR 2026 Poster
1. Novel Paradigm and Strong Motivation: The paper presents a fresh perspective by shifting from static image comparison to dynamic procedure modeling, addressing a limitation in existing methods that ignore temporal dynamics. 2. Two-Stage Design Decoupling Learning and Inference: The framework separates explicit procedure learning (Stage 1) from implicit inference (Stage 2). Using learnable queries instead of generating frames at test time is technically sound. 3. Comprehensive Experimental
1. This paper models the change procedure to improve captioning, but (132-137) the mapping γ_T: [0,1] → I is inherently non-bijective with an exponentially large solution space. For any given (I_bef, I_aft) pair, infinitely many valid procedures exist, yet the method relies on a generated sequence from FI without justifying why this particular realization should be canonical or optimal. This makes procedure modeling more difficult than directly describing the change—a small model must navigate a
1. The designed framework reformulates change captioning from static comparison to dynamic procedure modeling, which can capture the rich temporal dynamics. 2. The proposed explicit procedure modeling module can produce continuous frames between static image pairs, which facilitates the change captioning.
1. The method does not consistently outperform baseline methods across the three datasets, indicating that its overall robustness may not be sufficiently strong. 2. ProCap employs a non-LLM-based backbone, and it remains unclear how its procedure modeling module would perform when integrated with a powerful LLM-based backbone.
1. Novel Problem Formulation: The paper introduces a conceptual shift by reformulating change captioning from a static image comparison task into a dynamic procedure modeling problem. This directly addresses a key limitation of prior work, which largely ignores the rich temporal dynamics of how a change unfolds. 2. Efficient Architecture: (1) Stage 1 effectively learns a rich representation of spatio-temporal dynamics by training an encoder on explicitly generated and sampled keyframes. (2) Sta
1. It does not show a clear performance advantage on Spot-the-Diff over other SOTA methods. 2. As shown in Table 4, the consistency loss does not yield a substantial performance gain.
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques
