Procedural Mistake Detection via Action Effect Modeling
Wenliang Guo, Yujiang Pu, and Yu Kong

TL;DR
This paper introduces Action Effect Modeling (AEM), a probabilistic framework that jointly captures action execution and outcomes to improve mistake detection in procedural tasks, outperforming existing methods.
Contribution
The paper presents a novel unified framework that models both actions and their effects, incorporating visual and symbolic cues for more accurate mistake detection.
Findings
Achieves state-of-the-art performance on EgoPER and CaptainCook4D benchmarks.
Effect-aware representations improve mistake detection reliability.
Demonstrates the importance of modeling action outcomes alongside execution.
Abstract
Mistake detection in procedural tasks is essential for building intelligent systems that support learning and task execution. Existing approaches primarily analyze how an action is performed, while overlooking what it produces, i.e., the action effect. Yet many errors manifest not in the execution itself but in the resulting outcome, such as an unintended object state or incorrect spatial arrangement. To address this gap, we propose Action Effect Modeling (AEM), a unified framework that jointly captures action execution and its outcomes through a probabilistic formulation. AEM first identifies the outcome of an action by selecting the most informative effect frame based on semantic relevance and visual quality. It then extracts complementary cues from visual grounding and symbolic scene graphs, aligning them in a shared latent space to form robust effect-aware representations.…
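As a rough illustration of the effect-frame selection step mentioned in the abstract, the sketch below scores each candidate frame by combining a semantic-relevance score with a visual-quality score and returns the index of the best frame. The function name `select_effect_frame`, the weighted-sum rule, and the weight `alpha` are assumptions made for illustration; the abstract does not say how the two criteria are combined, so this is not the authors' implementation.

```python
from typing import Sequence


def select_effect_frame(
    relevance: Sequence[float],
    quality: Sequence[float],
    alpha: float = 0.5,
) -> int:
    """Return the index of the frame with the highest combined score.

    Hypothetical scoring rule (an assumption, not taken from the paper):
        score_i = alpha * relevance_i + (1 - alpha) * quality_i
    """
    if len(relevance) != len(quality):
        raise ValueError("relevance and quality must have the same length")
    if len(relevance) == 0:
        raise ValueError("at least one candidate frame is required")
    scores = [alpha * r + (1.0 - alpha) * q for r, q in zip(relevance, quality)]
    # Ties resolve to the earliest frame because max() keeps the first maximum.
    return max(range(len(scores)), key=scores.__getitem__)


# Toy scores for four candidate frames following an action segment.
relevance = [0.2, 0.9, 0.8, 0.4]
quality = [0.9, 0.3, 0.8, 0.5]
best = select_effect_frame(relevance, quality)
print(best)  # frame 2 is both relevant and sharp
```

Setting `alpha=1.0` reduces the rule to relevance alone, which in this toy example would pick a relevant but low-quality frame instead.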
Peer Reviews
Decision·ICLR 2026 Poster
Strong empirical results on two benchmarks, with ablations validating key components.
1. Novelty is limited, as the integration of scene graphs and VLMs builds on existing work (e.g., Hurst et al., 2024) and offers little beyond fusion. 2. Benchmarks are narrow (only two datasets) and lack evaluation across diverse domains, such as assembly or medical procedures, as mentioned in the introduction. 3. The prompt-based detector seems straightforward, and gains over baselines like ProtoMD are modest in some metrics. 4. The methodology sections are dense, with equations (e.g., Eq. 1)
- S1 The main methodological contribution that incorporates both the action execution and its effect into a probabilistic framework is original in the procedural mistake detection space. - S2 The paper builds on the growing literature in procedural mistake detection, enabling measured comparisons by using common pipeline steps from recent works. - S3 The paper includes two contemporary and popular procedural mistake detection benchmarks in its evaluation. Generally the proposed work achieves
- W1 The prompt-based detector is not well explained, nor does it seem to be properly analyzed in the results. Yet, it is claimed to be a primary contribution. Is this alignment not needed in general for procedural mistake detection methods? This part of the paper is very unclear. And, considering its importance in the overall paper, this significantly detracts from the quality of the paper. - W2 The paper seems heavily dependent on GPT-4o for numerous functions. Notwithstanding the
* The paper identifies a genuine gap in current procedural‑mistake detection: most methods examine only how an action is performed, ignoring the _outcome_ that actually indicates an error. * Explicitly modeling the effect of an action is a natural extension of procedural AI and can be transferred to domains such as industrial assembly, robotics, or medical procedure guidance. * By jointly modeling execution and outcome through the proposed Action Effect Modeling (AEM) framework, the authors offe
* In the overall training objective (Eq. (10): $L = L^{seg} + L^{eff} + L^{CL} + L^{det}$), the authors simply sum the four loss terms without any weighting or scaling factors. I am unsure whether the authors verified that these terms are of comparable magnitude; if not, a large loss could dominate the optimization and bias the learned representations. A brief ablation or sensitivity study showing the relative scales of the losses (or the inclusion of trainable weighting coefficients) would help
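The reviewer's concern about the unweighted sum can be made concrete with a small diagnostic. The sketch below reports each term's share of the total loss and shows how a fixed scaling coefficient changes the total. The helper names `loss_shares` and `weighted_total`, the coefficient values, and the loss values are all made up for illustration; none of them come from the paper.

```python
from typing import Dict


def loss_shares(losses: Dict[str, float]) -> Dict[str, float]:
    """Return each term's fraction of the unweighted total loss."""
    total = sum(losses.values())
    if total <= 0:
        raise ValueError("total loss must be positive")
    return {name: value / total for name, value in losses.items()}


def weighted_total(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Sum the loss terms, scaling each by its weight (default weight 1.0)."""
    return sum(weights.get(name, 1.0) * value for name, value in losses.items())


# Made-up per-batch values for the four terms in Eq. (10).
losses = {"seg": 2.0, "eff": 0.5, "CL": 6.0, "det": 1.5}

shares = loss_shares(losses)             # CL accounts for 60% of the total here
unweighted = weighted_total(losses, {})  # plain sum, as in Eq. (10)
rebalanced = weighted_total(losses, {"CL": 0.25})  # hypothetical down-weighting
```

Logging shares like these over training would show whether one term dominates the gradient signal, which is the kind of sensitivity evidence the review asks for.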
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Human Pose and Action Recognition
