Transparent and Coherent Procedural Mistake Detection
Shane Storks, Itamar Bar-Yossef, Yayuan Li, Zheyuan Zhang, Jason J. Corso, Joyce Chai

TL;DR
This paper introduces a transparent approach to procedural mistake detection using visual self-dialog rationales and benchmarks the performance of vision-language models, highlighting current limitations and avenues for enhancement.
Contribution
It reformulates procedural mistake detection to include visual rationales and develops automated coherence metrics, providing new insights into model transparency and performance.
Findings
VLMs struggle with PMD when used off-the-shelf
Incorporating coherence metrics improves accuracy and efficiency
Visual rationales enhance transparency in mistake detection
Abstract
Procedural mistake detection (PMD) is a challenging problem of classifying whether a human user (observed through egocentric video) has successfully executed a task (specified by a procedural text). Despite significant recent efforts, machine performance in the wild remains nonviable, and the reasoning processes underlying this performance are opaque. As such, we extend PMD to require generating visual self-dialog rationales to inform decisions. Given the impressive, mature image understanding capabilities observed in recent vision-and-language models (VLMs), we curate a suitable benchmark dataset for PMD based on individual frames. As our reformulation enables unprecedented transparency, we leverage a natural language inference (NLI) model to formulate two automated metrics for the coherence of generated rationales. We establish baselines for this reframed task, showing that VLMs…
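The abstract does not define the two coherence metrics, so the following is only an illustrative sketch of how an NLI model might be used to score one turn of a visual self-dialog rationale. Everything named here is hypothetical and not taken from the paper: the `EntailmentFn` interface, the hypothesis template, the `belief_shift` score, and the `toy_entail` stand-in, which replaces a real NLI model so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical NLI interface: (premise, hypothesis) -> probability in [0, 1]
# that the premise entails the hypothesis. A real NLI model would go here.
EntailmentFn = Callable[[str, str], float]


@dataclass
class QAPair:
    """One turn of a self-dialog rationale: a visual question and its answer."""
    question: str
    answer: str


def success_hypothesis(step: str) -> str:
    # Assumed template; the paper's actual wording is not given in the abstract.
    return f'The procedure "{step}" has been successfully completed.'


def to_premise(rationale: List[QAPair]) -> str:
    # Flatten the dialog into a single premise string for the NLI model.
    return " ".join(f"Question: {qa.question} Answer: {qa.answer}" for qa in rationale)


def belief_shift(step: str, rationale: List[QAPair], new_qa: QAPair,
                 entail: EntailmentFn) -> float:
    """Change in the entailment probability of the success hypothesis
    when new_qa is appended to the existing rationale."""
    hypothesis = success_hypothesis(step)
    before = entail(to_premise(rationale), hypothesis)
    after = entail(to_premise(rationale + [new_qa]), hypothesis)
    return after - before


def toy_entail(premise: str, hypothesis: str) -> float:
    # Stand-in scorer for illustration only: looks at the last answer.
    if premise.endswith("Answer: yes"):
        return 0.9
    if premise.endswith("Answer: no"):
        return 0.1
    return 0.5  # no evidence yet


if __name__ == "__main__":
    qa = QAPair("Is there water in the cup?", "yes")
    print(belief_shift("Pour water into the cup", [], qa, toy_entail))
```

Under this sketch, a question-answer pair that moves the score far from zero would count as carrying evidence about success or failure, while a pair that leaves it near zero would add little to the decision.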
Taxonomy
Topics: Software Engineering Research · Digital and Cyber Forensics
Methods: Softmax · Attention Is All You Need
