Chain-of-Procedure: Hierarchical Visual-Language Reasoning for Procedural QA
Guanhua Chen, Yutong Yao, Shenghe Sun, Ci-Jun Gao, Shudong Liu, Lidia S. Chao, Feng Wan, Derek F. Wong

TL;DR
This paper introduces Chain-of-Procedure (CoP), a hierarchical reasoning framework that improves vision-language models on visual procedure question answering by strengthening procedure retrieval and step decomposition, with up to 13% absolute improvement over baselines.
Contribution
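The paper presents Chain-of-Procedure (CoP), a hierarchical reasoning framework that targets two limitations of current VLMs on visual procedural reasoning: inadequate cross-modal retrieval of structured procedures from visual states, and a granularity mismatch between image sequences and textual steps.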
Findings
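CoP yields up to 13% absolute improvement over baselines.
Current VLMs show two critical limitations: inadequate cross-modal retrieval of structured procedures, and misalignment between image-sequence granularity and textual step decomposition.
The authors propose ProcedureVQA, a multimodal benchmark for evaluating models on visual procedure question answering (VP-QA).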
Abstract
Recent advances in vision-language models (VLMs) have achieved impressive results on standard image-text tasks, yet their potential for visual procedure question answering (VP-QA) remains largely unexplored. VP-QA presents unique challenges where users query next-step actions by uploading images for intermediate states of complex procedures. To systematically evaluate VLMs on this practical task, we propose ProcedureVQA, a novel multimodal benchmark specifically designed for visual procedural reasoning. Through comprehensive analysis, we identify two critical limitations in current VLMs: inadequate cross-modal retrieval of structured procedures given visual states, and misalignment between image sequence granularity and textual step decomposition. To address these issues, we present Chain-of-Procedure (CoP), a hierarchical reasoning framework that first retrieves relevant instructions…
