Chain-of-Procedure: Hierarchical Visual-Language Reasoning for Procedural QA

Guanhua Chen; Yutong Yao; Shenghe Sun; Ci-Jun Gao; Shudong Liu; Lidia S. Chao; Feng Wan; Derek F. Wong

arXiv:2605.14928·cs.CL·May 15, 2026

TL;DR

This paper introduces Chain-of-Procedure (CoP), a hierarchical reasoning framework for visual procedure question answering (VP-QA). It targets two weaknesses of current vision-language models, cross-modal retrieval of procedures and step decomposition, and reports up to 13% absolute improvement over baselines.

Contribution

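The paper presents Chain-of-Procedure (CoP), a hierarchical reasoning framework that addresses two limitations of current VLMs on visual procedural reasoning: inadequate cross-modal retrieval of structured procedures given visual states, and misalignment between image sequence granularity and textual step decomposition.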

Findings

01

CoP achieves up to 13% absolute improvement over baselines.

02

Current VLMs show two critical limitations: inadequate cross-modal retrieval and mismatched step granularity.

03

The paper proposes ProcedureVQA, a multimodal benchmark for evaluating VLMs on VP-QA.

Abstract

Recent advances in vision-language models (VLMs) have achieved impressive results on standard image-text tasks, yet their potential for visual procedure question answering (VP-QA) remains largely unexplored. VP-QA presents unique challenges where users query next-step actions by uploading images for intermediate states of complex procedures. To systematically evaluate VLMs on this practical task, we propose ProcedureVQA, a novel multimodal benchmark specifically designed for visual procedural reasoning. Through comprehensive analysis, we identify two critical limitations in current VLMs: inadequate cross-modal retrieval of structured procedures given visual states, and misalignment between image sequence granularity and textual step decomposition. To address these issues, we present Chain-of-Procedure (CoP), a hierarchical reasoning framework that first retrieves relevant instructions…
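The retrieval stage mentioned in the abstract, finding the relevant procedure for a given visual state, can be illustrated with a generic nearest-neighbor lookup over embeddings. The following is a minimal sketch and not the authors' method: the function names `cosine` and `retrieve_procedure`, the procedure titles, and the three-dimensional vectors are all made up for illustration, and a real system would obtain embeddings from a vision-language encoder rather than hard-coding them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_procedure(image_vec, procedures):
    """Return the title of the procedure whose embedding is closest to image_vec.

    `procedures` maps a procedure title to a text embedding (hypothetical data).
    """
    return max(procedures, key=lambda title: cosine(image_vec, procedures[title]))


# Toy, hand-written embeddings standing in for encoder outputs (assumption).
PROCEDURES = {
    "brew pour-over coffee": [0.9, 0.1, 0.0],
    "replace a bike tube": [0.1, 0.8, 0.3],
    "fold a paper crane": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a user-uploaded image of an intermediate state.
query = [0.2, 0.7, 0.4]
best = retrieve_procedure(query, PROCEDURES)
print(best)
```

With these toy vectors the query is closest to the bike-tube procedure. Later stages of the framework are not described in the truncated abstract, so the sketch stops at retrieval.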
