ProcBench: Benchmark for Multi-Step Reasoning and Following Procedure
Ippei Fujisawa, Sensho Nobe, Hiroki Seto, Rina Onda, Yoshiaki Uchida, Hiroki Ikoma, Pei-Chun Chien, Ryota Kanai

TL;DR
ProcBench is a specialized benchmark designed to evaluate large language models' ability to perform multi-step reasoning by following explicit instructions, isolating reasoning steps, and assessing performance across varied tasks.
Contribution
This paper introduces a novel benchmark that isolates multi-step inference, enabling precise evaluation of LLMs' reasoning capabilities by eliminating path exploration and implicit knowledge use.
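To make the idea of an explicit-procedure task concrete, here is a minimal sketch in Python. The task, the operation names (`append`, `reverse`, `delete_first`), and the `run_procedure` helper are all hypothetical illustrations and are not taken from the benchmark itself. The point is only that every step is spelled out, so solving the task requires no search over possible paths and no outside knowledge, and that each intermediate state can be recorded for later comparison.

```python
def run_procedure(text, steps):
    """Apply each explicit step in order and record every intermediate state.

    Hypothetical toy task, not an actual ProcBench task.
    """
    states = []
    for op, arg in steps:
        if op == "append":          # add a suffix to the string
            text = text + arg
        elif op == "reverse":       # reverse the whole string
            text = text[::-1]
        elif op == "delete_first":  # drop the first `arg` characters
            text = text[arg:]
        else:
            raise ValueError(f"unknown step: {op}")
        states.append(text)
    return states


# Three explicit steps applied to the starting string "abc".
steps = [("append", "xy"), ("reverse", None), ("delete_first", 1)]
states = run_procedure("abc", steps)
print(states)  # ['abcxy', 'yxcba', 'xcba']
```

A model's answer to such a task could then be compared against `states` step by step rather than only on the last element.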
Findings
LLMs show varying accuracy depending on reasoning complexity.
Step-aware metrics reveal specific strengths and weaknesses.
The benchmark provides a new standard for assessing reasoning in LLMs.
Abstract
Reasoning is central to a wide range of intellectual activities, and while the capabilities of large language models (LLMs) continue to advance, their performance in reasoning tasks remains limited. The processes and mechanisms underlying reasoning are not yet fully understood, but key elements include path exploration, selection of relevant knowledge, and multi-step inference. Problems are solved through the synthesis of these components. In this paper, we propose a benchmark that focuses on a specific aspect of reasoning ability: the direct evaluation of multi-step inference. To this end, we design a special reasoning task where multi-step inference is specifically focused by largely eliminating path exploration and implicit knowledge utilization. Our dataset comprises pairs of explicit instructions and corresponding questions, where the procedures necessary for solving the questions…
Peer Reviews
Decision · Submitted to ICLR 2025
Novel Focus on Procedural Reasoning: The paper addresses a critical but underexplored aspect of LLM evaluation by isolating the ability to follow explicit, multi-step instructions without relying on implicit knowledge. This focus fills an important gap in existing benchmarks.
Comprehensive and Carefully Designed Controllable Benchmark: ProcBench includes 23 distinct tasks that cover a range of procedural challenges across different domains, involving string manipulation, list operations, and ar
Limited Real-World Applicability: While effective for isolating procedural reasoning, the tasks may not fully capture the complexities and nuances of real-world scenarios where implicit knowledge and domain-specific understanding are often required.
Focus on Specific Task Types: The benchmark predominantly includes tasks involving string manipulation, list operations, and basic arithmetic. This may limit the assessment of models' abilities in other types of procedures, such as complex logical r
- I believe this paper is clearly written and easy to understand.
- Various metrics to measure instruction-following capabilities.
# Limitations
- Lack of novelty: [1], [2] already demonstrate that LLM accuracy declines as the planning depth of procedurally generated problems increases. The paper doesn't propose any new technique that can improve or augment instruction-following capabilities.
- Insufficient benchmarking: The paper fails to provide benchmarking insights across various model variants and sizes, which limits the usability of its benchmarking insights.
- Effect of tokenization: Given the tasks involve character ma
- The paper introduces new metrics such as prefix match length, prefix accuracy, sequential match, and final match, which provide some interesting analysis of the role that task complexity (i.e., increasing the number of steps required for the task) plays in a model’s ability to perform extended, precise instruction following.
- The benchmark creation process is clear and is constructed in a manner to assess the capability coined as “instruction follow-ability”.
- The 23 tasks created are not representative of real tasks that an LLM would perform in practice, and there were no experiments that attempt to correlate performance on this type of instruction-followability task with general reasoning tasks.
- The paper mentions that a novelty of their benchmark relative to other instruction-following benchmarks is that they are able to assess the intermediate steps and not rely simply on the final outcome. However, their empirical findings show that FM is, in practic
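The step-aware metrics named in the reviews (prefix match length, prefix accuracy, sequential match, and final match) can be sketched roughly as below. The definitions here are assumptions for illustration: prefix match length is taken as the number of leading steps on which the predicted and reference state sequences agree, prefix accuracy as that length divided by the longer sequence's length, sequential match as exact agreement of the whole sequence, and final match as agreement of the last state only. The paper's exact formulations may differ, and the function names are hypothetical.

```python
def prefix_match_length(pred, target):
    """Number of leading steps on which pred and target agree (assumed definition)."""
    n = 0
    for p, t in zip(pred, target):
        if p != t:
            break
        n += 1
    return n


def prefix_accuracy(pred, target):
    """Prefix match length normalised by the longer sequence (assumed definition)."""
    longest = max(len(pred), len(target))
    return prefix_match_length(pred, target) / longest if longest else 1.0


def sequential_match(pred, target):
    """1 if every intermediate state matches exactly, else 0 (assumed definition)."""
    return int(pred == target)


def final_match(pred, target):
    """1 if only the final states agree, ignoring earlier steps (assumed definition)."""
    return int(bool(pred) and bool(target) and pred[-1] == target[-1])


# Hypothetical example: the second intermediate state is wrong,
# but the final state happens to be right.
target = ["abcxy", "yxcba", "xcba"]
pred = ["abcxy", "yxcab", "xcba"]

pml = prefix_match_length(pred, target)  # 1
pa = prefix_accuracy(pred, target)       # 1/3
sm = sequential_match(pred, target)      # 0
fm = final_match(pred, target)           # 1
```

In this made-up example the final match is 1 while the sequential match is 0, which is the kind of gap between outcome-only and step-aware scoring that the reviewers discuss.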
Taxonomy
Topics: Semantic Web and Ontologies · AI-based Problem Solving and Planning
