Benchmarking Agentic Workflow Generation
Shuofei Qiao, Runnan Fang, Zhisong Qiu, Xiaobin Wang, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen

TL;DR
This paper introduces WorfBench, a comprehensive benchmark and evaluation protocol for assessing large language models' ability to generate complex, structured workflows, revealing gaps in their planning capabilities and demonstrating benefits for downstream tasks.
Contribution
The paper presents WorfBench and WorfEval, new tools for benchmarking and evaluating LLMs' workflow generation with complex structures and diverse scenarios.
Findings
LLMs show a 15% gap between sequence and graph planning capabilities.
Generated workflows improve downstream task performance and efficiency.
Open-source models' generalization abilities are evaluated on held-out tasks.
Abstract
Large Language Models (LLMs), with their exceptional ability to handle a wide range of tasks, have driven significant advancements in tackling reasoning and planning tasks, wherein decomposing complex problems into executable workflows is a crucial step in this process. Existing workflow evaluation frameworks either focus solely on holistic performance or suffer from limitations such as restricted scenario coverage, simplistic workflow structures, and lax evaluation standards. To this end, we introduce WorfBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures. Additionally, we present WorfEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms to accurately quantify the LLM agent's workflow generation capabilities. Through comprehensive evaluations across different types of LLMs, we…
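The abstract mentions subsequence matching as one ingredient of WorfEval. As a rough illustration only, the sketch below scores a predicted node chain against a reference chain using the longest common subsequence. It assumes precision and recall are the matched length divided by the predicted and reference lengths respectively; the function names `lcs_length` and `chain_f1` are hypothetical, and the paper's exact formulation may differ.

```python
def lcs_length(pred, gold):
    """Length of the longest common subsequence of two node sequences."""
    m, n = len(pred), len(gold)
    # dp[i][j] = LCS length of pred[:i] and gold[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if pred[i - 1] == gold[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]


def chain_f1(pred, gold):
    """F1 over the matched subsequence (assumed formulation, not the paper's code)."""
    if not pred or not gold:
        return 0.0
    matched = lcs_length(pred, gold)
    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    predicted = ["find apple", "wash apple", "put apple on table"]
    reference = ["find apple", "put apple on table"]
    print(chain_f1(predicted, reference))
```

Here the predicted chain contains one extra step, so recall is perfect while precision drops, which is the kind of partial credit an exact-match metric would not give.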
Peer Reviews
Decision·ICLR 2025 Poster
S1) The problem highlighted by the paper is valid and emerging.
S2) The dataset has some interesting features.
S3) The experiments seem to be extensive.
W1) The evaluation scores f1_chain and f1_graph in Section 2.4 are introduced as the measures for all evaluations in Section. However, the paper gives no solid justification for how they are formulated or why they are the right measures for workflow chains and graphs.
W2) The quality control protocol is highly subjective and manual, so it is difficult to judge the quality of the benchmark data.
W3) Many technical details are unclear (see questions).
1. The idea of evaluating agentic workflow generation is interesting and novel.
2. The paper puts in the work to evaluate a wide range of models.
3. The evaluated scenarios cover a wide range of agentic use cases.
1. The matching between the model-generated nodes and the ground-truth nodes is done using Sentence-BERT. This may introduce errors in the evaluation step if the matching is incorrect.
2. The evaluation metrics mostly focus on similarity to a ground-truth graph, not on whether the generated workflows can complete the task correctly. There might be more than one graph that can complete the given task.
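To make the first concern concrete, the sketch below shows a generic threshold-based node matcher. It is not the paper's implementation: token-level Jaccard overlap stands in for Sentence-BERT cosine similarity, and the names `jaccard`, `match_nodes`, and the threshold value are hypothetical choices for illustration.

```python
def jaccard(a, b):
    """Toy similarity: token overlap. A stand-in for embedding cosine similarity."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def match_nodes(pred_nodes, gold_nodes, sim=jaccard, threshold=0.6):
    """Greedy one-to-one matching of predicted nodes to gold nodes.

    Returns a dict mapping predicted index -> gold index for every pair
    whose similarity is at least `threshold` (an assumed cutoff).
    """
    candidates = sorted(
        (
            (sim(p, g), i, j)
            for i, p in enumerate(pred_nodes)
            for j, g in enumerate(gold_nodes)
        ),
        reverse=True,
    )
    matched, used_gold = {}, set()
    for score, i, j in candidates:
        if score < threshold:
            break  # remaining pairs are even less similar
        if i in matched or j in used_gold:
            continue
        matched[i] = j
        used_gold.add(j)
    return matched


if __name__ == "__main__":
    pred = ["open the fridge", "take the apple"]
    gold = ["take an apple", "open the fridge", "close the fridge"]
    print(match_nodes(pred, gold))                 # strict threshold
    print(match_nodes(pred, gold, threshold=0.5))  # looser threshold
```

Changing the threshold changes which nodes count as matched, so any score computed on top of the matching inherits that sensitivity, which is the risk the reviewer points out.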
- The writing and presentation are good.
- The motivation of standardizing the evaluation of agent workflow is impressive.
- The evaluation is extensive and the insights from the evaluation are helpful.
N/A
Taxonomy
Topics: Scientific Computing and Data Management · Business Process Modeling and Analysis · Distributed and Parallel Computing Systems
Methods: Attention Is All You Need · Dense Connections · Residual Connection · Position-Wise Feed-Forward Layer · Adam · Linear Layer · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings
