Benchmarking Agentic Workflow Generation

Shuofei Qiao; Runnan Fang; Zhisong Qiu; Xiaobin Wang; Ningyu Zhang; Yong Jiang; Pengjun Xie; Fei Huang; Huajun Chen

arXiv:2410.07869·cs.CL·February 25, 2025·2 cites

TL;DR

This paper introduces WorfBench, a comprehensive benchmark and evaluation protocol for assessing large language models' ability to generate complex, structured workflows, revealing gaps in their planning capabilities and demonstrating benefits for downstream tasks.

Contribution

The paper presents WorfBench and WorfEval, new tools for benchmarking and evaluating LLMs' workflow generation with complex structures and diverse scenarios.

Findings

01

LLMs show a 15% gap between sequence and graph planning capabilities.

02

Generated workflows improve downstream task performance and efficiency.

03

Open-source models' generalization abilities are evaluated on held-out tasks.

Abstract

Large Language Models (LLMs), with their exceptional ability to handle a wide range of tasks, have driven significant advancements in tackling reasoning and planning tasks, wherein decomposing complex problems into executable workflows is a crucial step in this process. Existing workflow evaluation frameworks either focus solely on holistic performance or suffer from limitations such as restricted scenario coverage, simplistic workflow structures, and lax evaluation standards. To this end, we introduce WorfBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures. Additionally, we present WorfEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms to accurately quantify the LLM agent's workflow generation capabilities. Through comprehensive evaluations across different types of LLMs, we…
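The subsequence-matching idea mentioned in the abstract can be illustrated with a small sketch. This is not the paper's WorfEval implementation: it assumes predicted nodes have already been aligned to gold nodes (here by exact label equality), compares against a single gold ordering, and uses a plain longest-common-subsequence length. The function names `lcs_length` and `chain_f1` are hypothetical.

```python
from typing import Hashable, Sequence


def lcs_length(pred: Sequence[Hashable], gold: Sequence[Hashable]) -> int:
    """Length of the longest common subsequence of two node chains."""
    # prev[j] holds the LCS length of the processed prefix of pred and gold[:j].
    prev = [0] * (len(gold) + 1)
    for p in pred:
        curr = [0]
        for j, g in enumerate(gold, start=1):
            if p == g:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def chain_f1(pred: Sequence[Hashable], gold: Sequence[Hashable]) -> float:
    """F1 over nodes that appear in the same relative order in both chains.

    Assumption: node identity is exact equality; a real evaluator would
    first align nodes by semantic similarity.
    """
    if not pred or not gold:
        return 0.0
    matched = lcs_length(pred, gold)
    if matched == 0:
        return 0.0
    precision = matched / len(pred)  # share of predicted nodes in order
    recall = matched / len(gold)     # share of gold nodes recovered in order
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold_chain = ["search", "filter", "compare", "buy"]
    pred_chain = ["search", "compare", "filter", "buy"]
    print(chain_f1(pred_chain, gold_chain))
```

In this toy case two steps are swapped, so only three of the four nodes can be kept in a consistent order, and precision and recall are both 3/4. Order-sensitive scoring of this kind penalizes misordered steps that a plain set-overlap metric would ignore.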

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

S1) The problem highlighted by the paper is valid and emerging.
S2) The dataset has some interesting features.
S3) The experiments seem to be extensive.

Weaknesses

W1) The evaluation scores f1_chain and f1_graph in Section 2.4 are introduced as the measures for all evaluations, but without a solid justification for how they are formulated or why they are the right measures for workflow chains and graphs.
W2) The quality control protocol is very subjective and manual, so it is difficult to judge the quality of the benchmark data.
W3) Many technical details are not very clear (see questions).

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The idea of evaluating agentic workflow generation is interesting and novel.
2. The paper puts in the work to evaluate a wide range of models.
3. The evaluated scenarios cover a wide range of agentic use cases.

Weaknesses

1. The matching between the model-generated nodes and the ground-truth nodes is done using Sentence-BERT. This may introduce errors in the evaluation step if the matching is not correct.
2. The evaluation metrics mostly focus on similarity to a ground-truth graph, not on whether the generated workflows can complete the task correctly. There might be more than one graph that can complete the given task.

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- The writing and presentation are good.
- The motivation of standardizing the evaluation of agent workflows is impressive.
- The evaluation is extensive and the insights from the evaluation are helpful.

Weaknesses

N/A

Code & Models

Repositories

zjunlp/worfbench · PyTorch · Official

Datasets

Videos

Benchmarking Agentic Workflow Generation · SlidesLive

Taxonomy

Topics: Scientific Computing and Data Management · Business Process Modeling and Analysis · Distributed and Parallel Computing Systems

Methods: Attention Is All You Need · Dense Connections · Residual Connection · Position-Wise Feed-Forward Layer · Adam · Linear Layer · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings