JudgeFlow: Agentic Workflow Optimization via Block Judge

Zihan Ma; Zhikai Zhao; Chuanbo Hua; Federico Berto; Jinkyoo Park

arXiv:2601.07477·cs.AI·February 3, 2026

Open Access · 3 Reviews

TL;DR

JudgeFlow introduces a fine-grained, block-level evaluation and optimization framework for agentic workflows, significantly improving efficiency and interpretability in complex AI tasks like reasoning and code generation.

Contribution

It presents a novel pipeline with a dedicated Judge module that provides detailed diagnostics, enabling targeted improvements in agentic workflows.

Findings

01

Achieves superior performance on mathematical reasoning benchmarks.

02

Enhances sample efficiency in workflow optimization.

03

Provides scalable, interpretable diagnostics for complex tasks.

Abstract

Optimizing LLM-based agentic workflows is challenging for scaling AI capabilities. Current methods rely on coarse, end-to-end evaluation signals and lack fine-grained signals on where to refine, often resulting in inefficient or low-impact modifications. To address these limitations, we propose JudgeFlow, an Evaluation-Judge-Optimization-Update pipeline. We incorporate reusable, configurable logic blocks into agentic workflows to capture fundamental forms of logic. On top of this abstraction, we design a dedicated Judge module that inspects execution traces, particularly failed runs, and assigns rank-based responsibility scores to problematic blocks. These fine-grained diagnostic signals are then leveraged by an LLM-based optimizer, which focuses modifications on the most problematic block in the workflow. Our approach improves sample efficiency, enhances interpretability through…
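To make the Judge step in the abstract concrete, the following is a minimal sketch of how per-trace block rankings might be aggregated into responsibility scores and used to pick the block to modify. It is an illustration under stated assumptions, not the paper's method: the function names, the block names, and the Borda-style scoring rule are all hypothetical, and the abstract does not specify the actual scoring formula.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def responsibility_scores(rankings: List[List[str]]) -> Dict[str, float]:
    """Aggregate per-trace block rankings into one score per block.

    Each ranking lists block names from most to least responsible for one
    failed run. The Borda-style points used here are an assumption made for
    illustration; they are not taken from the paper.
    """
    totals: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        n = len(ranking)
        for position, block in enumerate(ranking):
            # The top-ranked block earns 1.0, the last-ranked block earns 1/n.
            totals[block] += (n - position) / n
    # Average over all failed traces so scores stay in (0, 1].
    return {block: total / len(rankings) for block, total in totals.items()}


def most_problematic_block(rankings: List[List[str]]) -> Optional[str]:
    """Return the block with the highest aggregated score, or None if empty."""
    scores = responsibility_scores(rankings)
    if not scores:
        return None
    # Sorting the names first makes tie-breaking deterministic.
    return max(sorted(scores), key=scores.get)


# Hypothetical judge output for three failed runs of a three-block workflow.
example_rankings = [
    ["plan", "solve", "verify"],
    ["solve", "plan", "verify"],
    ["solve", "verify", "plan"],
]
target = most_problematic_block(example_rankings)  # "solve"
```

In this sketch the optimizer would then restrict its edits to the block named by `target`, which mirrors the abstract's description of focusing modifications on the most problematic block.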

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 2

Strengths

The paper is very well written and I commend the authors for putting in the extra work to include pseudocode and useful figures.

Weaknesses

**Baselines:** I'm not convinced that the current results demonstrate the effectiveness of the method properly:
- HumanEval and MBPP are widely regarded as saturated in code generation, partly because most frontier models have been trained on these datasets. Please show additional results on LiveCodeBench, AIME2025, SWE-Bench-Live, etc., which have more complex tasks.
- The set of baselines does not seem reflective of the current state of the art either. How does this compare with OpenHands, SWE-

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. Introduces block-level credit assignment for workflow refinement, moving beyond global end-to-end signals.
2. Experimental setup is clear and robust: multiple reasoning and coding benchmarks, consistent evaluation criteria, and ablations on logic blocks and judge components.
3. Well-written and easy to follow; modular structure and pseudocode clarify pipeline design. Case studies make the pipeline behavior interpretable.
4. Offers a more granular optimization signal for agentic workflows,

Weaknesses

1. **Incremental novelty over AFlow** The pipeline structure closely follows AFlow, raising concerns about incremental contribution. The primary change is shifting judgment from operator-level to block-level, while reusing the same operator abstraction and workflow-editing paradigm. The paper does not explain why this granularity shift should fundamentally improve workflow optimization, and no theoretical justification is provided. Furthermore, the optimization remains benchmark-specific, with

Reviewer 03 · Rating 4 · Confidence 5

Strengths

S1. Identifies a key problem: The paper correctly identifies that prior workflow generation methods rely heavily on heuristic optimization with LLMs alone, lacking concrete optimization signals to guide the search process effectively.
S2. Improved optimization efficiency: The Judge-guided error attribution mechanism significantly enhances both efficiency and effectiveness of workflow optimization. As shown in Figure 4b, JudgeFlow achieves better performance than baseline AFlow with fewer optimi

Weaknesses

S1. Computational efficiency: The optimization process requires evaluating all samples in the dataset at each iteration, which can be computationally expensive. A potential improvement would be to gradually reduce the validation set size during optimization to lower costs.
S2. Incremental contribution: While adding the Judge module is useful, the overall novelty appears limited. The paper primarily introduces optimization signals but lacks deeper insights or substantial architectural innovation

Taxonomy

Topics: Scientific Computing and Data Management · Machine Learning in Materials Science · Explainable Artificial Intelligence (XAI)