JudgeFlow: Agentic Workflow Optimization via Block Judge
Zihan Ma, Zhikai Zhao, Chuanbo Hua, Federico Berto, Jinkyoo Park

TL;DR
JudgeFlow introduces a fine-grained, block-level evaluation and optimization framework for agentic workflows, improving sample efficiency and interpretability on tasks such as mathematical reasoning and code generation.
Contribution
It presents a novel pipeline with a dedicated Judge module that provides detailed diagnostics, enabling targeted improvements in agentic workflows.
Findings
Achieves superior performance on mathematical reasoning benchmarks.
Enhances sample efficiency in workflow optimization.
Provides scalable, interpretable diagnostics for complex tasks.
Abstract
Optimizing LLM-based agentic workflows is challenging for scaling AI capabilities. Current methods rely on coarse, end-to-end evaluation signals and lack fine-grained signals on where to refine, often resulting in inefficient or low-impact modifications. To address these limitations, we propose JudgeFlow, an Evaluation-Judge-Optimization-Update pipeline. We incorporate reusable, configurable logic blocks into agentic workflows to capture fundamental forms of logic. On top of this abstraction, we design a dedicated Judge module that inspects execution traces, particularly failed runs, and assigns rank-based responsibility scores to problematic blocks. These fine-grained diagnostic signals are then leveraged by an LLM-based optimizer, which focuses modifications on the most problematic block in the workflow. Our approach improves sample efficiency, enhances interpretability through…
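The Evaluation-Judge-Optimization-Update loop described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Trace` structure, the dictionary representation of a workflow, the `run`, `judge`, and `optimize` callables (plain stubs standing in for LLM calls), and the reciprocal-rank weighting in `aggregate_blame`. The abstract only states that scores are rank-based and that the optimizer targets the most problematic block; it does not give the scoring formula or the update rule.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Workflow = Dict[str, str]  # hypothetical: block name -> block configuration


@dataclass
class Trace:
    """Execution record of one workflow run (hypothetical structure)."""
    passed: bool
    block_outputs: Dict[str, str]


def aggregate_blame(rankings: Sequence[Sequence[str]]) -> Dict[str, float]:
    """Turn per-run blame rankings into responsibility scores.

    Each ranking lists block names from most to least responsible for one
    failed run. A block at 0-based position r earns 1 / (r + 1). This
    reciprocal-rank weighting is an assumption, not the paper's formula.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for position, block in enumerate(ranking):
            scores[block] += 1.0 / (position + 1)
    return dict(scores)


def optimization_step(
    workflow: Workflow,
    tasks: Sequence[str],
    run: Callable[[Workflow, str], Trace],
    judge: Callable[[Trace], List[str]],
    optimize: Callable[[str, str, List[Trace]], str],
) -> Workflow:
    """One pass of evaluate -> judge -> optimize -> update."""
    # Evaluation: execute the workflow on every task and keep the traces.
    traces = [run(workflow, task) for task in tasks]
    failed = [trace for trace in traces if not trace.passed]
    if not failed:
        return workflow  # nothing to diagnose

    # Judge: rank blocks per failed run, then aggregate across runs.
    scores = aggregate_blame([judge(trace) for trace in failed])
    target = max(scores, key=lambda block: scores[block])

    # Optimization + Update: rewrite only the most-blamed block.
    updated = dict(workflow)
    updated[target] = optimize(target, workflow[target], failed)
    return updated


# Toy stubs (hypothetical) so the sketch runs without any LLM.
def toy_run(workflow: Workflow, task: str) -> Trace:
    return Trace(passed=workflow["solve"] == "v1", block_outputs=dict(workflow))


def toy_judge(trace: Trace) -> List[str]:
    return ["solve", "plan"]  # always blames "solve" first


def toy_optimize(block: str, config: str, failed: List[Trace]) -> str:
    return "v1"


initial = {"plan": "v0", "solve": "v0"}
improved = optimization_step(initial, ["t1", "t2"], toy_run, toy_judge, toy_optimize)
print(improved)  # {'plan': 'v0', 'solve': 'v1'}
```

Restricting the edit to a single block keeps each optimizer call narrow, which is the property the abstract credits for better sample efficiency. A fuller version would presumably re-evaluate the edited workflow before accepting it, but the truncated abstract does not describe that step, so it is left out here.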
Peer Reviews
Decision: Submitted to ICLR 2026
The paper is very well written, and I commend the authors for the extra work of including pseudocode and useful figures.
**Baselines:** I'm not convinced that the current results properly demonstrate the effectiveness of the method:
- HumanEval and MBPP are widely regarded as saturated in code generation, partly because most frontier models have been trained on these datasets. Please show additional results on LiveCodeBench, AIME2025, SWE-Bench-Live, etc., which have more complex tasks.
- The set of baselines does not seem reflective of the current state of the art either. How does this compare with OpenHands, SWE-
1. Introduces block-level credit assignment for workflow refinement, moving beyond global end-to-end signals.
2. Experimental setup is clear and robust: multiple reasoning and coding benchmarks, consistent evaluation criteria, and ablations on logic blocks and judge components.
3. Well-written and easy to follow; modular structure and pseudocode clarify pipeline design. Case studies make the pipeline behavior interpretable.
4. Offers a more granular optimization signal for agentic workflows,
1. **Incremental novelty over AFlow** The pipeline structure closely follows AFlow, raising concerns about incremental contribution. The primary change is shifting judgment from operator-level to block-level, while reusing the same operator abstraction and workflow-editing paradigm. The paper does not explain why this granularity shift should fundamentally improve workflow optimization, and no theoretical justification is provided. Furthermore, the optimization remains benchmark-specific, with
S1. Identifies a key problem: The paper correctly identifies that prior workflow generation methods rely heavily on heuristic optimization with LLMs alone, lacking concrete optimization signals to guide the search process effectively.
S2. Improved optimization efficiency: The Judge-guided error attribution mechanism significantly enhances both efficiency and effectiveness of workflow optimization. As shown in Figure 4b, JudgeFlow achieves better performance than baseline AFlow with fewer optimi
S1. Computational efficiency: The optimization process requires evaluating all samples in the dataset at each iteration, which can be computationally expensive. A potential improvement would be to gradually reduce the validation set size during optimization to lower costs.
S2. Incremental contribution: While adding the Judge module is useful, the overall novelty appears limited. The paper primarily introduces optimization signals but lacks deeper insights or substantial architectural innovation
Taxonomy
Topics: Scientific Computing and Data Management · Machine Learning in Materials Science · Explainable Artificial Intelligence (XAI)
