PerfGuard: A Performance-Aware Agent for Visual Content Generation
Zhipeng Chen, Zhongrui Zhang, Chao Zhang, Yifan Xu, Lan Yang, Jun Liu, Ke Li, Yi-Zhe Song

TL;DR
PerfGuard is a novel framework for visual content generation that models and leverages tool performance boundaries to improve planning, selection, and execution reliability in AI-generated content tasks.
Contribution
PerfGuard introduces a performance-aware agent framework with three core mechanisms, covering tool selection (PASM), online adaptation (APU), and planner alignment (CAPO), to enhance tool selection and task planning in visual content generation.
Findings
Outperforms state-of-the-art methods in tool selection accuracy.
Improves execution reliability in visual content tasks.
Enhances alignment with user intent.
Abstract
The advancement of Large Language Model (LLM)-powered agents has enabled automated task processing through reasoning and tool invocation capabilities. However, existing frameworks often operate under the idealized assumption that tool executions are invariably successful, relying solely on textual descriptions that fail to distinguish precise performance boundaries and cannot adapt to iterative tool updates. This gap introduces uncertainty in planning and execution, particularly in domains like visual content generation (AIGC), where nuanced tool performance significantly impacts outcomes. To address this, we propose PerfGuard, a performance-aware agent framework for visual content generation that systematically models tool performance boundaries and integrates them into task planning and scheduling. Our framework introduces three core mechanisms: (1) Performance-Aware Selection…
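To make the contrast with description-based selection concrete, the following is a minimal sketch of choosing a tool from per-dimension performance scores. It is not the paper's Performance-Aware Selection mechanism, whose details are not given here. The `ToolProfile` class, the dimension names, the tool names, and the weighted-average scoring rule are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class ToolProfile:
    """Hypothetical record of a tool's measured performance."""
    name: str
    scores: dict[str, float]  # assumed per-dimension scores in [0, 1]


def select_tool(tools: list[ToolProfile], weights: dict[str, float]) -> ToolProfile:
    """Return the tool with the highest weighted average score.

    `weights` states how much each dimension matters for the current task.
    A dimension missing from a tool's profile is scored as 0.0 (an assumption).
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    def weighted(tool: ToolProfile) -> float:
        return sum(w * tool.scores.get(dim, 0.0) for dim, w in weights.items()) / total

    return max(tools, key=weighted)


# Placeholder tools and dimensions, not taken from the paper.
tools = [
    ToolProfile("tool_a", {"text_rendering": 0.9, "object_count": 0.4}),
    ToolProfile("tool_b", {"text_rendering": 0.5, "object_count": 0.8}),
]

# A task that mostly needs legible text in the image.
text_heavy = select_tool(tools, {"text_rendering": 1.0, "object_count": 0.25})
# A task that mostly needs the right number of objects.
count_heavy = select_tool(tools, {"text_rendering": 0.0, "object_count": 1.0})
print(text_heavy.name, count_heavy.name)
```

The point of the sketch is that two tools with similar textual descriptions can rank differently once the task's requirements are expressed over measured dimensions.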
Peer Reviews
Decision: ICLR 2026 Poster
1. The framework introduces a novel approach to visual content generation by incorporating performance-aware mechanisms.
2. The introduction of a multi-dimensional performance evaluation system for tools is a clear strength. It provides a more detailed and reliable method for tool selection, addressing the limitations of previous systems that relied on general textual descriptions.
1. The method heavily relies on the context-learning capabilities of large language models (LLMs). While this is an interesting approach, it may not offer substantial improvements over previous systems that already utilize LLMs for similar tasks.
2. The paper does not sufficiently discuss existing methods, particularly in relation to visual content editing tools. For example, "CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update" (CVPR 2024) has already considered tool evaluation an
Novel and Important Problem: The paper correctly identifies a key weakness in current agent frameworks: the "idealized assumption" of tool capabilities. The core idea of modeling fine-grained "performance boundaries" (PASM) is a significant and logical step forward for the field.
Complete Framework: The authors propose a complete, end-to-end vision, including selection (PASM), online adaptation (APU), and planner alignment (CAPO). This holistic approach is commendable.
Clear Presentation: The
Unfair Comparison & Critical Confounding Variable: This is the most severe flaw. According to Appendix A.2, the framework uses GPT-4o for the Analyst and Self-Evaluator roles. This Self-Evaluator (GPT-4o) provides the entire learning signal for both the CAPO training (via Eq. 4 & 5) and the APU updates (via $R_{actual}$). The Analyst (GPT-4o) is also used during inference. The baseline methods (e.g., GenArtist) are not afforded this powerful, external model. Therefore, the SOTA claims are invali
- The paper tackles an overlooked problem in agent-based visual generation. Most prior work assumes tools execute reliably, but this paper explicitly models tool performance boundaries. The combination of PASM, APU, and CAPO feels well thought out and addresses the problem systematically rather than with isolated fixes.
- The experiments are thorough. The authors compare against multiple baselines including diffusion models, CoT-based methods, and other agent systems across different task types
1. The tool library only includes a specific set of popular models like FLUX, SD3, and Step1X Edit. I'm curious why certain relevant tools are excluded. For example, what about newer 2025 models or domain-specific editing tools? This raises questions about how well the framework generalizes to different tool ecosystems.
2. When adding new tools, the method initializes performance scores by averaging similar tools. This seems reasonable but might miss unique capabilities of novel tools. Have the
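The cold-start initialization this review describes (averaging the scores of similar tools) and the feedback-driven updates mentioned in an earlier review (via $R_{actual}$) can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's APU procedure. The function names, the dimension names, the way similar tools are supplied, and the exponential-moving-average update rule are all hypothetical.

```python
def init_from_similar(similar_scores: list[dict[str, float]]) -> dict[str, float]:
    """Initialize a new tool's scores as the per-dimension mean of similar tools.

    Each dimension is averaged only over the tools that report it (an assumption).
    """
    if not similar_scores:
        raise ValueError("need at least one similar tool")
    dims = sorted({d for scores in similar_scores for d in scores})
    initial = {}
    for d in dims:
        values = [scores[d] for scores in similar_scores if d in scores]
        initial[d] = sum(values) / len(values)
    return initial


def update_score(prior: float, r_actual: float, rate: float = 0.2) -> float:
    """Blend the stored score with an observed outcome.

    An exponential moving average is used purely for illustration;
    the paper's actual update rule is not reproduced here.
    """
    return (1.0 - rate) * prior + rate * r_actual


# Placeholder profiles of two tools judged similar to the new one.
similar = [
    {"text_rendering": 0.75, "object_count": 0.5},
    {"text_rendering": 0.25, "object_count": 1.0},
]
new_tool = init_from_similar(similar)

# After one execution that an evaluator scored 1.0 on text rendering:
new_tool["text_rendering"] = update_score(new_tool["text_rendering"], r_actual=1.0)
print(new_tool)
```

The sketch also makes the reviewer's concern visible: the averaged prior carries no information unique to the new tool, so the estimate only moves away from its neighbors as execution feedback accumulates.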
Taxonomy
Topics: Multimodal Machine Learning Applications · Artificial Intelligence in Games · Mobile Crowdsensing and Crowdsourcing
