A Framework for Assessing AI Agent Decisions and Outcomes in AutoML Pipelines

Gaoyuan Du; Amit Ahlawat; Xiaoyang Liu; Jing Wu

arXiv:2602.22442 · cs.AI · March 17, 2026

TL;DR

This paper introduces a decision-centric evaluation framework for AutoML agents, enabling detailed assessment of intermediate decisions to improve transparency, reliability, and interpretability of autonomous machine learning systems.

Contribution

The paper proposes an Evaluation Agent (EA) that assesses decision quality at multiple pipeline stages, addressing the lack of structured, decision-level metrics in existing AutoML evaluation practice.

Findings

01  EA detects faulty decisions with a 91.9% F1 score

02  EA identifies reasoning inconsistencies independent of outcomes

03  EA attributes performance changes to specific agent decisions
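
For readers unfamiliar with the metric in the first finding, F1 is the harmonic mean of precision and recall. The sketch below assumes faulty-decision detection is scored as binary classification over individual decisions. The function name and the toy labels are illustrative and are not taken from the paper.

```python
def f1_score(true_faulty, flagged_faulty):
    """F1 for binary faulty-decision detection (illustrative helper)."""
    pairs = list(zip(true_faulty, flagged_faulty))
    tp = sum(1 for t, p in pairs if t and p)        # faulty and flagged
    fp = sum(1 for t, p in pairs if not t and p)    # sound but flagged
    fn = sum(1 for t, p in pairs if t and not p)    # faulty but missed
    if tp == 0:
        return 0.0  # no correct detections: define F1 as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, made up for illustration only.
truth = [True, True, True, False, False, True]
flagged = [True, True, False, False, True, True]
score = f1_score(truth, flagged)
```

Because F1 ignores true negatives, it stays informative when faulty decisions are rare relative to sound ones.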

Abstract

Agent-based AutoML systems rely on large language models to make complex, multi-stage decisions across data processing, model selection, and evaluation. However, existing evaluation practices remain outcome-centric, focusing primarily on final task performance. Through a review of prior work, we find that none of the surveyed agentic AutoML systems report structured, decision-level evaluation metrics intended for post-hoc assessment of intermediate decision quality. To address this limitation, we propose an Evaluation Agent (EA) that performs decision-centric assessment of AutoML agents without interfering with their execution. The EA is designed as an observer that evaluates intermediate decisions along four dimensions: decision validity, reasoning consistency, model quality risks beyond accuracy, and counterfactual decision impact. Across four proof-of-concept experiments, we…
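
The observer design described in the abstract can be pictured with a small sketch. This is not the authors' implementation: the class names, field names, and placeholder scorers are all hypothetical, and only the four dimension names come from the abstract. The point is that the evaluator receives read-only records of decisions and never calls back into the agent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# The four assessment dimensions named in the abstract.
DIMENSIONS = (
    "decision_validity",
    "reasoning_consistency",
    "model_quality_risk",
    "counterfactual_impact",
)


@dataclass(frozen=True)
class DecisionRecord:
    """Immutable log entry for one agent decision (hypothetical schema)."""
    stage: str      # e.g. "data_processing", "model_selection"
    action: str     # what the agent chose to do
    rationale: str  # the agent's stated reasoning


@dataclass
class Assessment:
    record: DecisionRecord
    scores: Dict[str, float]


class EvaluationObserver:
    """Scores logged decisions without touching the agent's execution."""

    def __init__(self, scorers: Dict[str, Callable[[DecisionRecord], float]]):
        missing = set(DIMENSIONS) - set(scorers)
        if missing:
            raise ValueError(f"missing scorers: {sorted(missing)}")
        self._scorers = scorers
        self.assessments: List[Assessment] = []

    def observe(self, record: DecisionRecord) -> Assessment:
        # Apply one scorer per dimension to the read-only record.
        scores = {dim: self._scorers[dim](record) for dim in DIMENSIONS}
        assessment = Assessment(record, scores)
        self.assessments.append(assessment)
        return assessment


# Placeholder scorers: a real system would plug in actual checks here.
placeholder = {dim: (lambda record: 0.0) for dim in DIMENSIONS}
observer = EvaluationObserver(placeholder)
result = observer.observe(
    DecisionRecord("model_selection", "choose gradient boosting", "tabular data")
)
```

Freezing `DecisionRecord` is one way to make the non-interference property explicit: a scorer cannot mutate what the agent logged.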

Taxonomy

Topics: Machine Learning and Data Classification · Topic Modeling · Ethics and Social Impacts of AI