Agentic Confidence Calibration

Jiaxin Zhang; Caiming Xiong; Chien-Sheng Wu

arXiv:2601.15778 · cs.AI · January 23, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces Holistic Trajectory Calibration (HTC), a new framework for calibrating AI agent confidence throughout complex, multi-step tasks, improving reliability and interpretability in high-stakes applications.

Contribution

The paper presents HTC, a novel process-level calibration method for agentic systems that is transferable, interpretable, and performs strongly out of domain.

Findings

01

HTC outperforms strong baselines in calibration and discrimination.

02

HTC provides interpretability by revealing failure signals.

03

The pretrained, general-purpose agent calibrator (GAC) achieves the best calibration on out-of-domain benchmarks.
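The findings above refer to calibration and discrimination without naming metrics on this page. As a minimal sketch, assuming the common choices of expected calibration error (ECE) for calibration and AUROC for discrimination, the two quantities can be computed as follows. The function names and the 10-bin default are illustrative and are not taken from the paper.

```python
from typing import Sequence


def expected_calibration_error(
    conf: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    n = len(conf)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(conf, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece


def auroc(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Probability that a successful run gets higher confidence than a failed one."""
    pos = [c for c, y in zip(conf, correct) if y == 1]
    neg = [c for c, y in zip(conf, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one success and one failure")
    # Ties count as half a win (pairwise definition of AUROC).
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# An agent that reports 0.9 on every task but succeeds only once in four:
overconfident_ece = expected_calibration_error([0.9] * 4, [1, 0, 0, 0])
overconfident_auroc = auroc([0.9] * 4, [1, 0, 0, 0])
```

A low ECE means stated confidence matches the observed success rate; a high AUROC means confidence separates successes from failures. The two can diverge, which is why both are usually reported.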

Abstract

AI agents are rapidly advancing from passive language models to autonomous systems executing complex, multi-step tasks. Yet their overconfidence in failure remains a fundamental barrier to deployment in high-stakes settings. Existing calibration methods, built for static single-turn outputs, cannot address the unique challenges of agentic systems, such as compounding errors along trajectories, uncertainty from external tools, and opaque failure modes. To address these challenges, we introduce, for the first time, the problem of Agentic Confidence Calibration and propose Holistic Trajectory Calibration (HTC), a novel diagnostic framework that extracts rich process-level features ranging from macro dynamics to micro stability across an agent's entire trajectory. Powered by a simple, interpretable model, HTC consistently surpasses strong baselines in both calibration and discrimination,…
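To make the abstract's description more concrete, here is a small sketch of a trajectory-level calibrator: it summarizes per-step token log-probabilities into a few features and fits a logistic regression that maps them to a success probability. The four features, the helper names, the hyperparameters, and the toy data are hypothetical stand-ins; the paper's actual feature set and model are not given on this page.

```python
import math
from statistics import mean, pstdev


def trajectory_features(step_logprobs):
    """step_logprobs: one list of token log-probabilities per agent step."""
    step_conf = [mean(math.exp(lp) for lp in step) for step in step_logprobs]
    return [
        mean(step_conf),               # overall confidence level
        min(step_conf),                # weakest step in the trajectory
        step_conf[-1] - step_conf[0],  # macro dynamics: drift from first to last step
        pstdev(step_conf),             # micro stability: spread across steps
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, feats):
    """Calibrated probability that the trajectory ends in success."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, feats)))


def fit_logistic(X, y, lr=0.5, epochs=2000, l2=0.01):
    """Plain batch gradient descent on L2-regularized log loss."""
    weights, bias, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for feats, label in zip(X, y):
            err = predict(weights, bias, feats) - label
            for j, x in enumerate(feats):
                grad_w[j] += err * x
            grad_b += err
        weights = [w - lr * (g / n + l2 * w) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Toy trajectories (made up): two that succeeded, two that failed.
successes = [
    [[-0.05, -0.1], [-0.1, -0.05], [-0.02, -0.08]],
    [[-0.01, -0.03], [-0.2, -0.1]],
]
failures = [
    [[-0.1, -0.2], [-1.5, -2.0], [-2.5, -1.0]],
    [[-0.3, -0.4], [-1.0, -1.2]],
]
X = [trajectory_features(t) for t in successes + failures]
y = [1, 1, 0, 0]
weights, bias = fit_logistic(X, y)
```

Because the model is linear in a handful of named features, the learned weights can be read directly to see which trajectory signals push the predicted confidence up or down. Note that this sketch relies on token log-probabilities being available from the underlying model.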

Peer Reviews

Decision · ICLR 2026 Conference Desk Rejected Submission

Reviewer 01 · Rating 8 · Confidence 3

Strengths

* Calibration for agentic systems with multi-step, interactive trajectories is a timely and important topic, and this paper presents a great initial effort to address the gap.
* The proposed framework, which decomposes different uncertainty signals in the agent trajectories and learns a calibration function to map features to a calibrated confidence score, is simple, lightweight, interpretable, and novel. The pretrained, general-purpose agent calibrator achieves good generalization on challengin

Weaknesses

* While the task and model selection for the evaluation is quite comprehensive, the baselines are largely simple verbalized or token-logprob-based methods. A few related methods (cited but not compared) could be included as potentially stronger baselines [1, 2].
* The method requires log-prob access, which is not available for black-box models.

[1] UProp: Investigating the Uncertainty Propagation of LLMs in Multi-Step Agentic Decision-Making
[2] SAUP: Situation Awareness Uncertainty

Reviewer 02 · Rating 4 · Confidence 3

Strengths

The problem formulation of trajectory-level calibration (i.e., HTC) and the design principles are well defined and executed.

Weaknesses

Exploration of agentic frameworks beyond smolagents and CodeAct is very limited. Unfortunately, the experiment sections read somewhat like a laundry list of what the framework is capable of, without clearly demonstrating its strengths or validating it as a calibration method against non-process-based methods.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- Experimental setup is rigorous, comparisons are conducted on a wide variety of benchmarks, and the baselines mentioned in the paper are well thought out.
- Problem formulation is sound. The paper adequately motivates the setup, and justifies the problem's importance by tying it to real-world challenges faced during the deployment of AI agents.
- Results show promise as a general-purpose method for estimating and implementing calibration.

Weaknesses

- The writing could be improved. For example, the structure of the results section is a bit all over the place: it presents the experimental results, then a discussion of the results, and then goes back to the results, which is confusing for the reader.
- Many concepts are introduced but never used again. For example, I was confused about where the learning-based baselines are compared to HTC. I couldn't see these results either in the main paper or the appendix (I

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI)