Measuring the Unmeasurable: Markov Chain Reliability for LLM Agents
Phat T. Tran-Truong, Xuan-Bach Le

TL;DR
This paper introduces TraceToChain, a pipeline that models LLM agent traces as Markov chains, enabling detailed reliability analysis, diagnostics, and uncertainty quantification beyond traditional scalar metrics.
Contribution
It presents a reproducible method for fitting agent execution traces to Markov chains with diagnostics, uncertainty estimates, and a unified success-time distribution framework.
Findings
TraceToChain accurately fits agent traces with high goodness-of-fit.
The approach unifies various reliability metrics into a single success-time distribution.
Empirical tests show close alignment between fitted models and observed data.
Abstract
Large language model (LLM) agents increasingly operate as sequential software systems, but their reliability is often summarized by scalar benchmark metrics. Metrics such as pass@k, pass^k, and the reliability decay curve (RDC) are useful summaries, but they do not identify the success-time distribution being estimated, test whether traces support that distribution, or quantify finite-trace uncertainty. We present TraceToChain, a reproducible pipeline that fits agent execution traces to an absorbing discrete-time Markov chain (DTMC), with explicit diagnostics and uncertainty. The pipeline builds an automatic cluster taxonomy, estimates transitions with Laplace-smoothed maximum-likelihood estimation (MLE), checks fit with a composite Akaike information criterion (AIC) and Kolmogorov--Smirnov (KS) goodness-of-fit certificate,…
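The transition-estimation step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the state names, the toy traces, the smoothing constant `alpha`, and the helper names `fit_transitions` and `success_time_cdf` are all assumptions made for illustration, and the states are hand-labeled here rather than produced by the paper's automatic cluster taxonomy. The sketch fits a Laplace-smoothed transition matrix from traces and then propagates the start distribution forward to get the probability of having reached the success state within t steps.

```python
from collections import Counter

def fit_transitions(traces, states, absorbing, alpha=1.0):
    """Laplace-smoothed MLE of a row-stochastic transition matrix.

    `alpha` is a hypothetical smoothing constant (add-alpha smoothing).
    Absorbing states are forced to self-loop with probability 1.
    """
    counts = Counter()
    for trace in traces:
        for src, dst in zip(trace, trace[1:]):
            counts[(src, dst)] += 1

    P = {}
    for s in states:
        if s in absorbing:
            P[s] = {t: 1.0 if t == s else 0.0 for t in states}
            continue
        total = sum(counts[(s, t)] for t in states) + alpha * len(states)
        P[s] = {t: (counts[(s, t)] + alpha) / total for t in states}
    return P

def success_time_cdf(P, start, success, horizon):
    """P(chain has been absorbed in `success` within t steps), t = 0..horizon."""
    dist = {s: 0.0 for s in P}
    dist[start] = 1.0
    cdf = [dist[success]]
    for _ in range(horizon):
        nxt = {s: 0.0 for s in P}
        for s, mass in dist.items():
            for t, p in P[s].items():
                nxt[t] += mass * p
        dist = nxt
        cdf.append(dist[success])
    return cdf

# Toy, hand-labeled traces (illustrative only, not data from the paper).
states = ["plan", "act", "success", "fail"]
absorbing = {"success", "fail"}
traces = [
    ["plan", "act", "success"],
    ["plan", "act", "act", "fail"],
    ["plan", "act", "success"],
]

P = fit_transitions(traces, states, absorbing)
cdf = success_time_cdf(P, start="plan", success="success", horizon=20)
```

Because the success state is absorbing, the mass on it can only grow, so `cdf` is a cumulative success-time distribution. Scalar summaries such as the success probability at a fixed step budget can then be read off that single curve, which is the kind of unification the Findings describe.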
