Beyond Task Success: Measuring Workflow Fidelity in LLM-Based Agentic Payment Systems
Donghao Huang, Joon Kiat Chua, and Zhaoxia Wang

TL;DR
This paper introduces the Agentic Success Rate (ASR), a metric for evaluating the fidelity of agent workflows in LLM-based payment systems. ASR reveals workflow deviations that outcome-based and routing-based metrics miss.
Contribution
The paper proposes ASR, a trajectory-fidelity metric, and demonstrates its effectiveness in uncovering workflow deviations and guiding prompt improvements in multi-agent payment systems.
Findings
ASR uncovers hidden workflow shortcuts that neither Task Success Rate (TSR) nor Agent Handoff F1-Score (HF1) detects.
Prompt refinements guided by ASR diagnostics significantly improve task success rates.
GPT-5.2 achieves perfect ASR, indicating high workflow fidelity.
Abstract
LLM-based multi-agent systems are increasingly deployed for payment workflows, yet prevailing metrics, Task Success Rate (TSR) and Agent Handoff F1-Score (HF1), capture only final outcomes or unordered routing decisions. We introduce the Agentic Success Rate (ASR), a trajectory-fidelity metric that compares observed and expected agent execution sequences at the transition level, decomposing performance into Transition Recall and Transition Precision. Applied to the Hierarchical Multi-Agent System for Payments (HMASP) across 18 LLMs and 90,000 task instances, ASR reveals that 10 of 18 models systematically skip a confirmation checkpoint during payment checkout, a deviation invisible to both TSR and HF1, while 8 models enforce the checkpoint perfectly. Notably, GPT-4.1 exhibits hidden workflow shortcuts despite achieving perfect TSR and HF1, while GPT-5.2 achieves perfect ASR. Prompt…
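To make the transition-level comparison concrete, here is a minimal sketch of how Transition Recall and Transition Precision might be computed. It is an illustration under stated assumptions, not the paper's definition: it assumes a transition is a consecutive (from_agent, to_agent) pair, that matches are counted as a multiset intersection, and that an empty denominator scores 1.0. The agent names and the function names `transitions` and `transition_scores` are hypothetical.

```python
from collections import Counter


def transitions(seq):
    """Return the multiset of consecutive (from_agent, to_agent) pairs."""
    return Counter(zip(seq, seq[1:]))


def transition_scores(expected, observed):
    """Return (recall, precision) over transitions.

    Assumption: recall is matched / expected transitions and precision is
    matched / observed transitions. An empty denominator yields 1.0 by
    convention in this sketch.
    """
    exp, obs = transitions(expected), transitions(observed)
    matched = sum((exp & obs).values())  # multiset intersection
    recall = matched / sum(exp.values()) if exp else 1.0
    precision = matched / sum(obs.values()) if obs else 1.0
    return recall, precision


# Hypothetical agent names; the real HMASP agents may differ.
expected = ["supervisor", "cart", "confirm", "payment"]
skipped = ["supervisor", "cart", "payment"]  # confirmation step skipped

recall, precision = transition_scores(expected, skipped)
print(f"recall={recall:.3f} precision={precision:.3f}")
```

In this toy case the skipped checkpoint removes two expected transitions and introduces one unexpected transition, so both scores fall below 1 even if the payment itself completes.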
