Beyond Task Success: Measuring Workflow Fidelity in LLM-Based Agentic Payment Systems

Donghao Huang, Joon Kiat Chua, and Zhaoxia Wang

arXiv:2605.06457·cs.AI·May 8, 2026

TL;DR

This paper introduces the Agentic Success Rate (ASR), a trajectory-fidelity metric for agent workflows in LLM-based payment systems. ASR exposes workflow deviations that Task Success Rate (TSR) and Agent Handoff F1-Score (HF1) do not detect.

Contribution

The paper proposes ASR, a trajectory-fidelity metric, and demonstrates its effectiveness in uncovering workflow deviations and guiding prompt improvements in multi-agent payment systems.

Findings

01

ASR shows that 10 of 18 models systematically skip a confirmation checkpoint during payment checkout, a shortcut that neither TSR nor HF1 detects.

02

Prompt refinements guided by ASR diagnostics significantly improve task success rates.

03

GPT-5.2 achieves perfect ASR, whereas GPT-4.1 takes hidden workflow shortcuts despite perfect TSR and HF1.

Abstract

LLM-based multi-agent systems are increasingly deployed for payment workflows, yet prevailing metrics, Task Success Rate (TSR) and Agent Handoff F1-Score (HF1), capture only final outcomes or unordered routing decisions. We introduce the Agentic Success Rate (ASR), a trajectory-fidelity metric that compares observed and expected agent execution sequences at the transition level, decomposing performance into Transition Recall and Transition Precision. Applied to the Hierarchical Multi-Agent System for Payments (HMASP) across 18 LLMs and 90,000 task instances, ASR reveals that 10 of 18 models systematically skip a confirmation checkpoint during payment checkout, a deviation invisible to both TSR and HF1, while 8 models enforce the checkpoint perfectly. Notably, GPT-4.1 exhibits hidden workflow shortcuts despite achieving perfect TSR and HF1, while GPT-5.2 achieves perfect ASR. Prompt…
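The transition-level comparison described in the abstract can be illustrated with a minimal sketch. This is not the paper's formula: it assumes a transition is a consecutive (from, to) pair of agents, that Transition Recall is the share of expected transitions that were observed, and that Transition Precision is the share of observed transitions that were expected. The agent names are hypothetical, and how the two components combine into ASR is not stated in this excerpt, so the sketch stops at recall and precision.

```python
from collections import Counter


def transitions(seq):
    """Return the multiset of consecutive (from, to) agent pairs in a trajectory."""
    return Counter(zip(seq, seq[1:]))


def transition_scores(expected, observed):
    """Compute (recall, precision) over transitions; definitions are assumed, not the paper's."""
    exp, obs = transitions(expected), transitions(observed)
    # Multiset intersection counts transitions present in both trajectories.
    matched = sum((exp & obs).values())
    recall = matched / sum(exp.values()) if exp else 1.0
    precision = matched / sum(obs.values()) if obs else 1.0
    return recall, precision


# Hypothetical agent names; the expected checkout path includes a confirmation step.
expected = ["supervisor", "cart", "confirmation", "payment"]
# An observed run that skips the confirmation checkpoint.
shortcut = ["supervisor", "cart", "payment"]

recall, precision = transition_scores(expected, shortcut)
print(f"recall={recall:.3f} precision={precision:.3f}")
```

In this toy case the shortcut run can still end in a completed payment, yet only one of the three expected transitions appears in the observed trajectory, and one of the two observed transitions is not in the expected path. Both scores therefore fall below 1, which is the kind of deviation an outcome-only check would not register.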