Failure-Centered Runtime Evaluation for Deployed Trilingual Public-Space Agents
M. Meng

TL;DR
This paper introduces PSA-Eval, a failure-centered runtime evaluation framework for deployed trilingual public-space agents. It treats the failure case, rather than the static score, as the basic unit of analysis.
Contribution
PSA-Eval extends the conventional Question -> Answer -> Score chain with failure cases, repair, and regression batches, so that failures in real-world multilingual agent deployments become traceable, reviewable, repairable, and regression-testable.
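As a rough illustration of how a failure could be carried from a scored run into a regression batch, the sketch below uses hypothetical record types. The names `RunResult`, `FailureCase`, `collect_failures`, `regression_batch`, and the score threshold are assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunResult:
    """One scored answer produced by a run of a batch (hypothetical schema)."""
    question_id: str
    batch_id: str
    run_id: str
    score: float


@dataclass
class FailureCase:
    """A failure that keeps pointers back to the batch and run that produced it."""
    question_id: str
    source_batch_id: str
    source_run_id: str
    score: float
    repaired: bool = False


def collect_failures(results: List[RunResult], threshold: float) -> List[FailureCase]:
    """Turn every result scoring below the (assumed) threshold into a failure case."""
    return [
        FailureCase(r.question_id, r.batch_id, r.run_id, r.score)
        for r in results
        if r.score < threshold
    ]


def regression_batch(failures: List[FailureCase]) -> List[str]:
    """Return the question ids of repaired failures, in order, without duplicates."""
    question_ids: List[str] = []
    for failure in failures:
        if failure.repaired and failure.question_id not in question_ids:
            question_ids.append(failure.question_id)
    return question_ids


# Illustrative values only; these are not results from the paper.
example_results = [
    RunResult("q1", "batch-1", "run-1", 0.9),
    RunResult("q2", "batch-1", "run-1", 0.3),
]
example_failures = collect_failures(example_results, threshold=0.5)
example_failures[0].repaired = True  # mark the failure as repaired after review
print(regression_batch(example_failures))
```

Keeping the source batch and run identifiers on each failure case is one simple way to make a failure traceable; the paper's own schema may differ.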
Findings
The pilot analyzed 81 samples in a real deployment setting.
14 groups of trilingual equivalent inputs showed non-zero cross-language score drift.
The maximum observed drift was 9 points.
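The summary does not define how drift is computed. A minimal sketch, assuming drift for a group is the gap between the highest and lowest score across its language variants, could look like the following. The function name `group_drift`, the language labels, and all scores are hypothetical and are not data from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def group_drift(records: Iterable[Tuple[str, str, float]]) -> Dict[str, float]:
    """Map each group id to (max score - min score) across its language variants.

    Each record is (group_id, language, score). The max-minus-min definition
    is an assumption made for this sketch.
    """
    scores_by_group: Dict[str, Dict[str, float]] = defaultdict(dict)
    for group_id, language, score in records:
        scores_by_group[group_id][language] = score
    return {
        group_id: max(scores.values()) - min(scores.values())
        for group_id, scores in scores_by_group.items()
    }


# Illustrative scores only; language labels are placeholders.
records = [
    ("g1", "lang_a", 90), ("g1", "lang_b", 90), ("g1", "lang_c", 90),
    ("g2", "lang_a", 85), ("g2", "lang_b", 80), ("g2", "lang_c", 78),
]
drift = group_drift(records)
nonzero = {group_id: d for group_id, d in drift.items() if d > 0}
print(drift)    # {'g1': 0, 'g2': 7}
print(nonzero)  # {'g2': 7}
```

Under this definition, counting the entries in `nonzero` and taking the largest value would give the kind of group-level figures reported in the findings.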
Abstract
This paper presents PSA-Eval, a failure-centered runtime evaluation framework for deployed trilingual public-space agents. The central claim is that, when the evaluation object shifts from a static input-output mapping to a runtime system, the basic unit of analysis should shift from score to failure. PSA-Eval extends the conventional chain Question -> Answer -> Score -> End into Question -> Batch -> Run -> Score -> Failure Case -> Repair -> Regression Batch, making failures traceable, reviewable, repairable, and regression-testable. The framework uses trilingual equivalent inputs as controlled probes for observing group-level cross-language policy drift. We conduct a pilot study on a real trilingual digital front-desk system deployed in the lobby of an international financial institution. The pilot uses a simplified single-foundation-model setting (MA = MB), so the observed drift…
