Failure-Centered Runtime Evaluation for Deployed Trilingual Public-Space Agents

M. Meng

arXiv:2604.23990 · cs.AI · April 28, 2026

TL;DR

This paper introduces PSA-Eval, a failure-centered runtime evaluation framework for deployed trilingual public-space agents. It treats the failure, rather than the static score, as the basic unit of analysis once the thing being evaluated is a running system.

Contribution

It extends the conventional Question -> Answer -> Score chain with failure cases, repair, and regression batches, so that failures in real-world multilingual agent deployments become traceable, reviewable, repairable, and regression-testable.

Findings

01. 81 samples analyzed in a real deployment setting.

02. 14 groups showed non-zero cross-language score drift.

03. Maximum drift observed was 9 points.
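
The drift figures above can be read as per-group score spreads. The following is a minimal sketch, not the paper's implementation. It assumes that a group is one question asked in three equivalent language versions, and that a group's drift is the gap between its highest and lowest score; this summary does not state the paper's exact definition. The records, the language labels (lang_a, lang_b, lang_c), and the function name group_drift are all hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (group_id, language_label, score).
# Values are illustrative only and are NOT data from the paper.
records = [
    ("g1", "lang_a", 90), ("g1", "lang_b", 90), ("g1", "lang_c", 90),
    ("g2", "lang_a", 85), ("g2", "lang_b", 80), ("g2", "lang_c", 76),
]


def group_drift(records):
    """Return {group_id: max score - min score} across language versions.

    Assumption: drift is the score spread within a group of equivalent
    inputs; the paper may define it differently.
    """
    scores = defaultdict(dict)
    for group_id, lang, score in records:
        scores[group_id][lang] = score
    return {g: max(s.values()) - min(s.values()) for g, s in scores.items()}


drift = group_drift(records)

# Groups whose language versions did not all receive the same score.
drifting = {g: d for g, d in drift.items() if d > 0}

# Largest spread observed in any group.
max_drift = max(drift.values())
```

Under this reading, the number of entries in drifting would correspond to the count of groups with non-zero drift, and max_drift to the maximum drift reported.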

Abstract

This paper presents PSA-Eval, a failure-centered runtime evaluation framework for deployed trilingual public-space agents. The central claim is that, when the evaluation object shifts from a static input-output mapping to a runtime system, the basic unit of analysis should shift from score to failure. PSA-Eval extends the conventional chain Question -> Answer -> Score -> End into Question -> Batch -> Run -> Score -> Failure Case -> Repair -> Regression Batch, making failures traceable, reviewable, repairable, and regression-testable. The framework uses trilingual equivalent inputs as controlled probes for observing group-level cross-language policy drift. We conduct a pilot study on a real trilingual digital front-desk system deployed in the lobby of an international financial institution. The pilot uses a simplified single-foundation-model setting (MA = MB), so the observed drift…
