CreditAudit: 2$^\text{nd}$ Dimension for LLM Evaluation and Selection

Yiliang Song; Hongjun An; Jiangong Xiao; Haofei Zhao; Jiawei Shao; Xuelong Li

arXiv:2602.02515 · cs.AI · February 5, 2026

TL;DR

CreditAudit introduces a new evaluation framework for language models that assesses both average performance and stability across different prompts, aiding more reliable deployment decisions.

Contribution

It proposes a deployment-oriented, 2D evaluation method that considers model stability and performance, improving upon traditional single-score benchmarks.

Findings

01. Models with similar average scores can have different stability profiles.

02. Stability risk can change model prioritization in high-stakes scenarios.

03. Credit grades help interpret model reliability and guide deployment choices.

Abstract

Leaderboard scores on public benchmarks have been steadily rising and converging, with many frontier language models now separated by only marginal differences. However, these scores often fail to match users' day-to-day experience, because system prompts, output protocols, and interaction modes evolve under routine iteration, and in agentic multi-step pipelines small protocol shifts can trigger disproportionate failures, leaving practitioners uncertain about which model to deploy. We propose CreditAudit, a deployment-oriented credit audit framework that evaluates models under a family of semantically aligned and non-adversarial system prompt templates across multiple benchmarks, reporting mean ability as average performance across scenarios and scenario-induced fluctuation sigma as a stability risk signal, and further mapping volatility into interpretable credit grades from AAA to BBB…
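The two reported quantities can be illustrated with a minimal sketch, assuming one score per system prompt template for each model. The function name `audit`, the volatility cut-offs, the intermediate grade labels between AAA and BBB, the choice of population standard deviation, and the sample scores are all hypothetical and are not taken from the paper, which may define the grading differently.

```python
from statistics import mean, pstdev

# Hypothetical volatility cut-offs (in score points); NOT from the paper.
# The intermediate labels AA and A are also an assumption.
GRADE_THRESHOLDS = [(1.0, "AAA"), (2.0, "AA"), (3.0, "A")]
LOWEST_GRADE = "BBB"


def audit(scores):
    """Return (mean ability, fluctuation sigma, grade) for one model.

    `scores` holds one benchmark score per system prompt template.
    """
    mu = mean(scores)
    # Population standard deviation across templates (an assumption).
    sigma = pstdev(scores)
    for cutoff, grade in GRADE_THRESHOLDS:
        if sigma <= cutoff:
            return mu, sigma, grade
    return mu, sigma, LOWEST_GRADE


# Made-up scores for two hypothetical models under four templates.
model_a = [80.0, 80.5, 79.5, 80.0]
model_b = [86.0, 74.0, 84.0, 76.0]

for name, scores in [("model_a", model_a), ("model_b", model_b)]:
    mu, sigma, grade = audit(scores)
    print(f"{name}: mean={mu:.1f} sigma={sigma:.2f} grade={grade}")
```

In this toy data both models have the same mean, yet the second fluctuates far more across templates and so lands in a lower grade, which is the kind of difference a single averaged score would hide.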

Taxonomy

Topics: Software System Performance and Reliability · Software-Defined Networks and 5G · Adversarial Robustness in Machine Learning