In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores
Zeyu Tang, Sang T. Truong, Deonna Owens, Shreyas Sharma, Yibo Jacky Zhang, Brando Miranda, Sanmi Koyejo

TL;DR
This paper advocates for evaluating LLM fairness through in-situ conversational behavior rather than standardized tests, revealing more stable and generalizable behavioral signatures.
Contribution
The paper introduces MAC-Fairness, a multi-agent conversational framework that embeds controlled variation factors, such as identity, into multi-round dialogue to assess fairness.
Findings
Surface-level prompt construction choices account for the majority of variance in standardized-test scores and produce discordant model rankings.
In-situ evaluation shows stable, model-specific fairness behaviors.
Behavioral signatures generalize across different fairness benchmarks.
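To make the first finding concrete, the following is a minimal sketch of how one might measure the share of score variance attributable to prompt construction versus model identity. It is not the paper's analysis: the score table is invented toy data, and `variance_share` is a hypothetical helper that computes a simple eta-squared (between-group sum of squares over total sum of squares) for a balanced model-by-prompt-variant grid.

```python
from statistics import mean

# Hypothetical toy data (NOT results from the paper):
# scores[model][prompt_variant] = benchmark fairness score.
scores = {
    "model_a": {"v1": 0.60, "v2": 0.80, "v3": 0.40},
    "model_b": {"v1": 0.65, "v2": 0.75, "v3": 0.45},
}


def variance_share(table):
    """Return the fraction of total variance explained by each factor.

    Assumes a balanced design: every model is scored on every variant.
    """
    # Flatten to (model, variant, score) cells.
    cells = [(m, v, s) for m, row in table.items() for v, s in row.items()]
    grand = mean(s for _, _, s in cells)
    ss_total = sum((s - grand) ** 2 for _, _, s in cells)

    def ss_between(idx):
        # Group scores by the chosen factor (0 = model, 1 = prompt variant).
        groups = {}
        for cell in cells:
            groups.setdefault(cell[idx], []).append(cell[2])
        return sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())

    return {
        "model": ss_between(0) / ss_total,
        "prompt": ss_between(1) / ss_total,
    }


shares = variance_share(scores)
print(shares)
```

In this toy table the prompt variant explains far more variance than the model does, which is the pattern the finding describes. With real benchmark outputs, the same decomposition would need replicated runs and a check that the design is balanced.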
Abstract
LLM fairness should be evaluated through in-situ conversational behavior rather than standardized-test Q&A benchmarks. We show that the standardized-test paradigm can be structurally unreliable: surface-level prompt construction choices, although entirely orthogonal to the fairness question being tested, account for the majority of score variance, shift fairness conclusions in both direction and magnitude, and result in severe discordance in model rankings. We develop MAC-Fairness, a multi-agent conversational framework that embeds controlled variation factors into multi-round dialogue for in-situ behavior evaluation, examining how models' conversational behavior shifts when identity is varied as part of natural multi-agent interaction. Repurposing standardized-test questions as conversation seeds rather than as the evaluation instrument, we evaluate position persistence (how…
