Counterfactual Fairness Evaluation of LLM-Based Contact Center Agent Quality Assurance System

Kawin Mayilvaghanan; Siddhant Gupta; Ayush Kumar

arXiv:2602.14970 · cs.CL · February 17, 2026

Open Access

TL;DR

This paper evaluates the fairness of LLM-based contact center QA systems across 13 bias dimensions. It finds systematic disparities between counterfactual pairs, shows that fairness-aware prompts help only to a limited degree, and argues for standardized fairness audits.

Contribution

The paper introduces a counterfactual fairness evaluation framework for LLM-based QA systems, applies it to real-world contact center transcripts, and analyzes bias sources and mitigation strategies.

Findings

01

Fairness metrics show systematic disparities across models and bias dimensions, with CFR ranging from 5.4% to 13.0%.

02

Contextual priming significantly degrades fairness.

03

Fairness-aware prompting offers only modest improvements.

Abstract

Large Language Models (LLMs) are increasingly deployed in contact-center Quality Assurance (QA) to automate agent performance evaluation and coaching feedback. While LLMs offer unprecedented scalability and speed, their reliance on web-scale training data raises concerns regarding demographic and behavioral biases that may distort workforce assessment. We present a counterfactual fairness evaluation of LLM-based QA systems across 13 dimensions spanning three categories: Identity, Context, and Behavioral Style. Fairness is quantified using the Counterfactual Flip Rate (CFR), the frequency of binary judgment reversals, and the Mean Absolute Score Difference (MASD), the average shift in coaching or confidence scores across counterfactual pairs. Evaluating 18 LLMs on 3,000 real-world contact center transcripts, we find systematic disparities, with CFR ranging from 5.4% to 13.0% and…
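The two metrics named in the abstract can be illustrated with a minimal Python sketch. It assumes CFR is the fraction of counterfactual pairs whose binary judgment differs and MASD is the mean of the absolute score differences over those pairs; the excerpt above does not give the paper's exact formulas, so this reading is an assumption. The class name `CounterfactualPair`, both function names, and the sample values are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CounterfactualPair:
    """Hypothetical record: one transcript evaluated before and after a counterfactual edit."""
    original_judgment: bool        # binary QA judgment on the original transcript
    counterfactual_judgment: bool  # binary QA judgment on the edited transcript
    original_score: float          # coaching or confidence score on the original
    counterfactual_score: float    # coaching or confidence score on the edited version


def counterfactual_flip_rate(pairs: Sequence[CounterfactualPair]) -> float:
    """Assumed CFR: share of pairs whose binary judgment reverses."""
    if not pairs:
        raise ValueError("at least one counterfactual pair is required")
    flips = sum(p.original_judgment != p.counterfactual_judgment for p in pairs)
    return flips / len(pairs)


def mean_absolute_score_difference(pairs: Sequence[CounterfactualPair]) -> float:
    """Assumed MASD: average absolute score shift across pairs."""
    if not pairs:
        raise ValueError("at least one counterfactual pair is required")
    total = sum(abs(p.original_score - p.counterfactual_score) for p in pairs)
    return total / len(pairs)


# Illustrative values only; these are not results from the paper.
example_pairs = [
    CounterfactualPair(True, True, 4.0, 4.0),
    CounterfactualPair(True, False, 3.0, 2.0),
    CounterfactualPair(False, False, 2.5, 3.0),
    CounterfactualPair(True, True, 5.0, 4.5),
]

cfr = counterfactual_flip_rate(example_pairs)
masd = mean_absolute_score_difference(example_pairs)
print(f"CFR = {cfr:.2%}, MASD = {masd:.2f}")
```

In this toy input one of four judgments reverses, so the assumed CFR is 25%, and the score shifts of 0, 1, 0.5, and 0.5 average to an assumed MASD of 0.5. A perfectly counterfactually fair evaluator would score 0 on both.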

Taxonomy

Topics: Mobile Crowdsensing and Crowdsourcing · Ethics and Social Impacts of AI · AI and HR Technologies