Explanation Fairness in Large Language Models: An Empirical Analysis of Disparities in How LLMs Justify Decisions Across Demographic Groups
Gautam Veldanda

TL;DR
This paper introduces a framework and metrics for empirically analyzing disparities in how large language models justify their decisions across demographic groups.
Contribution
It proposes the Explanation Fairness Taxonomy (EFT), introduces novel metrics, and provides empirical evidence of disparities across multiple models and domains.
Findings
All EFT metrics show significant disparities across models and domains.
Model choice significantly affects disparity magnitude, e.g., Qwen3 exhibits larger verbosity disparities.
Prompting-based mitigations reduce explanation faithfulness disparity but not stylistic disparities.
Abstract
Large language models (LLMs) are increasingly deployed not only to make decisions but to explain them. While AI decision fairness has been studied extensively, the fairness of AI explanations (whether LLMs justify decisions with equal quality, depth, tone, and linguistic sophistication across demographic groups) has received little attention. This paper introduces the Explanation Fairness Taxonomy (EFT), a framework comprising five formally defined, operationalizable dimensions: Verbosity Disparity, Sentiment Disparity, Epistemic Hedging Disparity, Decision-Linked Explanation Disparity, and Lexical Complexity Disparity. The taxonomy is instantiated in a controlled empirical study across 80 prompt templates, four consequential decision domains (hiring, medical triage, credit assessment, legal judgment), and five LLMs: GPT-4.1, Claude Sonnet, LLaMA 3.3 70B, GPT-OSS 120B, and Qwen3 32B.…
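To make the idea of an operationalizable dimension concrete, here is a minimal sketch of how a verbosity gap might be measured. The abstract as shown does not give the paper's formal definition of Verbosity Disparity, so everything here is an assumption made for illustration: the function name `verbosity_disparity`, the whitespace word count, the max-minus-min aggregation, and the sample data.

```python
from statistics import mean


def verbosity_disparity(explanations_by_group):
    """Illustrative only: gap between the largest and smallest mean word count.

    This is NOT the paper's definition; it is a hypothetical stand-in that
    treats verbosity as the whitespace-delimited word count of an explanation.
    """
    mean_lengths = {
        group: mean(len(text.split()) for text in texts)
        for group, texts in explanations_by_group.items()
    }
    gap = max(mean_lengths.values()) - min(mean_lengths.values())
    return gap, mean_lengths


# Hypothetical explanations for the same decisions, varied only by group.
sample = {
    "group_a": [
        "Rejected due to limited experience.",
        "Approved based on strong record.",
    ],
    "group_b": [
        "Rejected.",
        "Approved after review.",
    ],
}

gap, means = verbosity_disparity(sample)
print(means)  # per-group mean word counts
print(gap)    # difference between the most and least verbose groups
```

A real study would likely need a proper tokenizer, many samples per template, and a significance test. The sketch only shows the shape of a per-group comparison.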
