Can Global XAI Methods Reveal Injected Bias in LLMs? SHAP vs Rule Extraction vs RuleSHAP

Francesco Sovrano

arXiv:2505.11189·cs.AI·September 24, 2025

TL;DR

This paper investigates whether global explainable AI methods can uncover belief-driven biases in large language models by adapting rule extraction techniques to text-based models, demonstrating improved detection of complex biases.

Contribution

It introduces RuleSHAP, a novel rule-extraction method combining global SHAP values with rule induction to better identify non-univariate biases in LLMs.
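To make "global SHAP values" concrete, here is a minimal sketch, not the paper's implementation. It computes exact Shapley values for a tiny toy model and averages their absolute values over a dataset to get one global importance per feature. The toy `model` (a hard-coded conjunctive heuristic), the helper names, and the use of a background set to fill in "missing" features are all assumptions made for illustration.

```python
from itertools import combinations, product
from math import factorial


def model(x):
    # Hypothetical injected heuristic: fires only when features 0 AND 1 are high.
    # Feature 2 is deliberately irrelevant.
    return 1.0 if (x[0] > 0.5 and x[1] > 0.5) else 0.0


def value(x, subset, background):
    # Value of a coalition: keep x's features in `subset`, fill the rest
    # from each background row, and average the model output.
    total = 0.0
    for b in background:
        z = [x[i] if i in subset else b[i] for i in range(len(x))]
        total += model(z)
    return total / len(background)


def shapley(x, background):
    # Exact Shapley values by enumerating all coalitions (fine for few features).
    p = len(x)
    phi = [0.0] * p
    for j in range(p):
        others = [i for i in range(p) if i != j]
        for size in range(p):
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for coalition in combinations(others, size):
                with_j = value(x, set(coalition) | {j}, background)
                without_j = value(x, set(coalition), background)
                phi[j] += weight * (with_j - without_j)
    return phi


def global_shap(samples, background):
    # Global importance: mean absolute Shapley value per feature.
    p = len(samples[0])
    sums = [0.0] * p
    for x in samples:
        for j, v in enumerate(shapley(x, background)):
            sums[j] += abs(v)
    return [s / len(samples) for s in sums]


data = [list(bits) for bits in product([0, 1], repeat=3)]
importance = global_shap(data, data)
print(importance)  # features 0 and 1 share the importance; feature 2 gets none
```

The global scores show that features 0 and 1 matter, but on their own they do not state the conjunction "feature 0 AND feature 1" as a rule, which is the gap a rule-induction step is meant to fill.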

Findings

01

RuleSHAP improves bias detection accuracy by 94% over RuleFit.

02

Global SHAP better approximates conjunctive biases than RuleFit.

03

Hard-coded nonlinear heuristics serve as effective ground truth for bias detection.

Abstract

Large language models (LLMs) can amplify misinformation, undermining societal goals like the UN SDGs. We study three documented drivers of misinformation (valence framing, information overload, and oversimplification) which are often shaped by one's default beliefs. Building on evidence that LLMs encode such defaults (e.g., "joy is positive," "math is complex") and can act as "bags of heuristics," we ask: can general belief-driven heuristics behind misinformative behaviour be recovered from LLMs as clear rules? A key obstacle is that global rule-extraction methods in explainable AI (XAI) are built for numerical inputs/outputs, not text. We address this by eliciting global LLM beliefs and mapping them to numerical scores via statistically reliable abstractions, thereby enabling off-the-shelf global XAI to detect belief-related heuristics in LLMs. To obtain ground truth, we hard-code…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 2

Strengths

1. The paper introduces a rule-extraction framework that combines the advantages of RuleFit and SHAP, improving both bias detection and interpretability.
2. Experiments conducted across multiple LLMs provide new insights into bias formation within LLMs.

Weaknesses

1. As a heuristic algorithm, the method still lacks a reliable theoretical foundation despite leveraging SHAP. The main contributions lie in Step 2 and Step 3 (Lines 216–244): Step 2 uses global SHAP values to guide feature sampling during rule selection, and Step 3 incorporates global SHAP value weighting into the LASSO regression within RuleFit. While this is intuitive, it remains unclear whether more principled or theoretically grounded integration strategies could exist. The authors a
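The two steps the reviewer describes can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's algorithm: the page does not give the exact weighting scheme, so the sketch assumes an adaptive-LASSO-style penalty that is inversely proportional to each rule's global SHAP importance, and it samples candidate features with probability proportional to that importance. All function names, the `eps` constant, and the toy data are hypothetical.

```python
import random


def sample_features(global_shap, k, rng):
    # Step 2 (sketch): draw k candidate features, favoring those with
    # higher (non-negative) global SHAP importance.
    names = list(global_shap)
    weights = [global_shap[name] for name in names]
    return rng.choices(names, weights=weights, k=k)


def soft_threshold(z, t):
    # Proximal operator of the L1 penalty.
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def weighted_lasso(X, y, penalties, lam=0.1, n_iter=100):
    # Step 3 (sketch): coordinate descent for
    #   (1 / 2n) * ||y - X b||^2 + lam * sum_j penalties[j] * |b_j|
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho, norm = 0.0, 0.0
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                norm += X[i][j] ** 2
            rho /= n
            norm /= n
            beta[j] = soft_threshold(rho, lam * penalties[j]) / norm if norm > 0 else 0.0
    return beta


# Toy rule-activation matrix: column 0 is the "true" rule, column 1 is spurious.
X = [[1, 0], [0, 1], [1, 1], [0, 0]]
y = [2.0, 0.0, 2.0, 0.0]
rule_importance = [0.9, 0.1]  # assumed global SHAP importance per rule
eps = 1e-6                    # assumed constant to avoid division by zero
penalties = [1.0 / (s + eps) for s in rule_importance]

beta = weighted_lasso(X, y, penalties)
drawn = sample_features({"a": 1.0, "b": 0.0}, 5, random.Random(0))
print(beta, drawn)
```

With this weighting, rules that global SHAP considers unimportant face a larger penalty and are pruned first. Whether this matches the integration strategy at Lines 216–244 of the paper cannot be confirmed from this page.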

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- The paper presents a genuinely original approach to a critical gap: adapting global XAI methods (designed for tabular/numerical data) to work with LLMs' textual inputs and outputs.
- The integration of SHAP into RuleFit is technically novel. This may be the first model-agnostic rule extraction method to leverage global SHAP for steering both split selection and rule pruning, bridging SHAP's theoretical rigor with RuleFit's interpretability.

Weaknesses

- The LLM is asked to score its own beliefs, then those scores are used to explain its behavior. This is inherently circular: you're using GPT-4o's worldview to explain GPT-4o's outputs. While the correlation certificates provide statistical validation, they don't resolve the epistemological problem: high correlation between "GPT believes X is controversial" and "GPT writes controversially about X" might simply reflect consistent bias, not meaningful explanation.
- Section 3 states that SHAP pert

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- It introduces RuleSHAP, an original hybrid algorithm that combines SHAP’s theoretical grounding in feature attribution with RuleFit’s interpretable rule extraction, enabling interpretable symbolic bias detection, a combination not seen in prior XAI work.
- The paper proposes a statistically grounded belief abstraction framework that transforms textual LLM inputs and outputs into ordered numerical spaces, bridging a known gap between text-based generative models and numeric XAI methods.

Weaknesses

- The belief abstraction layer converts textual behavior into numerical variables. This transformation, while necessary for SHAP, risks discarding contextual and semantic richness, especially when bias manifests subtly (e.g., through metaphor or framing tone).
- The paper adopts MRR@1 as the main quantitative measure for bias detection performance. However, this metric assumes a rank-based relevance formulation that may not directly capture the semantic correctness or interpretability of rules.
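For readers unfamiliar with the metric the reviewer mentions, the sketch below shows one common way to compute mean reciprocal rank with a cutoff k. It is an assumption about the general metric, not the paper's evaluation code. The function name `mrr_at_k` and the rule identifiers are hypothetical. With k = 1 the score reduces to the fraction of cases where the top-ranked extracted rule is a relevant (injected) one.

```python
def mrr_at_k(ranked_lists, relevant_sets, k=1):
    # For each case, take the reciprocal rank of the first relevant item
    # within the top k positions (0 if none appears), then average.
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, item in enumerate(ranked[:k], start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


# Two hypothetical extraction runs, each ranking candidate rules by importance.
ranked = [["r1", "r2"], ["r3", "r1"]]
relevant = [{"r1"}, {"r1"}]

print(mrr_at_k(ranked, relevant, k=1))  # 0.5
print(mrr_at_k(ranked, relevant, k=2))  # 0.75
```

The example illustrates the reviewer's point: the score depends only on where a matching rule sits in the ranking, not on how readable or semantically faithful that rule is.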

Code & Models

Repositories

francesco-sovrano/ruleshap · Official

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Misinformation and Its Impacts · Artificial Intelligence in Healthcare and Education

Methods: LLaMA · Shapley Additive Explanations