The Geopolitics of AI Safety: A Causal Analysis of Regional LLM Bias
Alif Al Hasan

TL;DR
This paper introduces a causal framework using Probabilistic Graphical Models to analyze regional biases in Large Language Models, revealing disparities between observational and causal bias assessments across diverse models.
Contribution
It presents a novel causal analysis method for LLM bias evaluation, highlighting limitations of traditional fairness metrics and uncovering regional bias trends.
Findings
Western models show higher causal refusal rates for certain demographics
Eastern models have low intervention rates but show region-specific sensitivities
Standard fairness metrics may overestimate demographic bias
Abstract
As Large Language Models (LLMs) are integrated into global software systems, ensuring equitable safety guardrails is a critical requirement. Current fairness evaluations predominantly measure bias observationally, a methodology confounded by the inherent toxicity of topics naturally paired with specific demographics in testing datasets. This study introduces a Probabilistic Graphical Model (PGM) framework to audit LLM safety mechanisms causally. By applying Pearl's do-operator, we mathematically isolate the causal effect of injecting a cultural demographic into a prompt. We conduct a large-scale empirical analysis across seven instruction-tuned models spanning diverse origins: the United States (Llama-3.1-8B, Gemma-2-9B), Europe (Mistral-7B-v0.3), the UAE (Falcon3-7B), China (Qwen2.5-7B, DeepSeek-7B), and India (Airavata-7B). Utilizing two distinct datasets (ToxiGen and BOLD), the…
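The gap between observational and causal bias estimates described in the abstract can be illustrated with a minimal sketch. It assumes a three-variable graph in which topic toxicity influences both which demographic appears in a prompt and whether the model refuses; the paper's actual PGM structure is not given in the text above, so this graph is an assumption. The function names, record fields, and toy counts are all hypothetical and are not taken from the paper, ToxiGen, or BOLD. Under that assumption, the do-operator reduces to the standard backdoor adjustment: P(refuse | do(D = d)) = sum over t of P(refuse | D = d, T = t) P(T = t).

```python
from collections import defaultdict


def make_records(group, toxic, n, n_refused):
    """Build n hypothetical prompt records, the first n_refused of them refused."""
    return [
        {"group": group, "toxic": toxic, "refused": i < n_refused}
        for i in range(n)
    ]


def observational_rate(records, group):
    """P(refuse | D = group): the plain conditional refusal frequency."""
    rows = [r for r in records if r["group"] == group]
    return sum(r["refused"] for r in rows) / len(rows)


def interventional_rate(records, group):
    """P(refuse | do(D = group)) by backdoor adjustment over toxicity.

    Assumes toxicity is the only confounder (an assumption of this sketch).
    """
    tox_counts = defaultdict(int)
    for r in records:
        tox_counts[r["toxic"]] += 1

    total = 0.0
    for t, count in tox_counts.items():
        stratum = [
            r for r in records if r["group"] == group and r["toxic"] == t
        ]
        if not stratum:
            raise ValueError(f"no data for group={group!r}, toxic={t!r}")
        p_refuse = sum(r["refused"] for r in stratum) / len(stratum)
        total += p_refuse * count / len(records)  # weight by P(T = t)
    return total


# Toy data (invented): group "A" is paired mostly with toxic topics, but
# within each toxicity stratum both groups are refused at the same rate.
records = (
    make_records("A", toxic=True, n=8, n_refused=6)
    + make_records("A", toxic=False, n=2, n_refused=0)
    + make_records("B", toxic=True, n=4, n_refused=3)
    + make_records("B", toxic=False, n=6, n_refused=0)
)

obs_gap = observational_rate(records, "A") - observational_rate(records, "B")
causal_gap = interventional_rate(records, "A") - interventional_rate(records, "B")
print(f"observational gap: {obs_gap:.2f}, causal gap: {causal_gap:.2f}")
```

In this toy setup the observational gap is nonzero only because group "A" co-occurs with toxic topics more often; once toxicity is adjusted for, the gap vanishes. This mirrors the kind of overestimation by standard fairness metrics that the findings mention, though real audits would need a richer graph and many more strata.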
