Social Policy of Large Language Models: How GPT, Claude, DeepSeek and Grok Allocate Social Budgets in Spain and Germany
Claudia Benavides Cantos, Eduardo C. Garrido-Merchán

TL;DR
This study analyzes how four large language models allocate social budgets in Spain and Germany, revealing systematic biases and differences in sensitivity to national contexts, with implications for their use in public policy support.
Contribution
It provides a comparative analysis of LLMs' social policy allocations across countries, highlighting biases, model differences, and the potential for supporting public budgeting decisions.
Findings
All models under-allocate pensions by a factor close to three relative to the approximate OECD reference budgets.
Housing and employment are over-allocated, by factors of about four and two respectively, against the same reference.
Claude shows substantive sensitivity to national context.
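The over- and under-allocation factors in the findings above can be read as ratios between a model's share for a macro-area and the reference share. The sketch below is only an illustration of that arithmetic: the function name `allocation_factor` and every number in it are hypothetical, chosen to reproduce factors of roughly one third, four and two, and are not values from the paper.

```python
def allocation_factor(model_share: float, reference_share: float) -> float:
    """Ratio of a model's share to the reference share.

    A value above 1 indicates over-allocation, below 1 under-allocation.
    """
    if reference_share <= 0:
        raise ValueError("reference share must be positive")
    return model_share / reference_share


# Hypothetical shares in percent of the total social budget.
# These are made-up illustrative figures, NOT data from the paper.
reference = {"pensions": 42.0, "housing": 2.0, "employment": 5.0}
model = {"pensions": 14.0, "housing": 8.0, "employment": 10.0}

factors = {
    area: allocation_factor(model[area], reference[area])
    for area in reference
}

for area, factor in factors.items():
    label = "over" if factor > 1 else "under"
    print(f"{area}: factor {factor:.2f} ({label}-allocated)")
```

Under this reading, "under-allocated by a factor close to three" corresponds to a ratio near one third.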
Abstract
We study how four widely used large language models, namely Claude, GPT-4o, DeepSeek and Grok, distribute a fixed national social budget across twelve macro-areas of public expenditure under two European national contexts, Spain and Germany. Each combination of model and country is queried six times under identical prompts and generation parameters, producing forty-eight independent allocations that are compared against approximate Organisation for Economic Co-operation and Development (OECD) reference budgets and against each other. We formalise five hypotheses regarding geopolitical bias, housing under-allocation, structural convergence, sensitivity to national context, and under-representation of politically sensitive categories. The differences between models are then validated through Kruskal-Wallis tests on each macro-area, with post-hoc Mann-Whitney U comparisons under Bonferroni…
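The design described in the abstract gives 4 models × 2 countries × 6 runs = 48 allocations, and the between-model test on one macro-area can be sketched as follows. This is a minimal standard-library illustration, not the authors' code: the helper names, the sample shares, the 0.05 significance level, and the choice to correct only across the six model pairs are all assumptions, and the H statistic is computed without the tie correction that statistical packages typically apply.

```python
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic, without tie correction."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


# Hypothetical housing shares (percent), six runs per model.
# Made-up values for illustration only, NOT data from the paper.
housing = {
    "Claude": [8.0, 8.5, 7.5, 9.0, 8.2, 8.8],
    "GPT-4o": [6.0, 6.5, 6.2, 5.8, 6.1, 6.4],
    "DeepSeek": [7.0, 7.2, 6.9, 7.4, 7.1, 7.3],
    "Grok": [5.0, 5.5, 5.2, 4.8, 5.1, 5.4],
}

h_stat = kruskal_wallis_h(list(housing.values()))

# Post-hoc step: one Mann-Whitney U comparison per pair of models,
# each judged against a Bonferroni-corrected threshold.
pairs = list(combinations(housing, 2))
assumed_alpha = 0.05  # assumption: the source does not state the level
bonferroni_alpha = assumed_alpha / len(pairs)

print(f"H = {h_stat:.3f}, pairs = {len(pairs)}, "
      f"per-comparison alpha = {bonferroni_alpha:.4f}")
```

With four models there are six pairwise comparisons per macro-area, so the per-comparison threshold shrinks accordingly; correcting across the twelve macro-areas as well would shrink it further.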
