Exploring Cultural Variations in Moral Judgments with Large Language Models
Hadi Mohammadi and Ayoub Bagheri

TL;DR
This study evaluates how well large language models reflect diverse moral values across cultures, finding that instruction tuning and scale improve alignment with human moral judgments, especially in Western regions.
Contribution
It systematically compares various LLMs against global survey data to assess their ability to mirror culturally diverse moral attitudes, highlighting the impact of instruction tuning and size.
Findings
Advanced instruction-tuned models show higher correlation with human moral judgments.
Models align more closely with human judgments from Western and educated regions than with those from other regions.
Scaling and instruction tuning improve cultural sensitivity of LLMs.
Abstract
Large Language Models (LLMs) have shown strong performance across many tasks, but their ability to capture culturally diverse moral values remains unclear. In this paper, we examine whether LLMs mirror variations in moral attitudes reported by the World Values Survey (WVS) and the Pew Research Center's Global Attitudes Survey (PEW). We compare smaller monolingual and multilingual models (GPT-2, OPT, BLOOMZ, and Qwen) with recent instruction-tuned models (GPT-4o, GPT-4o-mini, Gemma-2-9b-it, and Llama-3.3-70B-Instruct). Using log-probability-based moral justifiability scores, we correlate each model's outputs with survey data covering a broad set of ethical topics. Our results show that many earlier or smaller models often produce near-zero or negative correlations with human judgments. In contrast, advanced instruction-tuned models achieve substantially higher positive…
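The evaluation idea in the abstract (a log-probability-based justifiability score per topic, correlated with survey values) can be illustrated with a minimal sketch. The exact scoring formula is not given in the excerpt above, so the score here is assumed to be a simple difference of two log-probabilities; the function names, topic labels, and all numbers are hypothetical and serve only to show the shape of the computation.

```python
import math


def justifiability_score(logp_justifiable: float, logp_unjustifiable: float) -> float:
    """Assumed score: log-probability difference between two completions.

    A positive value means the model assigns more probability to the
    'justifiable' completion than to the 'unjustifiable' one.
    """
    return logp_justifiable - logp_unjustifiable


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical log-probabilities per topic: (justifiable, unjustifiable).
# In practice these would come from a language model; none are real results.
model_logps = {
    "topic_a": (-1.2, -2.5),
    "topic_b": (-3.0, -1.1),
    "topic_c": (-2.0, -2.1),
}

# Hypothetical survey scores for one country, rescaled to [-1, 1].
survey_scores = {"topic_a": 0.6, "topic_b": -0.7, "topic_c": 0.1}

topics = sorted(model_logps)
model_scores = [justifiability_score(*model_logps[t]) for t in topics]
human_scores = [survey_scores[t] for t in topics]

r = pearson(model_scores, human_scores)
print(f"Pearson r between model and survey scores: {r:.3f}")
```

A correlation near zero or below zero would correspond to the behaviour the abstract reports for earlier or smaller models, while a clearly positive value would correspond to the better-aligned instruction-tuned models.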
