Lessons Without Borders? Evaluating Cultural Alignment of LLMs Using Multilingual Story Moral Generation
Sophie Wu, Andrew Piper

TL;DR
This paper introduces multilingual story moral generation as a task for evaluating cultural alignment in LLMs. Frontier models produce morals that are semantically similar to human responses and preferred by human evaluators, but their outputs vary markedly less across languages than human interpretations do.
Contribution
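It presents a new dataset of human-written story morals spanning 14 language-culture pairs, together with an evaluation framework (semantic similarity, a human preference survey, and value categorization) for assessing how well language models capture cultural and moral diversity across languages.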
Findings
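Frontier models such as GPT-4o and Gemini generate story morals that are semantically similar to human-written morals and are preferred by human evaluators.
Model outputs show markedly less cross-linguistic variation than human responses and concentrate on a narrower set of widely shared values.
Contemporary models approximate the central tendencies of human moral interpretation but do not reproduce its cultural diversity.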
Abstract
Stories are key to transmitting values across cultures, but their interpretation varies across linguistic and cultural contexts. Thus, we introduce multilingual story moral generation as a novel culturally grounded evaluation task. Using a new dataset of human-written story morals collected across 14 language-culture pairs, we compare model outputs with human interpretations via semantic similarity, a human preference survey, and value categorization. We show that frontier models such as GPT-4o and Gemini generate story morals that are semantically similar to human responses and preferred by human evaluators. However, their outputs exhibit markedly less cross-linguistic variation and concentrate on a narrower set of widely shared values. These findings suggest that while contemporary models can approximate central tendencies of human moral interpretation, they struggle to reproduce the…
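As a rough illustration of the "cross-linguistic variation" comparison described in the abstract, the sketch below measures how spread out a set of morals is by averaging pairwise cosine distances. It is not the authors' method: the bag-of-words vectors are a toy stand-in for whatever semantic similarity measure the paper uses, the function names (`bow`, `cosine`, `mean_pairwise_distance`) are hypothetical, the example morals are invented for illustration, and all texts are assumed to have been translated into one language first.

```python
import math
from collections import Counter
from itertools import combinations


def bow(text):
    """Lowercased bag-of-words vector (toy stand-in for a sentence embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_pairwise_distance(texts):
    """Average (1 - cosine similarity) over all pairs; higher means more varied."""
    vectors = [bow(t) for t in texts]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - cosine(a, b) for a, b in pairs) / len(pairs)


# Invented toy data, assumed already translated to English.
human_morals = {
    "en": "honesty is rewarded in the end",
    "fr": "respect your elders and family",
    "zh": "hard work brings good fortune",
}
model_morals = {
    "en": "kindness is always rewarded",
    "fr": "kindness is always rewarded in life",
    "zh": "kindness is rewarded",
}

human_spread = mean_pairwise_distance(list(human_morals.values()))
model_spread = mean_pairwise_distance(list(model_morals.values()))
print(f"human spread: {human_spread:.3f}, model spread: {model_spread:.3f}")
```

In this toy setup, a lower spread for the model outputs than for the human responses would correspond to the pattern the abstract reports; a real analysis would need multilingual embeddings or careful translation rather than word overlap.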
