Beyond Training for Cultural Awareness: The Role of Dataset Linguistic Structure in Large Language Models
Reem I. Masoud, Chen Feng, Shunta Asano, Saied Alshahrani, Philip Colin Treleaven, Miguel R. D. Rodrigues

TL;DR
This study investigates how linguistic properties of fine-tuning datasets influence cultural alignment in large language models, revealing that lexical features are most robust for improving cultural performance across models.
Contribution
The paper introduces a dataset-centric analysis that links linguistic properties of fine-tuning datasets to cultural performance, and it highlights lexical features as the most robust across models.
Findings
Lexical diversity (PC3) improves cultural performance consistently across models.
Semantic and diversity features (PC1-PC2) have mixed or negative effects.
Linguistic properties of the fine-tuning data predict cultural alignment, but the effects vary by model.
Abstract
The global deployment of large language models (LLMs) has raised concerns about cultural misalignment, yet the linguistic properties of fine-tuning datasets used for cultural adaptation remain poorly understood. We adopt a dataset-centric view of cultural alignment and ask which linguistic properties of fine-tuning data are associated with cultural performance, whether these properties are predictive prior to training, and how these effects vary across models. We compute lightweight linguistic, semantic, and structural metrics for Arabic, Chinese, and Japanese datasets and apply principal component analysis separately within each language. This design ensures that the resulting components capture variation among datasets written in the same language rather than differences between languages. The resulting components correspond to broadly interpretable axes related to semantic coherence,…
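The within-language PCA step described in the abstract can be illustrated with a minimal sketch. This is not the authors' code. It assumes each language has a small matrix of per-dataset metrics (rows are datasets, columns are metrics), standardizes the metrics inside each language only, and fits PCA via SVD. The function name `pca_within_language`, the choice of three components, and the random toy data are all hypothetical placeholders.

```python
import numpy as np


def pca_within_language(metrics_by_language, n_components=3):
    """Fit a separate PCA per language (hypothetical helper, not from the paper).

    metrics_by_language maps a language code to an array of shape
    (n_datasets, n_metrics).
    """
    results = {}
    for lang, X in metrics_by_language.items():
        X = np.asarray(X, dtype=float)
        # Standardize each metric using this language's datasets only, so
        # differences between languages never enter the components.
        mu = X.mean(axis=0)
        sigma = X.std(axis=0, ddof=1)
        sigma[sigma == 0] = 1.0  # guard against constant metrics
        Z = (X - mu) / sigma
        # PCA via SVD: rows of Vt are the principal axes (loadings).
        U, S, Vt = np.linalg.svd(Z, full_matrices=False)
        k = min(n_components, len(S))
        explained = (S ** 2) / np.sum(S ** 2)
        results[lang] = {
            "components": Vt[:k],              # shape (k, n_metrics)
            "scores": U[:, :k] * S[:k],        # shape (n_datasets, k)
            "explained_variance_ratio": explained[:k],
        }
    return results


# Synthetic placeholder data: 8 datasets x 5 metrics per language.
rng = np.random.default_rng(0)
toy = {lang: rng.normal(size=(8, 5)) for lang in ("ar", "zh", "ja")}
out = pca_within_language(toy)
```

Because each language is fitted independently, component indices (PC1, PC2, PC3) are only comparable across languages to the extent that their loadings turn out to be similar; the sketch makes no attempt to align them.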
Taxonomy
Topics: Language and cultural evolution · Computational and Text Analysis Methods · Cultural Differences and Values
