Massively Multi-Cultural Knowledge Acquisition & LM Benchmarking
Yi Fung, Ruining Zhao, Jae Doo, Chenkai Sun, Heng Ji

TL;DR
This paper presents a new dataset and methodology for acquiring and evaluating multicultural knowledge in large language models, aiming to reduce cultural bias and improve cross-cultural understanding.
Contribution
It introduces the CultureAtlas dataset and a novel approach for collecting rich, fine-grained cultural information from Wikipedia to enhance language model cultural awareness.
Findings
Created the CultureAtlas dataset, covering a wide range of sub-country geographical regions and ethnolinguistic groups
Demonstrated improved cultural knowledge in language models using the dataset
Provided a benchmark for evaluating multicultural knowledge in LLMs
Abstract
Pretrained large language models have revolutionized many applications but still face challenges related to cultural bias and a lack of cultural commonsense knowledge crucial for guiding cross-culture communication and interactions. Recognizing the shortcomings of existing methods in capturing the diverse and rich cultures across the world, this paper introduces a novel approach for massively multicultural knowledge acquisition. Specifically, our method strategically navigates from densely informative Wikipedia documents on cultural topics to an extensive network of linked pages. Leveraging this valuable source of data collection, we construct the CultureAtlas dataset, which covers a wide range of sub-country level geographical regions and ethnolinguistic groups, with data cleaning and preprocessing to ensure textual assertion sentence self-containment, as well as fine-grained cultural…
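The navigation step in the abstract, moving from seed Wikipedia documents on cultural topics out to linked pages, can be pictured as a bounded breadth-first traversal of a link graph. The sketch below is only an illustration under stated assumptions, not the authors' pipeline: the function name `collect_linked_pages`, the `max_hops` limit, and the toy link graph are all hypothetical, and the graph is held in memory rather than fetched from Wikipedia.

```python
from collections import deque


def collect_linked_pages(link_graph, seed_pages, max_hops=1):
    """Breadth-first traversal from seed pages, following links up to max_hops.

    link_graph: dict mapping a page title to the titles it links to
                (hypothetical in-memory stand-in for Wikipedia links).
    Returns a dict mapping each reached page to its hop distance from a seed.
    """
    visited = {}
    queue = deque()
    for seed in seed_pages:
        if seed not in visited:
            visited[seed] = 0
            queue.append(seed)

    while queue:
        page = queue.popleft()
        hops = visited[page]
        if hops == max_hops:
            continue  # do not expand pages at the hop limit
        for linked in link_graph.get(page, []):
            if linked not in visited:
                visited[linked] = hops + 1
                queue.append(linked)
    return visited


# Hypothetical toy link graph; the titles are placeholders, not real data.
TOY_LINKS = {
    "Culture of Country A": ["Wedding customs in Region X", "Festival Y"],
    "Wedding customs in Region X": ["Ethnic Group Z"],
    "Festival Y": ["Culture of Country A"],
}

one_hop = collect_linked_pages(TOY_LINKS, ["Culture of Country A"], max_hops=1)
two_hop = collect_linked_pages(TOY_LINKS, ["Culture of Country A"], max_hops=2)
```

Recording the hop distance for each page makes it possible to filter or weight pages by how far they sit from a culturally focused seed document, which is one plausible way to keep a crawl of this kind on topic.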
Taxonomy
Topics: Higher Education Learning Practices
Methods: Attentive Walk-Aggregating Graph Neural Network
