DOSA: A Dataset of Social Artifacts from Different Indian Geographical Subcultures
Agrima Seth, Sanchit Ahuja, Kalika Bali, Sunayana Sitaram

TL;DR
This paper introduces DOSA, a culturally diverse dataset of social artifacts from Indian subcultures, highlighting the importance of community participation in creating culturally aware benchmarks for language models.
Contribution
It presents the first community-generated dataset of social artifacts from Indian subcultures, using participatory methods to ensure cultural relevance and diversity.
Findings
LLMs show significant variation in understanding artifacts across subcultures
The dataset reveals gaps in LLMs' cultural knowledge
Participatory data collection enhances cultural representation in NLP
Abstract
Generative models are increasingly being used in various applications, such as text generation, commonsense reasoning, and question-answering. To be effective globally, these models must be aware of and account for local socio-cultural contexts, making it necessary to have benchmarks to evaluate the models for their cultural familiarity. Since the training data for LLMs is web-based and the Web is limited in its representation of information, it does not capture knowledge present within communities that are not on the Web. Thus, these models exacerbate the inequities, semantic misalignment, and stereotypes from the Web. There has been a growing call for community-centered participatory research methods in NLP. In this work, we respond to this call by using participatory research methods to introduce DOSA, the first community-generated Dataset of 615…
Taxonomy
Topics: Computational and Text Analysis Methods · Language and cultural evolution
Methods: Attentive Walk-Aggregating Graph Neural Network · ALIGN
