DOSA: A Dataset of Social Artifacts from Different Indian Geographical Subcultures

Agrima Seth; Sanchit Ahuja; Kalika Bali; Sunayana Sitaram

arXiv:2403.14651 · cs.CY · March 25, 2024 · 1 citation

TL;DR

This paper introduces DOSA, a culturally diverse dataset of social artifacts from Indian subcultures, highlighting the importance of community participation in creating culturally aware benchmarks for language models.

Contribution

It presents the first community-generated dataset of social artifacts from Indian subcultures, using participatory methods to ensure cultural relevance and diversity.

Findings

01. LLMs show significant variation in understanding artifacts across subcultures

02. The dataset reveals gaps in LLMs' cultural knowledge

03. Participatory data collection enhances cultural representation in NLP

Abstract

Generative models are increasingly being used in various applications, such as text generation, commonsense reasoning, and question-answering. To be effective globally, these models must be aware of and account for local socio-cultural contexts, making it necessary to have benchmarks to evaluate the models for their cultural familiarity. Since the training data for LLMs is web-based and the Web is limited in its representation of information, it does not capture knowledge present within communities that are not on the Web. Thus, these models exacerbate the inequities, semantic misalignment, and stereotypes from the Web. There has been a growing call for community-centered participatory research methods in NLP. In this work, we respond to this call by using participatory research methods to introduce DOSA, the first community-generated Dataset of 615…

Taxonomy

Topics: Computational and Text Analysis Methods · Language and cultural evolution

Methods: Attentive Walk-Aggregating Graph Neural Network · ALIGN