EnCBP: A New Benchmark Dataset for Finer-Grained Cultural Background Prediction in English
Weicheng Ma, Samiha Datta, Lili Wang, Soroush Vosoughi

TL;DR
This paper introduces EnCBP, a finer-grained, news-based English dataset for cultural background prediction. It finds noticeable linguistic differences among five English-speaking countries and across four US states, and reports that adding cultural features improves NLP model performance on several tasks.
Contribution
The paper presents EnCBP, a novel fine-grained cultural background dataset for English, and demonstrates its effectiveness in enhancing NLP models across multiple tasks.
Findings
Noticeable differences in linguistic expression among five English-speaking countries and across four US states.
Cultural features improve NLP performance on several tasks.
Cultural information brings limited benefit to domain-specific emotion detection.
Abstract
While cultural backgrounds have been shown to affect linguistic expressions, existing natural language processing (NLP) research on culture modeling is overly coarse-grained and does not examine cultural differences among speakers of the same language. To address this problem and augment NLP models with cultural background features, we collect, annotate, manually validate, and benchmark EnCBP, a finer-grained news-based cultural background prediction dataset in English. Through language modeling (LM) evaluations and manual analyses, we confirm that there are noticeable differences in linguistic expressions among five English-speaking countries and across four states in the US. Additionally, our evaluations on nine syntactic (CoNLL-2003), semantic (PAWS-Wiki, QNLI, STS-B, and RTE), and psycholinguistic tasks (SST-5, SST-2, Emotion, and Go-Emotions) show that, while introducing cultural…
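The abstract mentions language modeling (LM) evaluations as evidence that writing differs across cultural groups. One common way to run such a check is to train a language model per group and compare perplexity on held-out text from every group: if a model scores its own group's text noticeably better than the others', the groups differ in expression. The sketch below illustrates the idea only. It swaps in an add-one-smoothed unigram model, which is an assumption and not the models used in the paper, and the group names "A" and "B" and all sentences are invented toy data, not EnCBP samples.

```python
import math
from collections import Counter


def train_unigram(texts):
    """Count lowercase whitespace tokens over a list of training texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def perplexity(counts, vocab_size, texts):
    """Perplexity of an add-one-smoothed unigram model on held-out texts."""
    total = sum(counts.values())
    log_prob, n_tokens = 0.0, 0
    for text in texts:
        for tok in text.lower().split():
            # Add-one smoothing keeps unseen tokens at non-zero probability.
            p = (counts[tok] + 1) / (total + vocab_size)
            log_prob += math.log(p)
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)


# Invented toy corpora standing in for two cultural groups (hypothetical data).
train = {
    "A": ["the colour of the lorry", "the lorry is in the car park"],
    "B": ["the color of the truck", "the truck is in the parking lot"],
}
held_out = {
    "A": ["the colour of the car park"],
    "B": ["the color of the parking lot"],
}

# Shared vocabulary over all text, so every model uses the same event space.
vocab = {
    tok
    for split in (train, held_out)
    for texts in split.values()
    for text in texts
    for tok in text.lower().split()
}

models = {name: train_unigram(texts) for name, texts in train.items()}

# matrix[(train_group, test_group)] holds the cross-group perplexity.
matrix = {
    (tr, te): perplexity(models[tr], len(vocab), held_out[te])
    for tr in models
    for te in held_out
}

for (tr, te), ppl in sorted(matrix.items()):
    print(f"train={tr} test={te} perplexity={ppl:.2f}")
```

On this toy data each model assigns lower perplexity to its own group's held-out sentence than to the other group's. With real data, the size of that gap is the signal of interest, and it should be read alongside controls for topic and source.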
Taxonomy
Topics: Interpreting and Communication in Healthcare · Natural Language Processing Techniques · Linguistics, Language Diversity, and Identity
