TL;DR
This paper provides a comprehensive analysis of linguistic taboos and euphemisms in Nepali, including a new dataset of offensive terms, to aid in offensive language detection and language learning.
Contribution
It introduces a detailed corpus-based study of Nepali offensive language, categorizes taboo words, and presents a manually curated dataset of over 1000 offensive terms.
Findings
Identified and described more than 18 categories of linguistic offenses, including politics, religion, race, and sex
Discussed 12 common types of euphemism, such as synonym, metaphor, and circumlocution
Created a dataset of 1000+ offensive terms
Abstract
Languages across the world have words, phrases, and behaviors -- the taboos -- that are avoided in public communication considering them as obscene or disturbing to the social, religious, and ethical values of society. However, people deliberately use these linguistic taboos and other language constructs to make hurtful, derogatory, and obscene comments. It is nearly impossible to construct a universal set of offensive or taboo terms because offensiveness is determined entirely by different factors such as socio-physical setting, speaker-listener relationship, and word choices. In this paper, we present a detailed corpus-based study of offensive language in Nepali. We identify and describe more than 18 different categories of linguistic offenses including politics, religion, race, and sex. We discuss 12 common euphemisms such as synonym, metaphor and circumlocution. In addition, we…
