Preparing Bengali-English Code-Mixed Corpus for Sentiment Analysis of Indian Languages
Soumil Mandal, Sainik Kumar Mahata, Dipankar Das

TL;DR
This paper presents a newly created Bengali-English code-mixed corpus with sentiment labels, developed using hybrid annotation systems to facilitate sentiment analysis in multilingual social media data.
Contribution
The paper introduces a gold standard Bengali-English code-mixed corpus with sentiment tags, along with hybrid annotation methods to reduce manual effort and improve annotation quality.
Findings
High inter-annotator agreement achieved
Effective hybrid systems for language and sentiment tagging
Comprehensive analysis of code-mixed properties
Abstract
The informative content and sentiments of social media users have been analyzed quite intensively in the recent past. Most systems are usable only for monolingual data and fail or give poor results when applied to code-mixed data. To draw attention to this problem and encourage researchers to work on it, we prepared gold standard Bengali-English code-mixed data with language and polarity tags for sentiment analysis. In this paper, we discuss the systems we built to collect and filter raw Twitter data. To reduce manual work during annotation, hybrid systems combining rule-based and supervised models were developed for both language and sentiment tagging. The final corpus was annotated by a group of annotators following a few guidelines. The resulting gold standard corpus shows high inter-annotator agreement in terms of Kappa values.…
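The abstract reports agreement in terms of Kappa values but does not say which variant was used or how it was computed. As an illustration only, the sketch below computes Cohen's kappa for two annotators, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The function name, the label strings, and the toy annotations are hypothetical and are not taken from the paper or its corpus.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: exactly two annotators and nominal labels. The paper
    may have used a different kappa variant.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two
    # annotators' marginal proportions, summed over labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy polarity annotations (hypothetical, not from the corpus).
annotator_a = ["pos", "pos", "neg", "neu", "neg", "pos"]
annotator_b = ["pos", "neg", "neg", "neu", "neg", "pos"]
kappa = cohen_kappa(annotator_a, annotator_b)
print(round(kappa, 4))
```

In this toy case the annotators agree on five of six items (p_o = 5/6) and chance agreement is p_e = 13/36, so kappa = 17/23. With more than two annotators, a multi-rater statistic such as Fleiss' kappa would be the usual choice.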
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Text and Document Classification Technologies · Spam and Phishing Detection
