A New Dataset for Natural Language Inference from Code-mixed Conversations
Simran Khanuja, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury

TL;DR
This paper introduces the first dataset for code-mixed Hindi-English natural language inference, using Bollywood movie snippets and crowd-sourced hypotheses, enabling research in multilingual NLI tasks.
Contribution
The paper presents a novel code-mixed NLI dataset together with its annotation protocol, a linguistic analysis, and a baseline evaluation using mBERT.
Findings
The dataset currently contains 400 premises (code-mixed conversation snippets) and 2240 code-mixed hypotheses.
The analysis identifies linguistic phenomena commonly observed in the code-mixed data.
An mBERT baseline is evaluated on the dataset.
Abstract
Natural Language Inference (NLI) is the task of inferring the logical relationship, typically entailment or contradiction, between a premise and hypothesis. Code-mixing is the use of more than one language in the same conversation or utterance, and is prevalent in multilingual communities all over the world. In this paper, we present the first dataset for code-mixed NLI, in which both the premises and hypotheses are in code-mixed Hindi-English. We use data from Hindi movies (Bollywood) as premises, and crowd-source hypotheses from Hindi-English bilinguals. We conduct a pilot annotation study and describe the final annotation protocol based on observations from the pilot. Currently, the data collected consists of 400 premises in the form of code-mixed conversation snippets and 2240 code-mixed hypotheses. We conduct an extensive analysis to infer the linguistic phenomena commonly observed…
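To make the data layout described in the abstract concrete, here is a minimal Python sketch of how one premise-hypothesis pair might be represented. The class names, field names, and the sample conversation are illustrative assumptions, not the authors' schema or actual dataset content; only the two label names and the counts (400 premises, 2240 hypotheses) come from the abstract.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Label(Enum):
    # The abstract names these two relationships as the typical ones.
    ENTAILMENT = "entailment"
    CONTRADICTION = "contradiction"


@dataclass(frozen=True)
class NLIExample:
    # Hypothetical record layout; field names are assumptions.
    premise: Tuple[str, ...]  # conversation snippet, one string per turn
    hypothesis: str           # crowd-sourced code-mixed statement
    label: Label


def hypotheses_per_premise(num_premises: int, num_hypotheses: int) -> float:
    """Average number of hypotheses written per premise."""
    return num_hypotheses / num_premises


# Invented Hindi-English example for illustration only; not taken from the dataset.
example = NLIExample(
    premise=(
        "A: Kal movie dekhne chalein?",
        "B: Sorry yaar, kal mera exam hai.",
    ),
    hypothesis="B ka kal exam hai.",
    label=Label.ENTAILMENT,
)

# Counts reported in the abstract: 400 premises, 2240 hypotheses.
avg = hypotheses_per_premise(400, 2240)
print(f"{avg:.1f} hypotheses per premise on average")
```

With the reported counts, each premise has between five and six hypotheses on average. How they are split across labels is not stated in the text above.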
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
