From Human Judgements to Predictive Models: Unravelling Acceptability in Code-Mixed Sentences
Prashant Kodali, Anmol Goel, Likhith Asapu, Vamshi Krishna Bonagiri, Anirudh Govil, Monojit Choudhury, Ponnurangam Kumaraguru, Manish Shrivastava

TL;DR
This paper introduces Cline, a large dataset of human acceptability judgments for English-Hindi code-mixed sentences, and demonstrates that fine-tuned multilingual models outperform traditional metrics and baseline models in predicting naturalness.
Contribution
The paper presents the largest dataset of human acceptability judgments for code-mixed text and evaluates various multilingual models, showing their superiority over traditional metrics.
Findings
Multilingual large language models (MLLMs) fine-tuned on Cline outperform traditional code-mixing metrics at predicting acceptability.
Decoder-only models like Llama 3.2 outperform other models.
Fine-tuned MLLMs surpass ChatGPT in code-mixed acceptability prediction.
Abstract
Current computational approaches for analysing or generating code-mixed sentences do not explicitly model the "naturalness" or "acceptability" of code-mixed sentences, but rely on training corpora to reflect the distribution of acceptable code-mixed sentences. Modelling human judgement of the acceptability of code-mixed text can help in distinguishing natural code-mixed text and enable quality-controlled generation of code-mixed text. To this end, we construct Cline, a dataset containing human acceptability judgements for English-Hindi (en-hi) code-mixed text. Cline is the largest of its kind, with 16,642 sentences drawn from two sources: synthetically generated code-mixed text and samples collected from online social media. Our analysis establishes that popular code-mixing metrics such as CMI, Number of Switch Points, and Burstiness, which are used to…
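As a rough illustration of two of the metrics named in the abstract, the sketch below computes a Code-Mixing Index (CMI) and a switch-point count from per-token language tags. It assumes the commonly cited definition of CMI, 100 × (1 − max_i w_i / (n − u)), where w_i is the token count of language i, n is the total token count, and u is the number of language-independent tokens. The function names, the tag labels ("en", "hi", "other"), and the example tags are hypothetical and are not taken from the paper or the Cline dataset.

```python
from collections import Counter


def code_mixing_index(tags, neutral="other"):
    """CMI under the assumed definition: 100 * (1 - max_lang_count / (n - u)).

    `tags` is a list of per-token language labels; tokens tagged `neutral`
    (punctuation, named entities, etc.) are treated as language-independent.
    """
    lang_tags = [t for t in tags if t != neutral]
    if not lang_tags:
        # No language-specific tokens: treat the utterance as not code-mixed.
        return 0.0
    dominant = max(Counter(lang_tags).values())
    return 100.0 * (len(lang_tags) - dominant) / len(lang_tags)


def switch_points(tags, neutral="other"):
    """Count positions where the language changes between adjacent
    language-specific tokens (neutral tokens are skipped)."""
    lang_tags = [t for t in tags if t != neutral]
    return sum(1 for a, b in zip(lang_tags, lang_tags[1:]) if a != b)


# Hypothetical tag sequence for a six-token en-hi sentence.
example_tags = ["en", "en", "hi", "hi", "en", "other"]
example_cmi = code_mixing_index(example_tags)   # 40.0
example_switches = switch_points(example_tags)  # 2
```

Both quantities depend only on the language-tag sequence, not on the words themselves, so two sentences with identical tag patterns receive identical scores regardless of how natural either one sounds.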
Taxonomy
Topics: Natural Language Processing Techniques
