From Human Judgements to Predictive Models: Unravelling Acceptability in Code-Mixed Sentences

Prashant Kodali; Anmol Goel; Likhith Asapu; Vamshi Krishna Bonagiri; Anirudh Govil; Monojit Choudhury; Ponnurangam Kumaraguru; Manish Shrivastava

arXiv:2405.05572·cs.CL·May 6, 2025

TL;DR

This paper introduces Cline, a large dataset of human acceptability judgements for English-Hindi code-mixed sentences, and shows that fine-tuned multilingual models predict the naturalness of such sentences better than traditional code-mixing metrics and baseline models.

Contribution

The paper presents the largest dataset of its kind of human acceptability judgements for code-mixed text and evaluates several multilingual models on it, showing that they predict acceptability better than traditional code-mixing metrics.

Findings

01. Multilingual large language models (MLLMs) outperform traditional code-mixing metrics.

02. Decoder-only models like Llama 3.2 outperform other models.

03. Fine-tuned MLLMs surpass ChatGPT in code-mixed acceptability prediction.

Abstract

Current computational approaches for analysing or generating code-mixed sentences do not explicitly model the "naturalness" or "acceptability" of code-mixed sentences, but rely on training corpora to reflect the distribution of acceptable code-mixed sentences. Modelling human judgement of the acceptability of code-mixed text can help in distinguishing natural code-mixed text and enable quality-controlled generation of code-mixed text. To this end, we construct Cline, a dataset containing human acceptability judgements for English-Hindi (en-hi) code-mixed text. Cline is the largest of its kind, with 16,642 sentences drawn from two sources: synthetically generated code-mixed text and samples collected from online social media. Our analysis establishes that popular code-mixing metrics such as CMI, Number of Switch Points, and Burstiness, which are used to…
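
The abstract names CMI (Code-Mixing Index) and Number of Switch Points among the traditional code-mixing metrics. As a rough illustration only, the sketch below computes both from a list of per-token language tags, following the commonly cited definition of CMI as 100 × (1 − max_i w_i / (n − u)), where n is the total token count, u the number of language-independent tokens, and w_i the token count of language i. The tag names (`hi`, `en`, `univ`) and function names are assumptions made for this example, not the paper's code, and Burstiness is not covered.

```python
from collections import Counter

# Hypothetical tag for language-independent tokens (punctuation, symbols, etc.).
OTHER = "univ"


def code_mixing_index(tags):
    """CMI = 100 * (1 - max_lang_count / (n - u)); 0 if no language-tagged tokens."""
    n = len(tags)
    lang_counts = Counter(t for t in tags if t != OTHER)
    u = n - sum(lang_counts.values())  # number of language-independent tokens
    if n == u:
        return 0.0
    return 100.0 * (1.0 - max(lang_counts.values()) / (n - u))


def switch_points(tags):
    """Count adjacent language changes, skipping language-independent tokens."""
    langs = [t for t in tags if t != OTHER]
    return sum(1 for prev, cur in zip(langs, langs[1:]) if prev != cur)


# Illustrative tag sequence for a five-token sentence (made up for this sketch).
tags = ["hi", "en", "en", "hi", "univ"]
print(code_mixing_index(tags), switch_points(tags))
```

Both functions look only at tag counts and tag order, so a monolingual sentence scores 0 on each; neither encodes any notion of whether a particular switch reads naturally to a speaker.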

Taxonomy

Topics: Natural Language Processing Techniques