# MUTANT: A Multi-sentential Code-mixed Hinglish Dataset

**Authors:** Rahul Gupta, Vivek Srivastava, Mayank Singh

arXiv: 2302.11766 · 2023-02-24

## TL;DR

This paper introduces MUTANT, the first large-scale dataset of multi-sentential Hinglish code-mixed text, along with a pipeline for identifying such text in multilingual articles, enabling new research in code-mixed NLP.

## Contribution

The paper presents a novel multi-sentential Hinglish dataset and a token-level language-aware pipeline for identifying code-mixed text in multilingual articles, filling a significant resource gap.

## Key findings

- MUTANT contains 67k articles with 85k identified Hinglish multi-sentential code-mixed texts (MCTs).
- Existing metrics for the degree of code-mixing are extended to a multi-sentential framework.
- A token-level language-aware pipeline automatically identifies Hinglish MCTs in multilingual articles.
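
To make the idea of scoring code-mixing across several sentences concrete, here is a minimal sketch. It assumes the widely used Code-Mixing Index (CMI) as the sentence-level metric and a token-weighted mean as the multi-sentence aggregate; this summary does not state which metrics or which extension the paper actually uses. The function names, the `"other"` tag for language-independent tokens, and the threshold value are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def cmi(tags, neutral="other"):
    """Code-Mixing Index for one sequence of token-level language tags.

    CMI = 100 * (1 - max_lang_count / (n - u)), where n is the number of
    tokens and u the number of language-independent tokens. Returns 0.0
    when no token carries a language tag.
    """
    counts = Counter(t for t in tags if t != neutral)
    lang_tokens = sum(counts.values())  # n - u
    if lang_tokens == 0:
        return 0.0
    return 100.0 * (1 - max(counts.values()) / lang_tokens)


def multi_sentence_cmi(sentences, neutral="other"):
    """Assumed aggregate (not from the paper): mean of per-sentence CMI,
    weighted by sentence length in tokens."""
    total = sum(len(s) for s in sentences)
    if total == 0:
        return 0.0
    return sum(len(s) * cmi(s, neutral) for s in sentences) / total


def is_mct(sentences, threshold=10.0, min_sentences=2):
    """Hypothetical filter: flag a passage as multi-sentential code-mixed
    text when it has enough sentences and a high enough aggregate score."""
    if len(sentences) < min_sentences:
        return False
    return multi_sentence_cmi(sentences) >= threshold


# Toy passage: each inner list holds the language tags of one sentence.
passage = [
    ["hi", "en", "hi", "en"],
    ["en", "hi", "hi", "hi"],
]
print(multi_sentence_cmi(passage))
print(is_mct(passage))
```

The tags would come from a token-level language identifier, which is outside the scope of this sketch. Weighting by sentence length keeps a short, heavily mixed sentence from dominating a long monolingual passage; other aggregates are equally plausible.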

## Abstract

Multi-sentential, long-sequence textual data opens up several interesting research directions in natural language processing and generation. Though several high-quality long-sequence datasets exist for English and other monolingual settings, there has been no significant effort to build such resources for code-mixed languages such as Hinglish (code-mixing of Hindi and English). In this paper, we propose a novel task of identifying multi-sentential code-mixed text (MCT) in multilingual articles. As a use case, we leverage multilingual articles from two different data sources and build a first-of-its-kind multi-sentential code-mixed Hinglish dataset, MUTANT. We propose a token-level language-aware pipeline, extend the existing metrics for measuring the degree of code-mixing to a multi-sentential framework, and automatically identify MCT in the multilingual articles. The MUTANT dataset comprises 67k articles with 85k identified Hinglish MCTs. To facilitate future research, we make the dataset publicly available.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2302.11766/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/2302.11766/full.md

## References

17 references — full list in the complete paper: https://tomesphere.com/paper/2302.11766/full.md

---
Source: https://tomesphere.com/paper/2302.11766