MaskLID: Code-Switching Language Identification through Iterative Masking

Amir Hossein Kargaran; François Yvon; Hinrich Schütze

arXiv:2406.06263 · cs.CL · June 11, 2024

TL;DR

MaskLID is a training-free method for code-switching language identification that iteratively masks dominant language features to accurately identify multiple languages within a single sentence.

Contribution

It introduces a novel, training-free iterative masking approach to improve code-switching language identification, complementing existing sentence-level classifiers.

Findings

1. Effective in identifying multiple languages in code-switched texts
2. Does not require additional training data or external resources
3. Applicable to existing language identification models like GlotLID and OpenLID

Abstract

We present MaskLID, a simple, yet effective, code-switching (CS) language identification (LID) method. MaskLID does not require any training and is designed to complement current high-performance sentence-level LIDs. Sentence-level LIDs are classifiers trained on monolingual texts to provide single labels, typically using a softmax layer to turn scores into probabilities. However, in cases where a sentence is composed in both L1 and L2 languages, the LID classifier often only returns the dominant label L1. To address this limitation, MaskLID employs a strategy to mask text features associated with L1, allowing the LID to classify the text as L2 in the next round. This method uses the LID itself to identify the features that require masking and does not rely on any external resource. In this work, we explore the use of MaskLID for two open-source LIDs (GlotLID and OpenLID), that are both…
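The masking loop described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `TOY_WEIGHTS` is a hypothetical word-weight table standing in for a real LID such as GlotLID or OpenLID, and the masking rule in `votes_for` (drop every word whose highest-weighted language is the current label) as well as the `max_rounds` and `min_words` parameters are assumptions made for this example.

```python
import math

# Hypothetical word-to-language weights standing in for a real LID's learned features.
TOY_WEIGHTS = {
    "the": {"eng": 2.0},
    "weather": {"eng": 2.0},
    "is": {"eng": 1.5},
    "nice": {"eng": 1.0},
    "aujourd'hui": {"fra": 2.5},
    "mon": {"fra": 1.5},
    "ami": {"fra": 2.0},
}


def softmax(scores):
    """Turn raw per-language scores into probabilities."""
    peak = max(scores.values())
    exps = {lang: math.exp(s - peak) for lang, s in scores.items()}
    total = sum(exps.values())
    return {lang: e / total for lang, e in exps.items()}


def classify(words):
    """Toy sentence-level LID: sum word weights per language, then apply softmax."""
    scores = {}
    for word in words:
        for lang, weight in TOY_WEIGHTS.get(word, {}).items():
            scores[lang] = scores.get(lang, 0.0) + weight
    return softmax(scores) if scores else {}


def votes_for(word, lang):
    """Assumed masking rule: a word counts as a feature of `lang`
    if `lang` is its highest-weighted language in the toy table."""
    weights = TOY_WEIGHTS.get(word)
    return bool(weights) and max(weights, key=weights.get) == lang


def mask_lid(text, max_rounds=3, min_words=2):
    """Repeatedly classify, record the top label, then mask that label's words."""
    words = text.lower().split()
    found = []
    for _ in range(max_rounds):
        if len(words) < min_words:
            break  # too little text left to classify
        probs = classify(words)
        if not probs:
            break  # no remaining word is known to the toy LID
        top = max(probs, key=probs.get)
        if top in found:
            break  # masking did not surface a new language
        found.append(top)
        words = [w for w in words if not votes_for(w, top)]
    return found
```

With this toy table, `mask_lid("the weather is nice aujourd'hui mon ami")` returns `["eng", "fra"]`: the first round labels the sentence as English, and once the English words are masked the second round labels the remainder as French. A monolingual input such as `"the weather is nice"` yields only `["eng"]`, because nothing is left to classify after the first round of masking.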

Code & Models

Repositories

cisnlp/masklid (official)

Datasets

rmihaylov/Bg-Instructions
dataset · 8 dl

Videos

MaskLID: Code-Switching Language Identification through Iterative Masking

Taxonomy

Topics: Authorship Attribution and Profiling · Natural Language Processing Techniques · Hate Speech and Cyberbullying Detection

Methods: Softmax · fastText