MaskLID: Code-Switching Language Identification through Iterative Masking
Amir Hossein Kargaran, François Yvon, Hinrich Schütze

TL;DR
MaskLID is a training-free method for code-switching language identification that iteratively masks dominant language features to accurately identify multiple languages within a single sentence.
Contribution
It introduces a novel, training-free iterative masking approach to improve code-switching language identification, complementing existing sentence-level classifiers.
Findings
Effective in identifying multiple languages in code-switched texts
Does not require additional training data or external resources
Applicable to existing language identification models like GlotLID and OpenLID
Abstract
We present MaskLID, a simple, yet effective, code-switching (CS) language identification (LID) method. MaskLID does not require any training and is designed to complement current high-performance sentence-level LIDs. Sentence-level LIDs are classifiers trained on monolingual texts to provide single labels, typically using a softmax layer to turn scores into probabilities. However, in cases where a sentence is composed in both L1 and L2 languages, the LID classifier often only returns the dominant label L1. To address this limitation, MaskLID employs a strategy to mask text features associated with L1, allowing the LID to classify the text as L2 in the next round. This method uses the LID itself to identify the features that require masking and does not rely on any external resource. In this work, we explore the use of MaskLID for two open-source LIDs (GlotLID and OpenLID), that are both…
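The masking loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the word-level score table `TOY_WEIGHTS`, the two-language label set, and the helpers `classify`, `supports`, and `mask_lid` are all hypothetical stand-ins for a real sentence-level LID such as GlotLID or OpenLID. The sketch only shows the general idea: classify, mask the words that support the predicted label, then classify what remains.

```python
import math

# Hypothetical label set and word-level scores; a real LID would supply these.
LANGS = ("eng", "fra")
TOY_WEIGHTS = {
    "the": {"eng": 2.0},
    "weather": {"eng": 2.0},
    "is": {"eng": 1.5},
    "nice": {"eng": 1.0, "fra": 0.2},
    "mais": {"fra": 1.5},
    "froid": {"fra": 2.0},
    "aujourd'hui": {"fra": 2.0},
}


def word_scores(word):
    """Return a score for every language; unknown words score 0 everywhere."""
    weights = TOY_WEIGHTS.get(word.lower(), {})
    return {lang: weights.get(lang, 0.0) for lang in LANGS}


def classify(words):
    """Toy sentence-level LID: sum word scores, apply softmax, pick the top label."""
    totals = {lang: 0.0 for lang in LANGS}
    for word in words:
        for lang, score in word_scores(word).items():
            totals[lang] += score
    peak = max(totals.values())  # subtract the max for numerical stability
    exps = {lang: math.exp(total - peak) for lang, total in totals.items()}
    norm = sum(exps.values())
    probs = {lang: value / norm for lang, value in exps.items()}
    return max(probs, key=probs.get), probs


def supports(word, label):
    """A word supports a label if that label is its strongest non-zero score."""
    scores = word_scores(word)
    return scores[label] > 0.0 and scores[label] == max(scores.values())


def mask_lid(sentence, max_rounds=2):
    """Iteratively classify, record the label, and mask its supporting words."""
    words = sentence.split()
    found = []
    for _ in range(max_rounds):
        if not words:
            break
        label, _ = classify(words)
        evidence = [w for w in words if supports(w, label)]
        # Stop when the label repeats or nothing in the text backs it up.
        if label in found or not evidence:
            break
        found.append(label)
        words = [w for w in words if not supports(w, label)]
    return found
```

With this toy table, `classify` applied to "the weather is nice mais froid aujourd'hui" returns only the dominant label `eng`, whereas `mask_lid` on the same sentence returns `["eng", "fra"]`. The stopping rules and the cap of two rounds are choices made for the sketch, not details taken from the paper.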
Taxonomy
Topics: Authorship Attribution and Profiling · Natural Language Processing Techniques · Hate Speech and Cyberbullying Detection
Methods: Softmax · fastText
