Part of speech tagging for code switched data
Fahad AlGhamdi, Giovanni Molina, Mona Diab, Thamar Solorio, Abdelati Hawwari, Victor Soto, Julia Hirschberg

TL;DR
This paper investigates methods for Part of Speech (POS) tagging of code-switched data, comparing multiple strategies on two language pairs, Spanish-English and Modern Standard Arabic-Arabic dialects, and finds that a machine learning approach combining two POS taggers performs best.
Contribution
It applies a machine learning framework that combines two state-of-the-art POS taggers to code-switched data and reports that this outperforms the other strategies compared.
Findings
Two POS taggers outperform single taggers in CS data
Unified CS-trained tagger shows competitive performance
Machine learning approach yields best results in experiments
Abstract
We address the problem of Part of Speech tagging (POS) in the context of linguistic code switching (CS). CS is the phenomenon where a speaker switches between two languages or variants of the same language within or across utterances, known as intra-sentential or inter-sentential CS, respectively. Processing CS data is especially challenging in intra-sentential data given state of the art monolingual NLP technology since such technology is geared toward the processing of one language at a time. In this paper we explore multiple strategies of applying state of the art POS taggers to CS data. We investigate the landscape in two CS language pairs, Spanish-English and Modern Standard Arabic-Arabic dialects. We compare the use of two POS taggers vs. a unified tagger trained on CS data. Our results show that applying a machine learning framework using two state of the art POS taggers achieves…
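To make the "two POS taggers" idea concrete, here is a minimal sketch of one simple way to combine two monolingual taggers on a code-switched sentence: run both taggers, guess each token's language, and keep the tag from the matching tagger. This is an illustration only, not the authors' framework, which the abstract does not describe in detail. The toy lexicons, the `identify_language` heuristic, and every function name below are hypothetical stand-ins for real taggers and a real language identification component.

```python
from typing import Callable, Dict, List, Tuple

Tagger = Callable[[List[str]], List[str]]


def make_lexicon_tagger(lexicon: Dict[str, str], default: str = "X") -> Tagger:
    """Build a toy tagger that looks each token up in a small lexicon."""
    def tag(tokens: List[str]) -> List[str]:
        return [lexicon.get(tok.lower(), default) for tok in tokens]
    return tag


# Hypothetical toy lexicons standing in for real monolingual POS taggers.
EN_LEXICON = {"i": "PRON", "want": "VERB", "to": "PART", "the": "DET"}
ES_LEXICON = {"comer": "VERB", "en": "ADP", "la": "DET", "casa": "NOUN"}

tag_en = make_lexicon_tagger(EN_LEXICON)
tag_es = make_lexicon_tagger(ES_LEXICON)


def identify_language(token: str) -> str:
    """Toy token-level language ID (assumption): Spanish if in ES_LEXICON, else English."""
    return "es" if token.lower() in ES_LEXICON else "en"


def tag_code_switched(tokens: List[str]) -> List[Tuple[str, str, str]]:
    """Run both taggers, then keep the tag from the tagger matching each token's language."""
    en_tags = tag_en(tokens)
    es_tags = tag_es(tokens)
    result = []
    for tok, en_tag, es_tag in zip(tokens, en_tags, es_tags):
        lang = identify_language(tok)
        result.append((tok, lang, es_tag if lang == "es" else en_tag))
    return result


sentence = "I want to comer en la casa".split()
tagged = tag_code_switched(sentence)
for token, lang, pos in tagged:
    print(f"{token}\t{lang}\t{pos}")
```

A unified tagger, the alternative the abstract compares against, would instead be a single model trained directly on code-switched text, with no per-token routing step.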
