Challenging Language-Dependent Segmentation for Arabic: An Application to Machine Translation and Part-of-Speech Tagging
Hassan Sajjad, Fahim Dalvi, Nadir Durrani, Ahmed Abdelali, Yonatan Belinkov, Stephan Vogel

TL;DR
This paper investigates language-independent segmentation methods for Arabic NLP, demonstrating that data-driven sub-word units, characters, and word embeddings learned with a character CNN can achieve competitive results in machine translation and POS tagging, reducing reliance on complex, domain- and dialect-dependent tools.
Contribution
The study evaluates three language-independent alternatives to morphological segmentation for Arabic, showing that they come close to, and occasionally surpass, state-of-the-art performance in machine translation and POS tagging.
Findings
Neural machine translation is sensitive to source-target token ratio.
Language-independent methods achieve near state-of-the-art performance.
A source-target token ratio close to 1 or greater yields the best translation results.
Abstract
Word segmentation plays a pivotal role in improving any Arabic NLP application, and a lot of research effort has therefore gone into improving its accuracy. Off-the-shelf tools, however, are i) complicated to use and ii) domain/dialect dependent. We explore three language-independent alternatives to morphological segmentation using i) data-driven sub-word units, ii) characters as a unit of learning, and iii) word embeddings learned using a character CNN (Convolutional Neural Network). On the tasks of Machine Translation and POS tagging, we found these methods to achieve close to, and occasionally surpass, state-of-the-art performance. In our analysis, we show that a neural machine translation system is sensitive to the ratio of source and target tokens, and that a ratio close to 1 or greater gives optimal performance.
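To make the source-target token ratio concrete, the following is a minimal sketch of how it might be measured under two segmentation schemes: whitespace words and characters. The function names (`char_segment`, `token_ratio`), the `_` boundary symbol, and the toy sentence pair (a Buckwalter-style transliteration of one Arabic word with an English gloss) are assumptions made for illustration and are not taken from the paper.

```python
def char_segment(sentence, boundary="_"):
    """Split a sentence into characters, marking word boundaries.

    The boundary symbol is an illustrative choice, not the paper's.
    """
    tokens = []
    for i, word in enumerate(sentence.split()):
        if i > 0:
            tokens.append(boundary)  # keep word boundaries recoverable
        tokens.extend(word)          # one token per character
    return tokens


def token_ratio(source_sentences, target_sentences,
                segment_source=str.split, segment_target=str.split):
    """Return (total source tokens) / (total target tokens) over a corpus."""
    n_source = sum(len(segment_source(s)) for s in source_sentences)
    n_target = sum(len(segment_target(t)) for t in target_sentences)
    if n_target == 0:
        raise ValueError("target side contains no tokens")
    return n_source / n_target


# Toy pair (hypothetical data): one morphologically rich source word
# aligned with a five-word English gloss.
source = ["wsyktbwnhA"]
target = ["and they will write it"]

word_level = token_ratio(source, target)                               # 1 / 5
char_level = token_ratio(source, target, segment_source=char_segment)  # 10 / 5

print(f"word-level ratio: {word_level:.2f}")
print(f"char-level ratio: {char_level:.2f}")
```

With unsegmented words, the single source token faces five target tokens, so the ratio falls well below 1. Splitting the source into characters pushes it above 1. A sub-word scheme would typically land somewhere in between, depending on its vocabulary size.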
