Challenging Language-Dependent Segmentation for Arabic: An Application to Machine Translation and Part-of-Speech Tagging

Hassan Sajjad; Fahim Dalvi; Nadir Durrani; Ahmed Abdelali; Yonatan Belinkov; Stephan Vogel

arXiv:1709.00616·cs.CL·September 5, 2017

TL;DR

This paper investigates language-independent alternatives to morphological segmentation for Arabic NLP. It shows that data-driven sub-word units, character-level units, and word embeddings learned with a character CNN achieve competitive results in machine translation and POS tagging, reducing reliance on off-the-shelf segmenters that are complicated to use and domain- or dialect-dependent.

Contribution

The study evaluates three language-independent alternatives to morphological segmentation for Arabic, showing that they come close to, and occasionally surpass, state-of-the-art performance on machine translation and POS tagging.

Findings

01

Neural machine translation is sensitive to source-target token ratio.

02

Language-independent methods achieve near state-of-the-art performance.

03

A source-target token ratio close to 1 or greater yields optimal translation results.

Abstract

Word segmentation plays a pivotal role in improving any Arabic NLP application, and considerable research effort has therefore gone into improving its accuracy. Off-the-shelf tools, however, are: i) complicated to use and ii) domain/dialect dependent. We explore three language-independent alternatives to morphological segmentation using: i) data-driven sub-word units, ii) characters as a unit of learning, and iii) word embeddings learned using a character CNN (Convolutional Neural Network). On the tasks of Machine Translation and POS tagging, we found these methods to achieve close to, and occasionally surpass, state-of-the-art performance. In our analysis, we show that a neural machine translation system is sensitive to the ratio of source and target tokens, and that a ratio close to 1 or greater gives optimal performance.
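
To make the token-ratio observation concrete, the following is a minimal sketch of how different segmentations change the number of source tokens relative to the target. It is not the authors' code. Byte-pair encoding is used here as an assumed stand-in for "data-driven sub-word units", all function names are hypothetical, and the transliterated word wsyktbwnhA ("and they will write it") is an illustrative toy example that does not come from the paper's data.

```python
from collections import Counter


def char_segment(sentence):
    """Split every word into characters, with '_' marking word boundaries."""
    tokens = []
    for word in sentence.split():
        if tokens:
            tokens.append("_")
        tokens.extend(word)
    return tokens


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` BPE-style merges from whitespace-tokenised lines."""
    vocab = Counter(tuple(word) for line in corpus for word in line.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def bpe_segment(sentence, merges):
    """Segment a sentence by applying the learned merges in order to each word."""
    tokens = []
    for word in sentence.split():
        symbols = tuple(word)
        for pair in merges:
            symbols = merge_pair(symbols, pair)
        tokens.extend(symbols)
    return tokens


def token_ratio(src_tokens, tgt_tokens):
    """Source-to-target token ratio for one sentence pair."""
    return len(src_tokens) / len(tgt_tokens)


if __name__ == "__main__":
    src = "wsyktbwnhA"                 # toy transliterated Arabic word (assumption)
    tgt = "and they will write it".split()
    print(token_ratio(src.split(), tgt))       # unsegmented words: 0.2
    print(token_ratio(char_segment(src), tgt))  # characters: 2.0
```

Unsegmented, the single Arabic word maps to five English tokens, so the ratio is far below 1. Character segmentation pushes it above 1, and a sub-word scheme such as `bpe_segment` lands in between, with the number of merges controlling how close the ratio gets to 1.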
