Arabic Language Text Classification Using Dependency Syntax-Based Feature Selection

Yannis Haralambous; Yassir Elidrissi; Philippe Lenca

arXiv:1410.4863·cs.CL·October 21, 2014·19 cites

Open Access

TL;DR

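This paper evaluates Arabic text classification, comparing feature selection methods (tfidf vs. dependency syntax) and classifiers (class association rules vs. support vector machines). Lightly stemmed text outperforms rootified text, and the better-suited classifier depends on the feature set: class association rules for small, syntax-constrained sets and support vector machines for large, morphologically selected ones.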

Contribution

It presents a comparative analysis of feature selection methods (tfidf vs. dependency syntax) and classifiers (class association rules vs. support vector machines) for Arabic text, identifying which combinations are best suited to different feature set sizes.

Findings

01

Lightly stemmed text leads to better classification performance than rootified text.

02

Class association rules are better suited to small feature sets obtained through dependency syntax constraints.

03

Support vector machines are better suited to large feature sets selected by morphological criteria.

Abstract

We study the performance of Arabic text classification combining various techniques: (a) tfidf vs. dependency syntax, for feature selection and weighting; (b) class association rules vs. support vector machines, for classification. The Arabic text is used in two forms: rootified and lightly stemmed. The results we obtain show that lightly stemmed text leads to better performance than rootified text; that class association rules are better suited for small feature sets obtained by dependency syntax constraints; and, finally, that support vector machines are better suited for large feature sets based on morphological feature selection criteria.
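
To make the tfidf side of comparison (a) concrete, the following is a minimal sketch of tf-idf weighting followed by top-k feature selection. The abstract does not state which tf-idf variant or selection threshold the authors use, so everything here is an assumption for illustration: the function names `tfidf_weights` and `select_top_k`, the relative-frequency tf, the unsmoothed logarithmic idf, the summed-weight ranking, and the toy English tokens standing in for stemmed Arabic terms.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: tf-idf weight} dict per tokenized document.

    Assumed variant: tf = relative term frequency in the document,
    idf = log(N / df), with no smoothing.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


def select_top_k(weights, k):
    """Keep the k terms with the highest tf-idf weight summed over all documents.

    Ties are broken alphabetically so the result is deterministic.
    """
    totals = Counter()
    for doc_weights in weights:
        totals.update(doc_weights)  # adds the float weights per term
    ranked = sorted(totals, key=lambda term: (-totals[term], term))
    return ranked[:k]


# Hypothetical toy corpus; real input would be stemmed Arabic tokens.
docs = [
    ["match", "team", "goal", "team"],
    ["bank", "market", "price"],
    ["team", "market", "goal"],
]
weights = tfidf_weights(docs)
selected = select_top_k(weights, 3)
```

In this sketch, `k` plays the role of the feature set size. The abstract reports that the better-suited classifier changes with that size, so a comparable experiment would vary `k` and train each classifier on the selected terms.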

Taxonomy

Topics: Text and Document Classification Technologies · Advanced Text Analysis Techniques · Spam and Phishing Detection