Document Classification Using Distributed Machine Learning

Galip Aydin; Ibrahim Riza Hallac

arXiv:1802.03597·cs.IR·February 13, 2018

TL;DR

This paper explores the application of distributed machine learning technologies, including Hadoop, Spark, and Mahout, to improve the performance of Naive Bayes classification for Turkish news categorization.

Contribution

It demonstrates how Apache Big Data tools can be integrated with machine learning algorithms for effective document classification in a non-English language.

Findings

1. Naive Bayes achieves high success rates in Turkish news classification.

2. Distributed technologies significantly improve processing efficiency.

3. The approach is scalable for large datasets.

Abstract

In this paper, we investigate the performance and success rates of the Naïve Bayes classification algorithm for automatic classification of Turkish news into predetermined categories such as economy, life, and health. We use Apache Big Data technologies such as Hadoop, HDFS, Spark and Mahout, and apply these distributed technologies to Machine Learning.
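
To make the idea concrete, the following is a minimal sketch of multinomial Naive Bayes in plain Python, not the authors' Hadoop, Spark, or Mahout implementation. It illustrates why the algorithm suits distributed processing: training reduces to counting, and counts from separate data partitions can be merged in any order. The function names, the Laplace smoothing parameter `alpha`, the whitespace tokenization, and the toy headlines are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def count_partition(docs):
    """Count documents and word occurrences per category for one partition."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    for label, text in docs:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return doc_counts, word_counts


def merge_counts(a, b):
    """Merge two partition results. Addition of counts is associative,
    so partitions can be processed independently and combined later."""
    doc_counts = a[0] + b[0]
    word_counts = defaultdict(Counter)
    for part in (a[1], b[1]):
        for label, counts in part.items():
            word_counts[label].update(counts)
    return doc_counts, word_counts


def classify(text, doc_counts, word_counts, alpha=1.0):
    """Return the category with the highest log posterior."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        # Log prior: share of training documents in this category.
        score = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Log likelihood with Laplace (add-alpha) smoothing.
            score += math.log(
                (word_counts[label][word] + alpha)
                / (total_words + alpha * len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data standing in for two partitions of a news corpus.
partition_a = [("economy", "borsa dolar faiz artti"),
               ("health", "grip asisi hastane")]
partition_b = [("economy", "merkez bankasi faiz karari"),
               ("health", "doktor hastane tedavi")]

model = merge_counts(count_partition(partition_a), count_partition(partition_b))
predicted = classify("faiz ve dolar", *model)
```

In a cluster setting, `count_partition` would correspond to the per-node map step and `merge_counts` to the reduce step; the sketch runs both locally on two small lists.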
