Scaling associative classification for very large datasets

Luca Venturini; Elena Baralis; Paolo Garza

arXiv:1805.03887 · cs.LG · May 11, 2018

TL;DR

This paper introduces DAC, a distributed associative classifier that scales to massive datasets with large-domain categorical features. It combines ensemble learning with Gini-impurity-based rule pruning to improve prediction quality while keeping the model human-readable.

Contribution

The paper presents DAC, a distributed associative classification method that uses ensemble learning to split training across parallel workers and prunes classification rules during extraction, so that it scales to large datasets without sacrificing accuracy.
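To make the ensemble idea concrete, here is a minimal single-machine sketch, not DAC's actual implementation. It assumes each record is a pair `(set_of_items, label)`, uses a deliberately simple per-partition learner (one rule per item, predicting that item's majority class), and combines the per-partition models by majority vote. The function names `train_majority_rules` and `predict` are hypothetical; in a cluster framework each training call would run on a separate worker.

```python
from collections import Counter


def train_majority_rules(partition):
    """Toy per-partition learner (an assumption, not DAC's rule miner).

    For every item seen in the partition, learn the rule
    item -> majority class among the records containing that item.
    """
    by_item = {}
    for items, label in partition:
        for item in items:
            by_item.setdefault(item, Counter())[label] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in by_item.items()}


def predict(models, items, default=None):
    """Majority vote over every rule, from every model, that matches the record."""
    votes = Counter()
    for rules in models:
        for item in items:
            if item in rules:
                votes[rules[item]] += 1
    return votes.most_common(1)[0][0] if votes else default


# Two partitions of a tiny illustrative dataset, trained independently.
partitions = [
    [({"a"}, "yes"), ({"b"}, "no")],
    [({"a"}, "yes"), ({"a", "b"}, "yes"), ({"b"}, "no"), ({"b"}, "no")],
]
models = [train_majority_rules(p) for p in partitions]
```

Because every model is a plain dictionary of rules, the ensemble stays inspectable: one can print `models` to see which item supports which class.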

Findings

01 DAC outperforms state-of-the-art solutions in prediction quality.

02 DAC reduces execution time significantly.

03 The model is human-readable, aiding interpretability.

Abstract

Supervised learning algorithms are nowadays successfully scaling up to datasets that are very large in volume, leveraging the potential of in-memory cluster-computing Big Data frameworks. Still, massive datasets with a number of large-domain categorical features are a difficult challenge for any classifier. Most off-the-shelf solutions cannot cope with this problem. In this work we introduce DAC, a Distributed Associative Classifier. DAC exploits ensemble learning to distribute the training of an associative classifier among parallel workers and improve the final quality of the model. Furthermore, it adopts several novel techniques to reach high scalability without sacrificing quality, among which a preventive pruning of classification rules in the extraction phase based on Gini impurity. We ran experiments on Apache Spark, on a real large-scale dataset with more than 4 billion records…
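The abstract mentions pruning classification rules by Gini impurity during extraction. The sketch below illustrates one plausible reading of that idea and is not taken from the paper: a candidate rule is kept only if the records matching its antecedent are purer than the dataset as a whole. The helper `keep_rule`, the `min_gain` threshold, and the record layout `(set_of_items, label)` are all assumptions made for illustration.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def keep_rule(antecedent, records, min_gain=0.0):
    """Hypothetical pruning test (an assumption, not the paper's exact criterion).

    Keep the rule only if the records matched by `antecedent` have a Gini
    impurity lower than the whole dataset's by more than `min_gain`.
    """
    all_labels = [label for _, label in records]
    matched = [label for items, label in records if antecedent <= items]
    if not matched:
        return False  # a rule that matches nothing carries no information
    return gini_impurity(all_labels) - gini_impurity(matched) > min_gain


# Tiny illustrative dataset: item "a" separates the classes, item "b" does not.
records = [
    ({"a", "b"}, "yes"),
    ({"a"}, "yes"),
    ({"b"}, "no"),
    ({"c"}, "no"),
]
```

With this data, the antecedent `{"a"}` matches only "yes" records (impurity 0) and is kept, while `{"b"}` matches one record of each class (impurity 0.5, the same as the full dataset) and is pruned.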

Code & Models

Repositories

https://gitlab.com/dbdmg/dac (official)

Taxonomy

Methods: Pruning