A Data Structure Perspective to the RDD-based Apriori Algorithm on Spark

Pankaj Singh; Sudhakar Singh; P. K. Mishra; Rakhi Garg

arXiv:1908.01338 · cs.DC · August 6, 2019

TL;DR

This paper evaluates different data structures for the Spark-based Apriori algorithm, demonstrating that Trie and Hash Table Trie significantly outperform Hash Tree in distributed big data environments.

Contribution

The paper re-investigates how the data structure used to hold candidate itemsets affects the efficiency of Spark-based Apriori, comparing Hash Tree, Trie, and Hash Table Trie in a distributed setting.
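As a rough illustration of what a trie does in Apriori, the sketch below stores candidate k-itemsets in a prefix tree and counts their support by walking each transaction through it. It is a minimal single-machine Python sketch, not the paper's Spark implementation: the names `TrieNode`, `CandidateTrie`, and `count_supports` are hypothetical, and the child links use a Python `dict`, which is closer in spirit to a hash-table-per-node variant than to a trie that searches a sorted child list.

```python
class TrieNode:
    """One node of the candidate trie (hypothetical helper, not from the paper)."""

    def __init__(self):
        self.children = {}  # item -> TrieNode
        self.count = 0      # support count, only meaningful at depth k


class CandidateTrie:
    """Prefix tree holding candidate itemsets that all have the same length k."""

    def __init__(self, candidates):
        self.root = TrieNode()
        self.k = None
        for candidate in candidates:
            self.insert(candidate)

    def insert(self, itemset):
        items = sorted(itemset)
        self.k = len(items)  # assumption: every candidate has the same length
        node = self.root
        for item in items:
            node = node.children.setdefault(item, TrieNode())

    def count_transaction(self, transaction):
        """Increment the count of every candidate contained in the transaction."""
        if self.k is None:
            return
        self._walk(self.root, sorted(set(transaction)), 0, 0)

    def _walk(self, node, items, start, depth):
        if depth == self.k:
            node.count += 1
            return
        # Stop early when too few items remain to complete a k-itemset.
        last = len(items) - (self.k - depth)
        for i in range(start, last + 1):
            child = node.children.get(items[i])
            if child is not None:
                self._walk(child, items, i + 1, depth + 1)

    def supports(self):
        """Return {candidate tuple: support count} for all stored candidates."""
        result = {}

        def collect(node, prefix):
            if len(prefix) == self.k:
                result[tuple(prefix)] = node.count
                return
            for item, child in node.children.items():
                collect(child, prefix + [item])

        if self.k is not None:
            collect(self.root, [])
        return result


def count_supports(candidates, transactions):
    """Build a trie over the candidates and count support over all transactions."""
    trie = CandidateTrie(candidates)
    for transaction in transactions:
        trie.count_transaction(transaction)
    return trie.supports()
```

In a distributed setting, a routine like `count_supports` would typically run once per data partition and the per-partition counts would then be summed; that wiring is omitted here and is an assumption about usage rather than a description of the paper's code.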

Findings

01

Trie and Hash Table Trie outperform Hash Tree in Spark-based Apriori.

02

Both Trie and Hash Table Trie have similar performance.

03

Both perform several times better than Hash Tree.

Abstract

In recent years, researchers have proposed a number of efficient and scalable frequent itemset mining algorithms for big data analytics. The first of these were MapReduce-based algorithms for Hadoop clusters. Although Hadoop was developed as a cluster computing system for handling and processing big data, its performance falls short of expectations for iterative data mining algorithms because of its high I/O cost: intermediate results are written to disk and then read back. Consequently, Spark was developed as another cluster computing infrastructure, one that is much faster than Hadoop thanks to its in-memory computation. It is highly suitable for iterative algorithms and supports batch, interactive, iterative, and stream processing of data. Many frequent itemset mining algorithms have been re-designed on Spark, and…
