A Data Structure Perspective to the RDD-based Apriori Algorithm on Spark
Pankaj Singh, Sudhakar Singh, P. K. Mishra, Rakhi Garg

TL;DR
This paper evaluates different data structures for the Spark-based Apriori algorithm, demonstrating that Trie and Hash Table Trie significantly outperform Hash Tree in distributed big data environments.
Contribution
It re-investigates the efficiency of data structures for Spark-based Apriori, comparing Hash Tree, Trie, and Hash Table Trie in a distributed setting.
Findings
Trie and Hash Table Trie outperform Hash Tree in Spark-based Apriori.
Trie and Hash Table Trie perform similarly to each other.
Both are reported to be several times faster than Hash Tree.
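To make the comparison concrete, here is a minimal single-machine sketch of counting candidate itemset support with a trie. It is not the authors' implementation and does not use Spark. All names (`TrieNode`, `build_trie`, `count_transaction`, `support_counts`) are hypothetical and chosen for illustration. Child links are stored in a Python `dict`, so each node looks up its children by hashing, which makes this sketch closest in spirit to the hash table trie variant.

```python
class TrieNode:
    """One node of a candidate trie; a root-to-node path spells a sorted itemset."""

    def __init__(self):
        self.children = {}         # item -> TrieNode (hash-based child lookup)
        self.count = 0             # support count, meaningful only for candidates
        self.is_candidate = False  # True if the path to this node is a candidate


def build_trie(candidates):
    """Insert every candidate itemset into a trie, items in sorted order."""
    root = TrieNode()
    for itemset in candidates:
        node = root
        for item in sorted(itemset):
            node = node.children.setdefault(item, TrieNode())
        node.is_candidate = True
    return root


def count_transaction(node, transaction, start=0):
    """Increment the count of every candidate contained in a sorted transaction."""
    for i in range(start, len(transaction)):
        child = node.children.get(transaction[i])
        if child is not None:
            if child.is_candidate:
                child.count += 1
            # Only items after position i can extend this path.
            count_transaction(child, transaction, i + 1)


def collect(node, prefix=()):
    """Walk the trie and return {candidate itemset (tuple): support count}."""
    result = {}
    for item, child in node.children.items():
        itemset = prefix + (item,)
        if child.is_candidate:
            result[itemset] = child.count
        result.update(collect(child, itemset))
    return result


def support_counts(candidates, transactions):
    """Count the support of each candidate over all transactions."""
    root = build_trie(candidates)
    for t in transactions:
        count_transaction(root, sorted(set(t)))
    return collect(root)


# Toy data, invented for illustration only.
transactions = [
    ["a", "b", "c"],
    ["a", "c"],
    ["b", "c"],
    ["a", "b"],
    ["a", "c", "d"],
]
candidates = [("a", "b"), ("a", "c"), ("b", "c")]
counts = support_counts(candidates, transactions)
```

Because candidates that share a prefix share a path, each transaction is matched against all candidates in one traversal instead of being tested against each candidate separately.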
Abstract
In recent years, researchers have proposed a number of efficient and scalable frequent itemset mining algorithms for big data analytics. Initially, MapReduce-based frequent itemset mining algorithms were proposed for Hadoop clusters. Although Hadoop was developed as a cluster computing system for handling and processing big data, its performance does not meet expectations for iterative data mining algorithms, because of its high I/O cost from writing intermediate results to disk and then reading them back. Consequently, Spark was developed as another cluster computing infrastructure that is much faster than Hadoop due to its in-memory computation. It is highly suitable for iterative algorithms and supports batch, interactive, iterative, and stream processing of data. Many frequent itemset mining algorithms have been re-designed on Spark, and…
