TextBenDS: a generic Textual data Benchmark for Distributed Systems
Ciprian-Octavian Truica (UPB), Elena Apostol (UPB), Jérôme Darmont (ERIC), Ira Assent

TL;DR
This paper introduces TextBenDS, a benchmark for evaluating distributed systems' performance in storing and processing textual data for top-k keyword and document extraction, emphasizing efficiency and accuracy.
Contribution
It presents a generic, document-oriented benchmark with a multidimensional data model and complex aggregation queries for text weighting schemes in distributed environments.
Findings
MongoDB shows the best overall performance.
Spark's execution time is stable across different weighting schemes.
The benchmark provides insights into the performance trade-offs in distributed text processing.
Abstract
Extracting top-k keywords and documents using weighting schemes is a popular technique in text mining and machine learning for various analysis and retrieval tasks. The weights are usually computed in the data preprocessing step, because updating them and keeping track of all the modifications performed on the dataset is costly. Furthermore, computation errors are introduced when analyzing only subsets of the dataset. Therefore, in a Big Data context, it is crucial to lower the runtime of computing weighting schemes without hindering the analysis process or the accuracy of the machine learning algorithms. To address this requirement for the task of top-k keywords and documents, it is customary to design benchmarks that compare weighting schemes within various configurations of distributed frameworks and database management systems. Thus, we propose a generic document-oriented…
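To make the task concrete, here is a minimal single-machine sketch of top-k keyword extraction with a weighting scheme. It assumes TF-IDF as the scheme (the excerpt above does not name one), whitespace tokenization, and a hypothetical helper name `top_k_keywords`. It is not the benchmark's implementation, only an illustration of the kind of aggregation the benchmark queries would have to perform at scale.

```python
import math
from collections import Counter


def top_k_keywords(docs, k):
    """Return the k (term, score) pairs with the highest summed TF-IDF weight.

    Assumptions for illustration: whitespace tokenization, relative term
    frequency, and idf = ln(N / df). Other weighting schemes would only
    change the per-term weight computed in the inner loop.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Sum each term's TF-IDF weight over all documents.
    scores = Counter()
    for tokens in tokenized:
        if not tokens:
            continue
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / df[term])
            scores[term] += tf * idf

    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]


if __name__ == "__main__":
    corpus = ["a b", "a c c"]
    for term, score in top_k_keywords(corpus, 2):
        print(term, round(score, 4))
```

Because the document frequencies depend on the whole corpus, running the same computation on a subset of the documents changes the weights. This is the source of the computation errors on subsets that the abstract mentions.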
