Benchmarking Top-K Keyword and Top-K Document Processing with T${}^2$K${}^2$ and T${}^2$K${}^2$D${}^2$
Ciprian-Octavian Truica (UPB), Jérôme Darmont (ERIC), Alexandru Boicea (UPB), Florin Radulescu (UPB)

TL;DR
This paper introduces T2K2 and T2K2D2, benchmarks designed to evaluate the performance of top-k keyword and document extraction methods across different weighting schemes and database systems using real tweet data.
Contribution
It presents the first dedicated benchmarks for top-k keyword and document extraction, enabling comparison of weighting schemes and database implementations.
Findings
TF-IDF and Okapi BM25 weighting schemes tested
Performance evaluated on Oracle, PostgreSQL, and MongoDB
Benchmarks demonstrate versatility and relevance
Abstract
Top-k keyword and top-k document extraction are very popular text analysis techniques. Top-k keywords and documents are often computed on-the-fly, but they exploit weighted vocabularies that are costly to build. To compare competing weighting schemes and database implementations, benchmarking is customary. To the best of our knowledge, no benchmark currently addresses these problems. Hence, in this paper, we present T${}^2$K${}^2$, a top-k keywords and documents benchmark, and its decision support-oriented evolution T${}^2$K${}^2$D${}^2$. Both benchmarks feature a real tweet dataset and queries with various complexities and selectivities. They help evaluate weighting schemes and database implementations in terms of computing performance. To illustrate our benchmarks' relevance and genericity, we successfully ran performance tests on the TF-IDF and Okapi BM25 weighting schemes, on one…
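To make the two weighting schemes named in the abstract concrete, here is a minimal in-memory sketch of top-k keyword extraction. It is not the benchmark's own query workload, which runs inside database systems. The function name `top_k_keywords`, its parameters, and the toy corpus are hypothetical, and the formulas are common textbook variants of TF-IDF and Okapi BM25 (with the usual defaults `k1=1.2`, `b=0.75`), which may differ from the exact variants the paper uses.

```python
import math
from collections import Counter


def top_k_keywords(docs, k=3, scheme="tfidf", k1=1.2, b=0.75):
    """Return the k highest-weighted terms of a tokenized corpus.

    docs is a list of token lists; a term's weight is summed over all
    documents. Hypothetical helper, not taken from the paper.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = Counter()
    for d in docs:
        for term, f in Counter(d).items():
            if scheme == "tfidf":
                # Textbook TF-IDF: raw count times log inverse document frequency.
                w = f * math.log(n_docs / df[term])
            else:
                # Textbook Okapi BM25 with a smoothed, non-negative IDF
                # and document-length normalization.
                idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
                w = idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avg_len))
            scores[term] += w
    return [term for term, _ in scores.most_common(k)]


# Toy corpus standing in for tokenized tweets (illustrative only).
docs = [["db", "index", "query"], ["db", "tweet", "tweet"], ["db", "tweet", "index"]]
print(top_k_keywords(docs, k=2, scheme="tfidf"))
print(top_k_keywords(docs, k=2, scheme="bm25"))
```

Both schemes need corpus-wide statistics (document frequencies, and for BM25 the average document length) before any term can be scored, which is the kind of weighted-vocabulary construction cost the abstract refers to.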
Taxonomy
Topics: Advanced Text Analysis Techniques · Graph Theory and Algorithms · Data Quality and Management
