T${}^2$K${}^2$: The Twitter Top-K Keywords Benchmark
Ciprian-Octavian Truică (UPB), Jérôme Darmont (ERIC)

TL;DR
This paper introduces T2K2, a benchmark for evaluating top-k keyword retrieval algorithms and database implementations using real Twitter data and various query complexities.
Contribution
It presents the first benchmark specifically designed for top-k keyword retrieval, enabling comparison of weighting schemes and database systems.
Findings
T2K2 effectively evaluates TF-IDF and Okapi BM25 schemes.
It compares relational and document-oriented database performances.
The benchmark demonstrates versatility across different query complexities.
Abstract
Information retrieval from textual data focuses on the construction of vocabularies that contain weighted term tuples. Such vocabularies can then be exploited by various text analysis algorithms to extract new knowledge, e.g., top-k keywords, top-k documents, etc. Top-k keywords are casually used for various purposes, are often computed on-the-fly, and thus must be efficiently computed. To compare competing weighting schemes and database implementations, benchmarking is customary. To the best of our knowledge, no benchmark currently addresses these problems. Hence, in this paper, we present a top-k keywords benchmark, T${}^2$K${}^2$, which features a real tweet dataset and queries with various complexities and selectivities. T${}^2$K${}^2$ helps evaluate weighting schemes and database implementations in terms of computing performance. To illustrate T${}^2$K${}^2$'s relevance and…
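To make the idea of weighted term tuples and top-k keywords concrete, here is a minimal Python sketch. It is not the benchmark's implementation: the function names, the whitespace tokenizer, the textbook TF-IDF and Okapi BM25 formulas (with the common defaults k1 = 1.2 and b = 0.75), and the choice to rank terms by summing their per-document weights are all assumptions made for illustration, and the paper's exact definitions may differ.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def top_k_keywords(docs, k, scheme="tfidf", k1=1.2, b=0.75):
    """Return the k terms with the largest summed weight over all docs.

    scheme="tfidf" uses tf * log(N / df); any other value uses a common
    Okapi BM25 variant. Both formulas are textbook forms assumed here.
    """
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    totals = Counter()
    for tokens in tokenized:
        tf = Counter(tokens)
        for term, f in tf.items():
            if scheme == "tfidf":
                weight = f * math.log(n_docs / df[term])
            else:
                # The "+ 1" inside the log keeps the IDF non-negative.
                idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
                norm = f + k1 * (1 - b + b * len(tokens) / avg_len)
                weight = idf * f * (k1 + 1) / norm
            totals[term] += weight

    # Sort by descending weight, breaking ties alphabetically.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]


if __name__ == "__main__":
    sample = ["rain rain sun", "sun wind", "rain cloud"]
    print(top_k_keywords(sample, 2, scheme="tfidf"))
    print(top_k_keywords(sample, 2, scheme="bm25"))
```

A benchmark of this kind would run such a computation inside a database engine, restricted to the tweets selected by each query; the in-memory version above only illustrates what is being ranked.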
