T${}^2$K${}^2$: The Twitter Top-K Keywords Benchmark

Ciprian-Octavian Truică (UPB); Jérôme Darmont (ERIC)

arXiv:1709.04747 · cs.DB · September 15, 2017

TL;DR

This paper introduces T2K2, a benchmark for evaluating top-k keyword retrieval algorithms and database implementations using real Twitter data and various query complexities.

Contribution

It presents what is, to the best of the authors' knowledge, the first benchmark designed specifically for top-k keyword retrieval, enabling comparison of weighting schemes and database implementations.

Findings

01

T2K2 effectively evaluates TF-IDF and Okapi BM25 schemes.

02

It compares the performance of relational and document-oriented database implementations.

03

The benchmark covers queries of varying complexity and selectivity.

Abstract

Information retrieval from textual data focuses on the construction of vocabularies that contain weighted term tuples. Such vocabularies can then be exploited by various text analysis algorithms to extract new knowledge, e.g., top-k keywords, top-k documents, etc. Top-k keywords are casually used for various purposes, are often computed on-the-fly, and thus must be efficiently computed. To compare competing weighting schemes and database implementations, benchmarking is customary. To the best of our knowledge, no benchmark currently addresses these problems. Hence, in this paper, we present a top-k keywords benchmark, T${}^2$K${}^2$, which features a real tweet dataset and queries with various complexities and selectivities. T${}^2$K${}^2$ helps evaluate weighting schemes and database implementations in terms of computing performance. To illustrate T${}^2$K${}^2$'s relevance and…
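
To make the notion of top-k keywords over weighted terms concrete, here is a minimal sketch in Python. It is not the benchmark's implementation: the function name `top_k_keywords`, the textbook forms of TF-IDF and Okapi BM25, the default parameters `k1=1.2` and `b=0.75`, and the choice to rank terms by their summed weight over all documents are all assumptions made for illustration. The input is assumed to be a non-empty list of non-empty, already tokenized documents.

```python
import math
from collections import Counter


def top_k_keywords(docs, k=3, scheme="tfidf", k1=1.2, b=0.75):
    """Return the k terms with the highest summed weight over all documents.

    docs   -- non-empty list of tokenized documents (lists of strings)
    scheme -- "tfidf" or "bm25" (textbook variants, assumed here)
    k1, b  -- usual Okapi BM25 tuning parameters (assumed defaults)
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))

    scores = Counter()
    for d in docs:
        for term, freq in Counter(d).items():
            # Textbook inverse document frequency (assumption).
            idf = math.log(n_docs / df[term])
            if scheme == "tfidf":
                # Length-normalized term frequency times IDF.
                weight = (freq / len(d)) * idf
            else:
                # Okapi BM25 term saturation with length normalization.
                norm = 1 - b + b * len(d) / avg_len
                weight = idf * freq * (k1 + 1) / (freq + k1 * norm)
            scores[term] += weight

    return [term for term, _ in scores.most_common(k)]


if __name__ == "__main__":
    tweets = [
        ["storm", "rain", "storm"],
        ["rain", "sun"],
        ["storm", "wind"],
    ]
    print(top_k_keywords(tweets, k=3, scheme="tfidf"))
    print(top_k_keywords(tweets, k=3, scheme="bm25"))
```

In a database setting, the same computation would typically be expressed as an aggregation query over a stored vocabulary of (term, document, weight) tuples, which is where the choice of system and weighting scheme affects running time.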
