Dealing with Sparse Document and Topic Representations: Lab Report for CHiC 2012
Philipp Schaer, Daniel Hienert, Frank Sawitzki, Andias Wira-Alam, Thomas Lüke

TL;DR
This paper discusses strategies to improve retrieval performance on a large, sparse document collection by employing term suggestion and query expansion techniques based on Wikipedia concepts and co-occurrence statistics.
Contribution
It introduces and evaluates three novel query expansion methods tailored for sparse document and topic representations in a large-scale cultural heritage dataset.
Findings
Improved retrieval effectiveness with query expansion techniques
Wikipedia-based concept extraction enhances topic understanding
Co-occurrence statistics aid in addressing data sparsity
Abstract
We report on the participation of GESIS in the first CHiC workshop (Cultural Heritage in CLEF). Because the workshop was held for the first time, there was no prior experience with the new data set, a document dump of Europeana with ca. 23 million documents. The most prominent issues that arose from pretests with this test collection were the very unspecific topics and the sparse document representations. Only about half of the topics (26/50) contained a description, and the titles were usually short, with just around two words. We therefore focused on three different term suggestion and query expansion mechanisms to overcome the sparse topical descriptions. Two methods build on concept extraction from Wikipedia, and one method applies co-occurrence statistics to the available Europeana corpus. In this paper we present the approaches and preliminary results from their assessments.
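To make the co-occurrence idea concrete, the following is a minimal sketch of term suggestion for query expansion over a toy corpus. It is not the authors' implementation: the abstract does not state which association measure or tokenization was used, so the Jaccard score, the whitespace tokenizer, the function names, and the sample documents are all assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(docs):
    """Count document frequency per term and per unordered term pair."""
    term_df = Counter()
    pair_df = Counter()
    for doc in docs:
        # Assumption: naive lowercase whitespace tokenization.
        terms = set(doc.lower().split())
        term_df.update(terms)
        pair_df.update(frozenset(p) for p in combinations(sorted(terms), 2))
    return term_df, pair_df


def suggest_terms(query, term_df, pair_df, k=3):
    """Rank candidate terms by their best Jaccard overlap with any query term."""
    query_terms = set(query.lower().split())
    scores = {}
    for pair, both in pair_df.items():
        a, b = tuple(pair)
        for q, cand in ((a, b), (b, a)):
            if q in query_terms and cand not in query_terms:
                # Jaccard: docs containing both / docs containing either.
                jaccard = both / (term_df[q] + term_df[cand] - both)
                scores[cand] = max(scores.get(cand, 0.0), jaccard)
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def expand_query(query, term_df, pair_df, k=3):
    """Append the top-k suggested terms to the original query terms."""
    return query.split() + suggest_terms(query, term_df, pair_df, k)


# Hypothetical sample records standing in for sparse metadata.
TOY_DOCS = [
    "eiffel tower paris postcard",
    "eiffel tower paris photograph",
    "paris street photograph",
    "roman coin silver",
]
term_df, pair_df = build_cooccurrence(TOY_DOCS)
expanded = expand_query("eiffel", term_df, pair_df, k=2)
print(expanded)
```

A short query such as "eiffel" gains terms that frequently share documents with it, which is one way a two-word topic title can be broadened before retrieval. On a collection of millions of records, the pair counts would need to be restricted to a vocabulary subset or computed offline.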
Taxonomy
Topics: Natural Language Processing Techniques · Wikis in Education and Collaboration · Topic Modeling
