CODEC: Complex Document and Entity Collection
Iain Mackie, Paul Owoicho, Carlos Gemmell, Sophie Fischer, Sean MacAvaney, Jeffrey Dalton

TL;DR
CODEC is a benchmark for document and entity ranking on complex, essay-style social science research topics. It provides 42 topics, a new focused web corpus with semantic annotations (including entity links), and expert judgments on 17,509 documents and entities for evaluating and improving entity-centric search systems.
Contribution
The paper introduces a benchmark that combines a focused web corpus, semantic annotations, and expert judgments, supporting the evaluation and development of entity-centric search methods.
Findings
Query expansion with entity information improves document ranking.
Manual query reformulations enhance ranking performance.
The topics are challenging: state-of-the-art systems leave headroom for improvement in both document and entity ranking.
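To make the first finding concrete, here is a minimal sketch of what expanding a query with entity information can look like. It is an illustration under stated assumptions, not the method evaluated in the paper: the function name `expand_query_with_entities`, the parameters `k` and `entity_weight`, and the example inputs are all hypothetical and do not come from the CODEC data.

```python
from collections import Counter


def expand_query_with_entities(query, ranked_entities, k=3, entity_weight=0.5):
    """Build a weighted bag of terms from a query plus its top-k entity names.

    Hypothetical helper for illustration only.
    - Each occurrence of an original query term adds weight 1.0.
    - Each term in the names of the top-k entities adds `entity_weight`.
    """
    weights = Counter()
    for term in query.lower().split():
        weights[term] += 1.0
    for name in ranked_entities[:k]:
        for term in name.lower().split():
            weights[term] += entity_weight
    return dict(weights)


# Made-up inputs: a short query and an entity ranking for it.
expanded = expand_query_with_entities(
    "open banking benefits",
    ["Open Banking", "Monzo", "Starling Bank", "Revolut"],
    k=2,
)
print(expanded)
```

The resulting weighted terms could then be passed to any term-weighted retriever. Terms that appear in both the query and a highly ranked entity name are boosted, while entity-only terms enter the query at a lower weight.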
Abstract
CODEC is a document and entity ranking benchmark that focuses on complex research topics. We target essay-style information needs of social science researchers, e.g., "How has the UK's Open Banking Regulation benefited Challenger Banks?" CODEC includes 42 topics developed by researchers and a new focused web corpus with semantic annotations including entity links. This resource includes expert judgments on 17,509 documents and entities (416.9 per topic) from diverse automatic and interactive manual runs. The manual runs include 387 query reformulations, providing data for query performance prediction and automatic rewriting evaluation. CODEC includes analysis of state-of-the-art systems, including dense retrieval and neural re-ranking. The results show the topics are challenging with headroom for document and entity ranking improvement. Query expansion with entity information shows…
Taxonomy
Topics: Topic Modeling · Data Quality and Management · Biomedical Text Mining and Ontologies
