Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated Documents Integration

Sunhao Dai; Weihao Liu; Yuqi Zhou; Liang Pang; Rongju Ruan; Gang Wang; Zhenhua Dong; Jun Xu; Ji-Rong Wen

arXiv:2405.16546·cs.IR·July 3, 2024

TL;DR

Cocktail is a new comprehensive benchmark designed to evaluate information retrieval models in a landscape increasingly populated by both human-written and LLM-generated content, addressing a critical gap in IR research tools.

Contribution

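We introduce Cocktail, a diverse IR benchmark whose datasets mix human-written and LLM-generated corpora, together with a new up-to-date dataset, NQ-UTD, for evaluating retrieval models in the LLM era. Extensive experiments on the benchmark reveal a trade-off between ranking performance and source bias.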

Findings

01

Neural retrieval models show a trade-off between ranking performance and source bias.

02

The benchmark highlights the need for balanced IR system design in the presence of LLM-generated content.

03

Cocktail provides a standardized platform for future IR research in the context of LLMs.
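
To make the first finding concrete, here is a minimal sketch of how ranking performance and source bias might be measured side by side on a mixed corpus. It is not taken from the paper or its repository: the function names `ndcg_at_k` and `relative_delta`, the binary-relevance NDCG, the symmetric percentage-gap formula, and the toy document IDs are all assumptions made for illustration.

```python
import math


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG@k for a single query (illustrative helper)."""
    # Discounted gain of each relevant document found in the top k.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    # Ideal ranking places every relevant document at the top.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def relative_delta(score_human, score_llm):
    """Signed percentage gap between the two scores (assumed bias measure).

    Negative values mean LLM-generated documents are ranked higher.
    """
    mean = (score_human + score_llm) / 2.0
    if mean == 0:
        return 0.0
    return 100.0 * (score_human - score_llm) / mean


# Toy ranking over a mixed corpus: each relevant passage exists in a
# human-written and an LLM-generated version (hypothetical IDs).
ranking = ["llm_1", "human_1", "llm_2", "human_2"]

# Score the same ranking twice, counting only one source as relevant each time.
ndcg_human = ndcg_at_k(ranking, {"human_1"})
ndcg_llm = ndcg_at_k(ranking, {"llm_1"})
bias = relative_delta(ndcg_human, ndcg_llm)
```

In this toy ranking the LLM-generated version sits above its human-written counterpart, so `bias` comes out negative. Under these assumptions, a model with high overall NDCG and a strongly negative gap would sit at the "accurate but biased" end of the trade-off.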

Abstract

The proliferation of Large Language Models (LLMs) has led to an influx of AI-generated content (AIGC) on the internet, transforming the corpus of Information Retrieval (IR) systems from solely human-written to a coexistence with LLM-generated content. The impact of this surge in AIGC on IR systems remains an open question, with the primary challenge being the lack of a dedicated benchmark for researchers. In this paper, we introduce Cocktail, a comprehensive benchmark tailored for evaluating IR models in this mixed-sourced data landscape of the LLM era. Cocktail consists of 16 diverse datasets with mixed human-written and LLM-generated corpora across various text retrieval tasks and domains. Additionally, to avoid the potential bias from previously included dataset information in LLMs, we also introduce an up-to-date dataset, named NQ-UTD, with queries derived from recent events.…

Code & Models

Repositories

kid-22/cocktail
PyTorch, official

Taxonomy

Topics: Semantic Web and Ontologies