GenTREC: The First Test Collection Generated by Large Language Models for Evaluating Information Retrieval Systems

Mehmet Deniz Türkmen; Mucahid Kutlu; Bahadir Altun; Gokalp Cosgun

arXiv:2501.02408 · cs.IR · January 7, 2025 · 2 citations

TL;DR

GenTREC introduces a novel, low-cost test collection for evaluating information retrieval systems, generated entirely by large language models, reducing reliance on manual relevance judgments while maintaining evaluation reliability.

Contribution

This paper presents the first IR test collection created solely from LLM-generated documents, demonstrating its effectiveness and compatibility with traditional collections for system evaluation.

Findings

1. GenTREC's IR system rankings align with traditional collections for key metrics.
2. The collection contains nearly 97,000 documents and 19,000 relevance judgments.
3. The approach significantly reduces resource requirements for IR evaluation.

Abstract

Building test collections for Information Retrieval evaluation has traditionally been a resource-intensive and time-consuming task, primarily due to the dependence on manual relevance judgments. While various cost-effective strategies have been explored, the development of such collections remains a significant challenge. In this paper, we present GenTREC, the first test collection constructed entirely from documents generated by a Large Language Model (LLM), eliminating the need for manual relevance judgments. Our approach is based on the assumption that documents generated by an LLM are inherently relevant to the prompts used for their generation. Based on this heuristic, we utilized existing TREC search topics to generate documents. We consider a document relevant only to the prompt that generated it, while other document-topic pairs are treated as non-relevant. To introduce…
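
The relevance heuristic stated in the abstract (a document is relevant only to the topic whose prompt generated it, and every other document-topic pair is non-relevant) can be illustrated with a minimal sketch. This is not the authors' code: the function name `build_qrels`, the topic and document identifiers, and the dictionary layout are all hypothetical, and the sketch assumes document identifiers are unique across topics.

```python
from typing import Dict, List, Tuple


def build_qrels(generated: Dict[str, List[str]]) -> Dict[Tuple[str, str], int]:
    """Label every (topic_id, doc_id) pair under the abstract's heuristic.

    `generated` maps a topic id to the ids of the documents produced from
    that topic's prompt (a hypothetical layout, assumed for illustration).
    A pair is labeled 1 only when the document came from that topic's
    prompt; all other pairs are labeled 0.
    """
    # Flatten to (source_topic, doc_id) so each document keeps its origin.
    all_docs = [(topic, doc) for topic, docs in generated.items() for doc in docs]

    qrels: Dict[Tuple[str, str], int] = {}
    for topic in generated:
        for source_topic, doc in all_docs:
            qrels[(topic, doc)] = 1 if source_topic == topic else 0
    return qrels


# Toy example with made-up identifiers.
generated = {"301": ["d1", "d2"], "302": ["d3"]}
qrels = build_qrels(generated)
```

In this toy input, `("301", "d1")` is labeled relevant and `("301", "d3")` is labeled non-relevant, so no manual judgment step is needed to produce the labels.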

Taxonomy

Topics: Topic Modeling · Information Retrieval and Search Behavior · Data Quality and Management

Methods: Sparse Evolutionary Training