Towards Understanding Bias in Synthetic Data for Evaluation

Hossein A. Rahmani; Varsha Ramineni; Emine Yilmaz; Nick Craswell; Bhaskar Mitra

arXiv:2506.10301·cs.IR·October 7, 2025

TL;DR

This paper investigates the biases present in synthetic test collections generated by Large Language Models for evaluating IR systems, analyzing their impact on evaluation reliability and system comparison.

Contribution

The paper provides a comprehensive analysis of biases in LLM-generated synthetic test collections and assesses how they affect the accuracy of system evaluation.

Findings

01. Synthetic test collections contain bias that affects evaluation results.

02. Bias significantly impacts absolute performance measurement.

03. Relative system comparisons are less affected by bias.
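
The difference between findings 02 and 03 can be illustrated with a minimal sketch. It assumes that relative comparison is measured by rank correlation (Kendall's tau) between system orderings and that absolute measurement is compared by the mean absolute difference in scores; the page does not state which measures the paper uses. The four systems and all score values are made-up toy numbers, not results from the paper, and `kendall_tau` and `mean_abs_diff` are hypothetical helper names.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a between two equal-length score lists (no tie correction)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two items")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        # Positive product: both lists order systems i and j the same way.
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


def mean_abs_diff(xs, ys):
    """Average absolute gap between paired scores."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


# Toy, invented effectiveness scores for four hypothetical systems.
# These are NOT numbers from the paper.
human_scores = [0.30, 0.42, 0.35, 0.50]      # scored on a human-built collection
synthetic_scores = [0.45, 0.58, 0.49, 0.66]  # scored on a synthetic collection

tau = kendall_tau(human_scores, synthetic_scores)
shift = mean_abs_diff(human_scores, synthetic_scores)
print(f"Kendall tau: {tau:.2f}, mean absolute difference: {shift:.3f}")
```

In this toy case every system scores noticeably higher on the synthetic collection, so the absolute numbers are not comparable, yet the ordering of the systems is identical in both lists. That is the pattern the findings describe, shown here only as an illustration under the stated assumptions.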

Abstract

Test collections are crucial for evaluating Information Retrieval (IR) systems. Creating a diverse set of user queries for these collections can be challenging, and obtaining relevance judgments, which indicate how well retrieved documents match a query, is often costly and resource-intensive. Recently, generating synthetic datasets using Large Language Models (LLMs) has gained attention in various applications. While previous work has used LLMs to generate synthetic queries or documents to improve ranking models, using LLMs to create synthetic test collections is still relatively unexplored. Previous work (Rahmani et al., 2024) showed that synthetic test collections have the potential to be used for system evaluation; however, more analysis is needed to validate this claim. In this paper, we thoroughly investigate the reliability of synthetic test collections constructed using…

Code & Models

Repositories

rahmanidashti/biassyntheticdata (official)

Taxonomy

Topics: Information Retrieval and Search Behavior · Topic Modeling · Data Quality and Management

Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training