Completing Missing Annotation: Multi-Agent Debate for Accurate and Scalable Relevance Assessment for IR Benchmarks

Minjeong Ban, Jeonghwan Choi, Hyangsuk Min, Nicole Hee-Yeon Kim, Minseok Kim, Jae-Gil Lee, and Hwanjun Song

arXiv:2602.06526·cs.CL·February 9, 2026

Open Access · 3 Reviews

TL;DR

DREAM is a multi-agent debate framework using LLMs for more accurate IR benchmark annotation, significantly reducing human effort and uncovering missing relevant data to improve IR system evaluation.

Contribution

The paper introduces DREAM, a novel multi-round debate-based relevance assessment method that enhances labeling accuracy and uncovers missing data in IR benchmarks.

Findings

1. Achieved 95.2% labeling accuracy with only 3.5% human involvement.
2. Uncovered 29,824 missing relevant chunks in IR benchmarks.
3. Re-benchmarking shows unaddressed data holes distort IR system rankings.
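
A toy illustration of how unlabeled relevant chunks can distort a system ranking. This is a minimal sketch, not the paper's evaluation code: the chunk ids, the two retrievers, and the choice of precision@k as the metric are all assumptions made for the example.

```python
from typing import Dict, List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunk ids that are labeled relevant."""
    return sum(1 for chunk_id in ranked[:k] if chunk_id in relevant) / k


# Hypothetical toy data for a single query (all ids are invented).
# The original benchmark labels only c1; c2 and c3 are relevant but unlabeled.
original_labels: Set[str] = {"c1"}
completed_labels: Set[str] = {"c1", "c2", "c3"}

runs: Dict[str, List[str]] = {
    "retriever_a": ["c1", "c9", "c8"],  # finds the one labeled chunk
    "retriever_b": ["c2", "c3", "c7"],  # finds two unlabeled relevant chunks
}


def rank_systems(relevant: Set[str], k: int = 3) -> List[str]:
    """Order retrievers from best to worst by precision@k."""
    scores = {name: precision_at_k(run, relevant, k) for name, run in runs.items()}
    return sorted(scores, key=scores.get, reverse=True)


before = rank_systems(original_labels)
after = rank_systems(completed_labels)
print("with holes:    ", before)
print("holes filled in:", after)
```

With the incomplete labels, retriever_b is scored as if it returned nothing useful; once the missing labels are added, the order of the two systems flips.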

Abstract

Information retrieval (IR) evaluation remains challenging due to incomplete IR benchmark datasets that contain unlabeled relevant chunks. While LLMs and LLM-human hybrid strategies reduce costly human effort, they remain prone to LLM overconfidence and ineffective AI-to-human escalation. To address this, we propose DREAM, a multi-round debate-based relevance assessment framework with LLM agents, built on opposing initial stances and iterative reciprocal critique. Through our agreement-based debate, it yields more accurate labeling for certain cases and more reliable AI-to-human escalation for uncertain ones, achieving 95.2% labeling accuracy with only 3.5% human involvement. Using DREAM, we build BRIDGE, a refined benchmark that mitigates evaluation bias and enables fairer retriever comparison by uncovering 29,824 missing relevant chunks. We then re-benchmark IR systems and extend…
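
The control flow described in the abstract (opposing initial stances, reciprocal critique over several rounds, agreement as the stopping signal, escalation otherwise) can be sketched roughly as follows. This is an illustrative reading of the abstract only, not the authors' implementation: the `Agent` signature, the `max_rounds` default, and both stub agents are hypothetical stand-ins for real LLM calls.

```python
from typing import Callable, Dict, Optional, Tuple, Union

# Hypothetical agent interface: (query, chunk, own stance, opponent's last
# argument) -> (updated stance, argument). A real agent would call an LLM.
Agent = Callable[[str, str, bool, str], Tuple[bool, str]]
Result = Dict[str, Union[Optional[bool], int]]


def debate(query: str, chunk: str, agent_a: Agent, agent_b: Agent,
           max_rounds: int = 3) -> Result:
    """Run a two-agent debate; escalate to a human if no agreement is reached."""
    # Opposing initial stances: A starts at "relevant", B at "not relevant".
    stance_a, arg_a = True, "initial stance: relevant"
    stance_b, arg_b = False, "initial stance: not relevant"
    for round_no in range(1, max_rounds + 1):
        # Reciprocal critique: each agent sees the other's previous argument.
        new_a = agent_a(query, chunk, stance_a, arg_b)
        new_b = agent_b(query, chunk, stance_b, arg_a)
        (stance_a, arg_a), (stance_b, arg_b) = new_a, new_b
        if stance_a == stance_b:
            # Agreement: accept the shared label without human involvement.
            return {"label": stance_a, "escalate": False, "rounds": round_no}
    # Persistent disagreement: treat the case as uncertain and hand it off.
    return {"label": None, "escalate": True, "rounds": max_rounds}


def overlap_agent(query: str, chunk: str, stance: bool, opp_arg: str) -> Tuple[bool, str]:
    """Stub agent: votes 'relevant' if any query term appears in the chunk."""
    hit = any(word in chunk.lower().split() for word in query.lower().split())
    return hit, f"term overlap: {hit}"


def stubborn_agent(query: str, chunk: str, stance: bool, opp_arg: str) -> Tuple[bool, str]:
    """Stub agent: never changes its stance, forcing a disagreement."""
    return stance, "stance unchanged"


agreed = debate("neural retrieval", "dense neural retrieval models",
                overlap_agent, overlap_agent)
stuck = debate("neural retrieval", "dense neural retrieval models",
               stubborn_agent, stubborn_agent)
print(agreed)
print(stuck)
```

In this sketch only the cases that end without agreement reach a human annotator, which is the mechanism the abstract credits for keeping human involvement low.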

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- The paper is well-written and easy to follow.
- The method is intuitive, fits the problem, and appears to outperform baselines and work well empirically.

Weaknesses

- Could also compare and contrast with related work that uses multi-agent debate to improve the performance of RAG systems: https://arxiv.org/abs/2504.13079, https://arxiv.org/abs/2501.00332
- Could clarify how the quality of the benchmark is affected by the number of agents and by the model families the agents come from.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The paper identifies and addresses a critical bottleneck in existing IR benchmarks: the incompleteness (i.e., holes) of current relevance-annotated datasets leading to unreliable retrieval performance evaluation of the RAG system.
2. The proposed multi-agent debate annotation pipeline is intuitive and clearly explained.
3. The constructed BRIDGE benchmark uncovers 4 times previously unlabeled relevant chunks compared to the original gold chunks with relatively low cost, which should be meani

Weaknesses

1. **Homogeneous, 2-agent setting for multi-agent debate**. In DREAM, both agents are Llama-3.3-70B-Instruct with temperature=0. Although this is a standard minimal setting for multi-agent debate, what if we use more than 2 agents or heterogeneous-model (i.e., different LLMs for each agent) to increase the diversity during debating? Will this bring more accurate annotation?
2. **Agreement treated as reliability without analyzing "wrong-but-agree"**. There is discussion of persistence of agreemen

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- Strong intrinsic results: the paper shows that DREAM greatly reduces the amount of human annotation needed for the same accuracy.
- Downstream utility: the paper demonstrates the utility of DREAM on augmenting a real IR dataset, where human annotation cost would be high with baseline approaches.
- Human evaluation: the evaluation set is vetted by human expert annotators.
- Baselines: the method compares to LLM-only and confidence-based baselines that support its claims.

Weaknesses

- No discussion of latency cost: compared to LLM-as-judge baselines, this debate approach is much more expensive, requiring multiple model calls across multiple rounds. While the method saves human cost, the trade-off here should be discussed.
- Novelty/major missing related work section: the paper is missing most related work on multi-agent debate, including https://arxiv.org/abs/2309.13007, https://arxiv.org/abs/2305.14325, https://arxiv.org/abs/2504.13079 which have covered stances, divergen

Taxonomy

Topics: Information Retrieval and Search Behavior · Topic Modeling · Multimodal Machine Learning Applications