QBD-RankedDataGen: Generating Custom Ranked Datasets for Improving Query-By-Document Search Using LLM-Reranking with Reduced Human Effort

Sriram Gopalakrishnan; Sunandita Patra

arXiv:2505.04732·cs.IR·May 9, 2025

Open Access

TL;DR

This paper presents QBD-RankedDataGen, a method leveraging LLMs to efficiently generate custom datasets for Query-By-Document search, reducing human effort while improving retrieval model tuning.

Contribution

It introduces a novel process for creating domain-specific QBD datasets using LLMs, enabling cost-effective and rapid dataset generation with expert input.

Findings

01. Methods reduce human effort in dataset creation

02. Generated datasets improve BM25 tuning performance

03. Approach is validated on TREC QBD datasets

Abstract

The Query-By-Document (QBD) problem is an information retrieval problem in which the query is itself a document and the retrieved candidates are documents that match it, often in a domain-specific or query-specific manner. This can be crucial for tasks such as patent matching, legal or compliance case retrieval, and academic literature review. Existing retrieval methods, including keyword search and document embeddings, can be optimized with domain-specific datasets to improve QBD search performance. However, creating these domain-specific datasets is often costly and time-consuming. Our work introduces a process to generate custom QBD-search datasets and compares a set of methods to use in this problem, which we refer to as QBD-RankedDataGen. We provide a comparative analysis of our proposed methods in terms of cost, speed, and the human interface with the domain experts. The methods we…
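To make the idea of optimizing a keyword retriever with a ranked dataset concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the function names, the toy documents, the whitespace tokenization, the grid values, and the choice of mean reciprocal rank as the tuning metric are all assumptions made for illustration. The relevance labels, which in the paper's setting would come from the generated ranked dataset, are hard-coded here.

```python
import math
from collections import Counter
from itertools import product


def bm25_scores(query_tokens, docs_tokens, k1, b):
    """Score every candidate document against a query document with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency of each term across the candidate pool.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def reciprocal_rank(scores, relevant_ids):
    """Return 1/rank of the first relevant document, or 0.0 if none is ranked."""
    ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def tune_bm25(dataset, docs_tokens, k1_grid, b_grid):
    """Grid-search k1 and b to maximise mean reciprocal rank (assumed metric)."""
    best = None
    for k1, b in product(k1_grid, b_grid):
        mrr = sum(
            reciprocal_rank(bm25_scores(q, docs_tokens, k1, b), rel)
            for q, rel in dataset
        ) / len(dataset)
        if best is None or mrr > best[0]:
            best = (mrr, k1, b)
    return best


# Hypothetical toy corpus; real query documents would be far longer.
docs = [
    "patent claim for a battery cooling system".split(),
    "legal case about compliance reporting".split(),
    "battery thermal management patent".split(),
]
# Each entry pairs a tokenized query document with the indices of relevant docs.
dataset = [
    ("battery cooling patent".split(), {0, 2}),
    ("compliance case law".split(), {1}),
]
best_mrr, best_k1, best_b = tune_bm25(dataset, docs, [0.9, 1.2, 1.5], [0.4, 0.75])
```

The grid search keeps the first parameter pair that reaches the best score, so ties resolve toward the earliest values in the grids. On a realistic dataset the metric would be computed on held-out queries to avoid overfitting the two parameters.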

Taxonomy

Topics: Information Retrieval and Search Behavior · Expert finding and Q&A systems · Text and Document Classification Technologies

Methods: Sparse Evolutionary Training