Improving Scientific Document Retrieval with Concept Coverage-based Query Set Generation
SeongKu Kang, Bowen Jin, Wonbin Kweon, Yu Zhang, Dongha Lee, Jiawei Han, Hwanjo Yu

TL;DR
This paper presents CCQGen, a novel framework that generates concept-covered queries to improve scientific document retrieval, addressing the challenge of limited annotated data in specialized domains.
Contribution
The paper introduces a concept coverage-based query generation method that adaptively produces comprehensive queries, enhancing retrieval accuracy in scientific literature.
Findings
CCQGen improves query quality and coverage.
Experiments demonstrate improved retrieval performance.
Adaptive query generation effectively covers document concepts.
Abstract
In specialized fields like the scientific domain, constructing large-scale human-annotated datasets poses a significant challenge due to the need for domain expertise. Recent methods have employed large language models to generate synthetic queries, which serve as proxies for actual user queries. However, they lack control over the content generated, often resulting in incomplete coverage of academic concepts in documents. We introduce the Concept Coverage-based Query set Generation (CCQGen) framework, designed to generate a set of queries with comprehensive coverage of the document's concepts. A key distinction of CCQGen is that it adaptively adjusts the generation process based on the previously generated queries. We identify concepts not sufficiently covered by previous queries, and leverage them as conditions for subsequent query generation. This approach guides each new query to…
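The adaptive loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: concepts are assumed to be plain strings, coverage is approximated by case-insensitive substring matching, and `template_generator` is a hypothetical stand-in for the LLM call that would generate a query conditioned on the selected concepts. The names `uncovered_concepts`, `generate_query_set`, `max_queries`, and `concepts_per_query` are illustrative assumptions as well.

```python
from typing import Callable, List, Set


def uncovered_concepts(doc_concepts: Set[str], queries: List[str]) -> Set[str]:
    """Return the document concepts that no previous query mentions.

    Assumption: coverage is a case-insensitive substring match, a crude
    proxy for whatever coverage measure the paper actually uses.
    """
    covered = {
        c for c in doc_concepts
        if any(c.lower() in q.lower() for q in queries)
    }
    return doc_concepts - covered


def generate_query_set(
    doc_concepts: Set[str],
    generate: Callable[[List[str]], str],
    max_queries: int = 5,
    concepts_per_query: int = 2,
) -> List[str]:
    """Generate queries one at a time, conditioning each on uncovered concepts."""
    queries: List[str] = []
    for _ in range(max_queries):
        # Re-check coverage against all queries generated so far.
        remaining = sorted(uncovered_concepts(doc_concepts, queries))
        if not remaining:
            break  # every concept is covered; stop early
        # Use a few uncovered concepts as conditions for the next query.
        conditions = remaining[:concepts_per_query]
        queries.append(generate(conditions))
    return queries


def template_generator(conditions: List[str]) -> str:
    """Hypothetical stand-in for an LLM conditioned on the given concepts."""
    return "papers on " + " and ".join(conditions)


concepts = {"query generation", "dense retrieval", "concept coverage"}
queries = generate_query_set(concepts, template_generator)
for q in queries:
    print(q)
```

With these three toy concepts, the loop stops once no concept remains uncovered. The `max_queries` cap bounds the loop even if the generator ignores its conditions.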
Taxonomy
Topics: Data Management and Algorithms · Web Data Mining and Analysis · Scientific Computing and Data Management
Methods: Sparse Evolutionary Training
