Revisiting Task-Oriented Dataset Search in the Era of Large Language Models: Challenges, Benchmark, and Solution
Zixin Wei, Yucan Guo, Jinyang Li, Xiaolin Han, Xiaolong Jin, Chenhao Ma

TL;DR
KATS is an end-to-end system for task-oriented dataset search over unstructured scientific literature. It combines a task-dataset knowledge graph with hybrid retrieval to address ambiguous user intent, the task-to-dataset mapping gap, and entity ambiguity.
Contribution
The paper introduces KATS, an end-to-end system built on a dynamically updatable task-dataset knowledge graph, together with CS-TDS, a new benchmark suite for evaluating dataset search methods.
Findings
KATS outperforms existing retrieval frameworks in effectiveness.
KATS demonstrates high efficiency in dataset retrieval tasks.
The CS-TDS benchmark enables standardized evaluation of dataset search systems.
Abstract
The search for suitable datasets is the critical "first step" in data-driven research, but it remains a great challenge. Researchers often need to search for datasets based on high-level task descriptions. However, existing search systems struggle with this task due to ambiguous user intent, task-to-dataset mapping and benchmark gaps, and entity ambiguity. To address these challenges, we introduce KATS, a novel end-to-end system for task-oriented dataset search from unstructured scientific literature. KATS consists of two key components, i.e., offline knowledge base construction and online query processing. The sophisticated offline pipeline automatically constructs a high-quality, dynamically updatable task-dataset knowledge graph by employing a collaborative multi-agent framework for information extraction, thereby filling the task-to-dataset mapping gap. To further address the…
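The two-stage design described in the abstract (offline knowledge base construction, then online query processing) can be illustrated with a toy sketch. This is not the KATS implementation: the triples, the function names `build_graph` and `search`, and the token-overlap scoring are all assumptions made for illustration, and the paper's multi-agent extraction and hybrid retrieval are replaced here by hard-coded input and simple keyword matching.

```python
from collections import defaultdict

# Hypothetical (paper_id, task, dataset) triples standing in for the output
# of an information-extraction step. Illustrative only, not from the paper.
TRIPLES = [
    ("p1", "question answering", "SQuAD"),
    ("p2", "question answering", "SQuAD"),
    ("p3", "question answering", "Natural Questions"),
    ("p4", "image classification", "ImageNet"),
]


def build_graph(triples):
    """Offline step: map each task to its datasets and the papers behind each edge."""
    graph = defaultdict(lambda: defaultdict(set))
    for paper_id, task, dataset in triples:
        graph[task][dataset].add(paper_id)
    return graph


def search(graph, query, top_k=5):
    """Online step: score tasks by token overlap with the query, then rank datasets.

    Assumption: overlap weighted by the number of supporting papers is a
    stand-in for whatever ranking the real system uses.
    """
    query_tokens = set(query.lower().split())
    scores = defaultdict(float)
    for task, datasets in graph.items():
        task_tokens = set(task.lower().split())
        overlap = len(query_tokens & task_tokens) / len(task_tokens)
        if overlap == 0:
            continue
        for dataset, papers in datasets.items():
            scores[dataset] += overlap * len(papers)
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:top_k]]


graph = build_graph(TRIPLES)
results = search(graph, "datasets for question answering")
```

Because the graph is a plain mapping from tasks to datasets, newly extracted triples can be added without rebuilding it, which loosely mirrors the "dynamically updatable" property the abstract mentions.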
Taxonomy
Topics: Advanced Graph Neural Networks · Topic Modeling · Information Retrieval and Search Behavior
