Revisiting Task-Oriented Dataset Search in the Era of Large Language Models: Challenges, Benchmark, and Solution

Zixin Wei; Yucan Guo; Jinyang Li; Xiaolin Han; Xiaolong Jin; Chenhao Ma

arXiv:2512.15363 · cs.DB · December 18, 2025

TL;DR

This paper presents KATS, an end-to-end system for task-oriented dataset search over scientific literature. It combines a task-dataset knowledge graph with hybrid retrieval to address ambiguous user intent, the task-to-dataset mapping gap, and entity ambiguity.

Contribution

The paper introduces KATS, an end-to-end system built on a dynamically updatable task-dataset knowledge graph, along with a new benchmark suite for evaluating dataset search methods.

Findings

01. KATS outperforms existing retrieval frameworks in effectiveness.

02. KATS demonstrates high efficiency in dataset retrieval tasks.

03. The CS-TDS benchmark enables standardized evaluation of dataset search systems.

Abstract

The search for suitable datasets is the critical "first step" in data-driven research, but it remains a great challenge. Researchers often need to search for datasets based on high-level task descriptions. However, existing search systems struggle with this task due to ambiguous user intent, task-to-dataset mapping and benchmark gaps, and entity ambiguity. To address these challenges, we introduce KATS, a novel end-to-end system for task-oriented dataset search from unstructured scientific literature. KATS consists of two key components, i.e., offline knowledge base construction and online query processing. The sophisticated offline pipeline automatically constructs a high-quality, dynamically updatable task-dataset knowledge graph by employing a collaborative multi-agent framework for information extraction, thereby filling the task-to-dataset mapping gap. To further address the…
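The split between offline construction and online querying described in the abstract can be illustrated with a toy sketch. This is not the KATS implementation: the class `TaskDatasetGraph`, its methods, the hand-written task-dataset pairs, and the token-overlap scoring are all hypothetical stand-ins chosen for illustration. In particular, the hand-written pairs replace the paper's multi-agent extraction, and simple Jaccard overlap replaces whatever retrieval the system actually uses.

```python
from collections import defaultdict


class TaskDatasetGraph:
    """Toy task-dataset graph (hypothetical; not the KATS data model)."""

    def __init__(self):
        # Maps a lowercased task name to the set of datasets linked to it.
        self.task_to_datasets = defaultdict(set)

    def add_edge(self, task, dataset):
        """Offline step: record that a dataset is used for a task."""
        self.task_to_datasets[task.lower()].add(dataset)

    def search(self, query):
        """Online step: rank datasets by Jaccard overlap between the
        query tokens and each task name (an assumed, simplistic scorer)."""
        query_tokens = set(query.lower().split())
        if not query_tokens:
            return []
        scores = defaultdict(float)
        for task, datasets in self.task_to_datasets.items():
            task_tokens = set(task.split())
            overlap = len(query_tokens & task_tokens) / len(query_tokens | task_tokens)
            if overlap > 0:
                for dataset in datasets:
                    scores[dataset] = max(scores[dataset], overlap)
        # Highest score first; ties broken alphabetically for determinism.
        return sorted(scores, key=lambda d: (-scores[d], d))


# Offline: pairs that an extraction pipeline might produce (hand-written here).
graph = TaskDatasetGraph()
graph.add_edge("question answering", "SQuAD")
graph.add_edge("image classification", "ImageNet")
graph.add_edge("image classification", "CIFAR-10")

# Online: a high-level task description as the query.
results = graph.search("datasets for image classification")
```

Here `results` holds the two datasets linked to the image classification task, while the question answering dataset is excluded because its task name shares no tokens with the query. A real system would need to handle paraphrased tasks and ambiguous entity names, which exact token matching cannot do.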

Taxonomy

Topics: Advanced Graph Neural Networks · Topic Modeling · Information Retrieval and Search Behavior