DataFinder: Scientific Dataset Recommendation from Natural Language Descriptions
Vijay Viswanathan, Luyu Gao, Tongshuang Wu, Pengfei Liu, Graham Neubig

TL;DR
DataFinder introduces a natural language-based dataset recommendation system that uses a new dataset and a bi-encoder model to improve search relevance for researchers seeking suitable datasets.
Contribution
The paper presents the DataFinder Dataset and a novel bi-encoder retrieval method for recommending datasets from natural language descriptions, outperforming existing search engines.
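As a rough illustration of the bi-encoder idea, the query and each dataset description are encoded independently into vectors, and datasets are ranked by vector similarity. The sketch below is not the paper's model: the `encode` function is a toy bag-of-words stand-in for a learned neural encoder, and the dataset names and descriptions in `DATASETS` are hypothetical.

```python
import math
import re
from collections import Counter

def encode(text):
    """Toy encoder (assumption): L2-normalised bag-of-words vector.
    A real bi-encoder would use a trained neural text encoder here."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}

def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(tok, 0.0) for tok, w in u.items())

def rank_datasets(query, datasets, top_k=3):
    """Encode the query and each description separately, then rank by similarity."""
    q = encode(query)
    # The dataset vectors do not depend on the query, so they could be precomputed once.
    index = {name: encode(desc) for name, desc in datasets.items()}
    scored = sorted(((dot(q, vec), name) for name, vec in index.items()), reverse=True)
    return [name for _, name in scored[:top_k]]

# Hypothetical dataset descriptions, for illustration only.
DATASETS = {
    "qa-passages": "question answering dataset of text passages with crowd-sourced questions",
    "image-labels": "image classification dataset of labeled photographs",
    "speech-clips": "speech recognition dataset of transcribed audio clips",
}

if __name__ == "__main__":
    print(rank_datasets("question answering over text passages", DATASETS))
```

Because the two sides are encoded separately, the dataset vectors can be built offline and reused for every query, which is the usual reason to prefer a bi-encoder for search over a large collection.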
Findings
Bi-encoder retriever outperforms baselines in relevance.
DataFinder Dataset includes 17.5K queries for training.
System effectively matches datasets to research needs.
Abstract
Modern machine learning relies on datasets to develop and validate research ideas. Given the growth of publicly available data, finding the right dataset to use is increasingly difficult. Any research question imposes explicit and implicit constraints on how well a given dataset will enable researchers to answer this question, such as dataset size, modality, and domain. We operationalize the task of recommending datasets given a short natural language description of a research idea, to help people find relevant datasets for their needs. Dataset recommendation poses unique challenges as an information retrieval problem; datasets are hard to directly index for search and there are no corpora readily available for this task. To facilitate this task, we build the DataFinder Dataset which consists of a larger automatically-constructed training set (17.5K queries) and a smaller…
Taxonomy
Topics: Topic Modeling · Text and Document Classification Technologies · Natural Language Processing Techniques
Methods: Test
