Comparing how Large Language Models perform against keyword-based searches for social science research data discovery
Mark Green, Maura Halstead, Caroline Jay, Richard Kingston, Alex Singleton, David Topping

TL;DR
This study compares large language model (LLM)-based semantic search with traditional keyword search for social science data discovery. It shows that the LLM-based approach has advantages in handling complex, misspelled, and place-based queries, which improves data retrieval effectiveness.
Contribution
The paper provides an empirical evaluation demonstrating that LLM-based semantic search enhances social science data discovery by effectively handling complex and misspelled queries, complementing traditional keyword search.
Findings
Semantic search returns more results, especially for complex queries.
Datasets retrieved by the two methods show high semantic similarity.
Semantic search is robust to spelling errors and interprets geographic relevance.
Abstract
This paper evaluates the performance of a large language model (LLM) based semantic search tool relative to a traditional keyword-based search for data discovery. Using real-world search behaviour, we compare outputs from a bespoke semantic search system applied to UKRI data services with the Consumer Data Research Centre (CDRC) keyword search. Analysis is based on 131 of the most frequently used search terms extracted from CDRC search logs between December 2023 and October 2024. We assess differences in the volume, overlap, ranking, and relevance of returned datasets using descriptive statistics, qualitative inspection, and quantitative similarity measures, including exact dataset overlap, Jaccard similarity, and cosine similarity derived from BERT embeddings. Results show that the semantic search consistently returns a larger number of results than the keyword search and performs…
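The similarity measures named in the abstract (exact dataset overlap, Jaccard similarity, and cosine similarity) can be illustrated with a minimal sketch. This is not the authors' code: the dataset identifiers and the two short vectors below are hypothetical placeholders, and the vectors merely stand in for BERT embeddings, which the sketch does not compute.

```python
import math


def exact_overlap(results_a, results_b):
    """Count datasets that appear in both result lists."""
    return len(set(results_a) & set(results_b))


def jaccard(results_a, results_b):
    """Jaccard similarity: size of intersection divided by size of union."""
    set_a, set_b = set(results_a), set(results_b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty result lists
    return len(set_a & set_b) / len(union)


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical result lists for a single query (illustrative only).
keyword_results = ["ds_a", "ds_b", "ds_c"]
semantic_results = ["ds_b", "ds_c", "ds_d", "ds_e"]

# Hypothetical stand-ins for embedding vectors of two dataset descriptions.
embedding_1 = [0.2, 0.7, 0.1]
embedding_2 = [0.25, 0.6, 0.2]

print(exact_overlap(keyword_results, semantic_results))
print(jaccard(keyword_results, semantic_results))
print(cosine(embedding_1, embedding_2))
```

Exact overlap and Jaccard similarity compare result lists by dataset identity, so they score zero when the two searches return different datasets on the same topic. Cosine similarity over embeddings can still register that kind of topical closeness, which is presumably why the abstract lists the measures side by side.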
Taxonomy
Topics: Computational and Text Analysis Methods · Expert finding and Q&A systems · Data Quality and Management
