Comparing how Large Language Models perform against keyword-based searches for social science research data discovery

Mark Green; Maura Halstead; Caroline Jay; Richard Kingston; Alex Singleton; David Topping

arXiv:2601.19559·cs.IR·January 28, 2026

Open Access

TL;DR

This study compares large language model (LLM)-based semantic search with traditional keyword search for social science data discovery. It finds that the semantic approach copes better with complex, misspelled, and place-based queries, which improves data retrieval.

Contribution

The paper provides an empirical evaluation demonstrating that LLM-based semantic search enhances social science data discovery by effectively handling complex and misspelled queries, complementing traditional keyword search.

Findings

01 Semantic search returns more results, especially for complex queries.

02 The datasets retrieved by the two methods are semantically highly similar.

03 Semantic search is robust to spelling errors and interprets geographic relevance.

Abstract

This paper evaluates the performance of a large language model (LLM) based semantic search tool relative to a traditional keyword-based search for data discovery. Using real-world search behaviour, we compare outputs from a bespoke semantic search system applied to UKRI data services with the Consumer Data Research Centre (CDRC) keyword search. Analysis is based on 131 of the most frequently used search terms extracted from CDRC search logs between December 2023 and October 2024. We assess differences in the volume, overlap, ranking, and relevance of returned datasets using descriptive statistics, qualitative inspection, and quantitative similarity measures, including exact dataset overlap, Jaccard similarity, and cosine similarity derived from BERT embeddings. Results show that the semantic search consistently returns a larger number of results than the keyword search and performs…
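
The overlap and similarity measures named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the dataset identifiers and the three-dimensional vectors below are hypothetical placeholders, and the vectors merely stand in for the BERT embeddings the paper uses (how those embeddings are produced and aggregated is not stated in this excerpt).

```python
import math


def jaccard(a, b):
    """Jaccard similarity between two collections of dataset IDs."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Convention chosen for this sketch: two empty result lists score 0.
        return 0.0
    return len(set_a & set_b) / len(union)


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # undefined for zero vectors; 0 is an assumption here
    return dot / (norm_u * norm_v)


# Hypothetical result lists for one search term (IDs are made up).
keyword_results = ["ds_retail", "ds_footfall", "ds_census"]
semantic_results = ["ds_footfall", "ds_census", "ds_mobility", "ds_housing"]

# Exact dataset overlap: IDs returned by both systems.
exact_overlap = set(keyword_results) & set(semantic_results)

# 2 shared IDs out of 5 distinct IDs gives a Jaccard similarity of 0.4.
jaccard_score = jaccard(keyword_results, semantic_results)

# Toy vectors standing in for embeddings of each result set.
keyword_vec = [1.0, 0.0, 1.0]
semantic_vec = [1.0, 1.0, 0.0]
cosine_score = cosine(keyword_vec, semantic_vec)

print(sorted(exact_overlap), jaccard_score, round(cosine_score, 3))
```

Exact overlap and Jaccard similarity only reward identical dataset IDs, whereas an embedding-based cosine score can stay high when the two systems return different but topically related datasets, which is presumably why the paper reports both kinds of measure.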

Taxonomy

Topics: Computational and Text Analysis Methods · Expert finding and Q&A systems · Data Quality and Management