A Reference Architecture for Agentic Hybrid Retrieval in Dataset Search
Riccardo Terrenzi, Phongsakon Mark Konrad, Tim Lukas Adam, Serkan Ayvaz

TL;DR
This paper introduces a modular, auditable reference architecture for hybrid dataset search that combines lexical retrieval, dense retrieval, and LLM orchestration to improve the matching of natural-language queries against heterogeneous metadata.
Contribution
It proposes a novel software architecture for dataset search that integrates LLM planning, retrieval, and augmentation, with analysis of architectural tradeoffs and governance tactics.
Findings
Hybrid retrieval with LLM orchestration improves dataset search effectiveness.
Metadata augmentation with pseudo-queries enhances retrieval recall.
Architectural analysis reveals tradeoffs in modifiability, observability, and performance.
Abstract
Ad hoc dataset search requires matching underspecified natural-language queries against sparse, heterogeneous metadata records, a task where typical lexical or dense retrieval alone falls short. We reposition dataset search as a software-architecture problem and propose a bounded, auditable reference architecture for agentic hybrid retrieval that combines BM25 lexical search with dense-embedding retrieval via reciprocal rank fusion (RRF), orchestrated by a large language model (LLM) agent that repeatedly plans queries, evaluates the sufficiency of results, and reranks candidates. To reduce the vocabulary mismatch between user intent and provider-authored metadata, we introduce an offline metadata augmentation step in which an LLM generates pseudo-queries for each dataset record, augmenting both retrieval indexes before query time. Two architectural styles are examined: a single ReAct…
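The fusion step named in the abstract, reciprocal rank fusion (RRF), can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the dataset identifiers, and the constant k = 60 (a commonly used default for RRF) are assumptions made for illustration only. Each document receives the sum of 1 / (k + rank) over every ranking in which it appears, so a record ranked well by both the BM25 and the dense retriever rises to the top.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs with RRF.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant (60 is a common default; assumed here,
       not taken from the paper).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top result contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from a lexical (BM25) and a dense retriever.
bm25_ranking = ["ds_air_quality", "ds_traffic", "ds_weather"]
dense_ranking = ["ds_weather", "ds_air_quality", "ds_census"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Because RRF uses only rank positions, it needs no score normalization between the lexical and dense retrievers, which is one reason it is a convenient choice for hybrid search.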
