Querying Structured Data Through Natural Language Using Language Models
Hontan Valentin-Micu, Bunea Andrei-Alexandru, Tantaroudas Nikolaos Dimitrios, Popovici Dan-Matei

TL;DR
This paper introduces a methodology for querying structured datasets with natural language by training a compact language model to generate executable queries, enabling resource-efficient and accurate data access.
Contribution
It presents a pipeline for synthetic data generation and fine-tunes a small model to handle structured data queries, outperforming larger models in resource-constrained settings.
Findings
High accuracy across monolingual, multilingual, and unseen-location scenarios
Effective on a dataset describing accessibility to essential services across Durangaldea, Spain
Small models can achieve high precision without large proprietary LLMs
Abstract
This paper presents an open-source methodology for allowing users to query structured, non-textual datasets through natural language. Unlike Retrieval-Augmented Generation (RAG), which struggles with numerical and highly structured information, our approach trains an LLM to generate executable queries. To support this capability, we introduce a principled pipeline for synthetic training data generation, producing diverse question-answer pairs that capture both user intent and the semantics of the underlying dataset. We fine-tune a compact model, DeepSeek R1 Distill 8B, using QLoRA with 4-bit quantization, making the system suitable for deployment on commodity hardware. We evaluate our approach on a dataset describing accessibility to essential services across Durangaldea, Spain. The fine-tuned model achieves high accuracy across monolingual, multilingual, and unseen location scenarios, demonstrating both…
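The idea of pairing natural-language questions with executable queries can be illustrated with a minimal sketch. It is not the authors' pipeline: the query language (SQL via `sqlite3`), the `access` table, its columns, the travel-time values, the two templates, and the helper names `build_db`, `generate_pairs`, and `run_query` are all assumptions invented for illustration. The sketch fills question and query templates from values in the table, producing training pairs whose query side can be executed to check that it returns an answer.

```python
import random
import sqlite3

# Hypothetical rows (town, service, travel time in minutes), invented for
# illustration only; they are not taken from the paper's dataset.
ROWS = [
    ("Durango", "pharmacy", 4.0),
    ("Durango", "hospital", 12.5),
    ("Elorrio", "pharmacy", 6.0),
    ("Elorrio", "hospital", 21.0),
]

# Each template pairs a question pattern with a query pattern (assumed SQL).
TEMPLATES = [
    (
        "How many minutes does it take to reach the nearest {service} from {town}?",
        "SELECT minutes FROM access WHERE town = '{town}' AND service = '{service}'",
    ),
    (
        "Which town has the fastest access to a {service}?",
        "SELECT town FROM access WHERE service = '{service}' ORDER BY minutes ASC LIMIT 1",
    ),
]


def build_db(rows):
    """Load the rows into an in-memory SQLite table named `access`."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE access (town TEXT, service TEXT, minutes REAL)")
    conn.executemany("INSERT INTO access VALUES (?, ?, ?)", rows)
    return conn


def generate_pairs(rows, n, seed=0):
    """Return n (question, query) pairs by filling templates with table values."""
    rng = random.Random(seed)
    towns = sorted({r[0] for r in rows})
    services = sorted({r[1] for r in rows})
    pairs = []
    for _ in range(n):
        question, query = rng.choice(TEMPLATES)
        # Slot values come from the table itself, so plain string formatting
        # is acceptable in this sketch; untrusted input would need binding.
        slots = {"town": rng.choice(towns), "service": rng.choice(services)}
        pairs.append((question.format(**slots), query.format(**slots)))
    return pairs


def run_query(conn, query):
    """Execute a generated query and return all result rows."""
    return conn.execute(query).fetchall()
```

In a setup like this, each generated query can be run with `run_query` before the pair is kept, so that only pairs whose query executes and returns a result are used as fine-tuning examples. Whether the paper applies such a filter is not stated in the abstract.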
