Tatarstan Toponyms: A Bilingual Dataset and Hybrid RAG System for Geospatial Question Answering

Mullosharaf K. Arabov

arXiv:2605.05962 · cs.CL · May 8, 2026

TL;DR

This paper introduces a bilingual toponym dataset for the Republic of Tatarstan and a hybrid retrieval system for geospatial question answering. On 500 test queries the retriever reaches Recall@1=0.988 and MRR=0.994, outperforming BM25 and purely spatial methods.

Contribution

The paper presents a bilingual dataset of 9,688 Tatarstan toponym records and a hybrid retriever that combines dense semantic indexing (multilingual-e5-large) with geospatial filtering via KD-trees and haversine distance.

Findings

01  Hybrid retriever achieves Recall@1=0.988 and MRR=0.994.

02  XLM-RoBERTa-large attains EM=0.992 and F1=0.994.

03  Resources are openly available on Hugging Face.
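
For readers unfamiliar with the retrieval metrics quoted in the findings, the following is a minimal sketch of how Recall@k and MRR are commonly computed. It assumes exactly one gold record per query; the function names and the toy data are hypothetical and are not taken from the paper's code.

```python
def recall_at_k(ranked_ids, gold_id, k):
    """Return 1.0 if the gold record appears among the top-k results, else 0.0."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0


def reciprocal_rank(ranked_ids, gold_id):
    """Return 1/rank of the gold record (rank starts at 1), or 0.0 if it is absent."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == gold_id:
            return 1.0 / rank
    return 0.0


def evaluate(runs, ks=(1, 5)):
    """Average the per-query metrics over a list of (ranked_ids, gold_id) pairs."""
    n = len(runs)
    metrics = {
        f"recall@{k}": sum(recall_at_k(ranked, gold, k) for ranked, gold in runs) / n
        for k in ks
    }
    metrics["mrr"] = sum(reciprocal_rank(ranked, gold) for ranked, gold in runs) / n
    return metrics


# Toy example (hypothetical record ids): gold found at rank 1, at rank 2, and not at all.
TOY_RUNS = [
    (["a", "b"], "a"),
    (["b", "a"], "a"),
    (["c", "d"], "a"),
]
TOY_METRICS = evaluate(TOY_RUNS)
```

Under this definition, Recall@5=1.000 means the gold record was always within the top five results, while MRR also rewards placing it higher in the ranking.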

Abstract

This paper addresses automatic geospatial question answering over multilingual toponymic data. An original bilingual dataset of toponyms of the Republic of Tatarstan is introduced, comprising 9,688 structured records with linguistic, etymological, administrative, and coordinate information (93.1% georeferenced). Based on this dataset, a question-answering corpus of approximately 39,000 question-context-answer triples is constructed with guaranteed answer localization. A hybrid retriever integrates dense semantic indexing (multilingual-e5-large) with geospatial filtering via KD-trees and haversine distance. On 500 test queries, the hybrid search achieves Recall@1=0.988, Recall@5=1.000, and MRR=0.994, significantly outperforming BM25 and purely spatial methods. Among tested reader architectures (RuBERT, XLM-RoBERTa-large, T5-RUS), XLM-RoBERTa-large attains the best quality: EM=0.992,…
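
The abstract names the retriever's ingredients (dense semantic scores, a KD-tree, haversine distance) but this excerpt does not say how they are fused. The sketch below assumes one simple option: use a KD-tree over 3D unit vectors to collect candidates within a radius, confirm each with the haversine distance, then rank the survivors by semantic score. All function names are hypothetical, the coordinates are approximate city locations rather than dataset records, and the semantic scores are placeholders standing in for multilingual-e5-large similarities.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def to_unit_vector(lat, lon):
    """Map (lat, lon) in degrees to a point on the unit sphere."""
    phi, lam = math.radians(lat), math.radians(lon)
    return (math.cos(phi) * math.cos(lam), math.cos(phi) * math.sin(lam), math.sin(phi))


def build_kdtree(items, depth=0):
    """Build a KD-tree from (vector, record_index) pairs as nested (item, left, right) tuples."""
    if not items:
        return None
    axis = depth % 3
    items = sorted(items, key=lambda item: item[0][axis])
    mid = len(items) // 2
    return (items[mid], build_kdtree(items[:mid], depth + 1), build_kdtree(items[mid + 1:], depth + 1))


def query_ball(node, target, radius, depth=0, found=None):
    """Collect indices of all points within Euclidean `radius` of `target`."""
    if found is None:
        found = []
    if node is None:
        return found
    (vec, idx), left, right = node
    axis = depth % 3
    if math.dist(vec, target) <= radius:
        found.append(idx)
    diff = target[axis] - vec[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    query_ball(near, target, radius, depth + 1, found)
    if abs(diff) <= radius:  # the ball crosses the splitting plane
        query_ball(far, target, radius, depth + 1, found)
    return found


def hybrid_search(tree, records, semantic_scores, lat, lon, radius_km, top_k=5):
    """Assumed fusion: spatial filter first, then rank by semantic score (ties: nearer first)."""
    # A great-circle radius r corresponds to a chord of length 2*sin(r / 2R) on the unit sphere.
    chord = 2 * math.sin(min(radius_km / (2 * EARTH_RADIUS_KM), math.pi / 2)) * (1 + 1e-9)
    hits = []
    for i in query_ball(tree, to_unit_vector(lat, lon), chord):
        name, rec_lat, rec_lon = records[i]
        dist = haversine_km(lat, lon, rec_lat, rec_lon)
        if dist <= radius_km:  # exact check with haversine distance
            hits.append((semantic_scores[i], -dist, name))
    hits.sort(reverse=True)
    return [(name, score, -neg_dist) for score, neg_dist, name in hits[:top_k]]


# Illustrative records (name, lat, lon) with approximate coordinates.
RECORDS = [
    ("Kazan", 55.7887, 49.1221),
    ("Naberezhnye Chelny", 55.7436, 52.3958),
    ("Almetyevsk", 54.9014, 52.2973),
    ("Yelabuga", 55.7567, 52.0544),
]
SEMANTIC_SCORES = [0.62, 0.71, 0.80, 0.90]  # placeholder similarities, not real model output
TREE = build_kdtree([(to_unit_vector(lat, lon), i) for i, (_, lat, lon) in enumerate(RECORDS)])
```

Calling `hybrid_search(TREE, RECORDS, SEMANTIC_SCORES, 55.74, 52.40, radius_km=50.0)` keeps only the records within 50 km of the query point and orders them by the placeholder scores. Indexing unit vectors rather than raw degrees lets the Euclidean KD-tree search agree with great-circle distance, which a tree built directly on latitude and longitude would not guarantee.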
