TL;DR
This paper introduces a bilingual toponym dataset and a hybrid retrieval system for geospatial question answering; the retriever reaches Recall@1=0.988 and outperforms BM25 and purely spatial methods.
Contribution
It presents a bilingual dataset of toponyms of the Republic of Tatarstan and a hybrid retriever that combines dense semantic indexing with geospatial filtering for geospatial QA.
Findings
Hybrid retriever achieves Recall@1=0.988 and MRR=0.994.
XLM-RoBERTa-large attains EM=0.992 and F1=0.994.
Resources are openly available on Hugging Face.
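For readers unfamiliar with the metrics reported above, the following is a minimal sketch of how Recall@k, MRR, EM, and token-level F1 are commonly computed. It assumes one gold record id per query and SQuAD-style whitespace tokenization with lowercasing; the paper's exact normalization is not stated in this summary, and all function names here are illustrative.

```python
from collections import Counter
from typing import List, Sequence


def recall_at_k(ranked: Sequence[Sequence[str]], gold: Sequence[str], k: int) -> float:
    """Fraction of queries whose gold id appears among the top-k retrieved ids."""
    hits = sum(1 for r, g in zip(ranked, gold) if g in list(r)[:k])
    return hits / len(gold)


def mean_reciprocal_rank(ranked: Sequence[Sequence[str]], gold: Sequence[str]) -> float:
    """Mean of 1/rank of the gold id; a query contributes 0 if the id is absent."""
    total = 0.0
    for r, g in zip(ranked, gold):
        ids = list(r)
        if g in ids:
            total += 1.0 / (ids.index(g) + 1)
    return total / len(gold)


def _tokens(text: str) -> List[str]:
    # Assumed normalization: lowercase and split on whitespace.
    return text.lower().split()


def exact_match(prediction: str, answer: str) -> float:
    """1.0 if the normalized prediction equals the normalized answer, else 0.0."""
    return float(_tokens(prediction) == _tokens(answer))


def token_f1(prediction: str, answer: str) -> float:
    """Harmonic mean of token precision and recall between prediction and answer."""
    pred, gold = _tokens(prediction), _tokens(answer)
    if not pred or not gold:
        return float(pred == gold)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Corpus-level EM and F1 are then the averages of `exact_match` and `token_f1` over all test questions.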
Abstract
This paper addresses automatic geospatial question answering over multilingual toponymic data. An original bilingual dataset of toponyms of the Republic of Tatarstan is introduced, comprising 9,688 structured records with linguistic, etymological, administrative, and coordinate information (93.1% georeferenced). Based on this dataset, a question-answering corpus of approximately 39,000 question-context-answer triples is constructed with guaranteed answer localization. A hybrid retriever integrates dense semantic indexing (multilingual-e5-large) with geospatial filtering via KD-trees and haversine distance. On 500 test queries, the hybrid search achieves Recall@1=0.988, Recall@5=1.000, and MRR=0.994, significantly outperforming BM25 and purely spatial methods. Among tested reader architectures (RuBERT, XLM-RoBERTa-large, T5-RUS), XLM-RoBERTa-large attains the best quality: EM=0.992,…
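The hybrid retrieval idea described in the abstract can be illustrated with a small sketch: keep only candidates within a haversine radius of the query point, then order the survivors by semantic similarity. This is not the authors' implementation. A brute-force scan stands in for the KD-tree, the semantic scores are assumed to be precomputed (for example by an embedding model), and the function names, the radius parameter, and the tie-breaking rule are all hypothetical.

```python
import math
from typing import List, Tuple

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

# (name, latitude_deg, longitude_deg, semantic_score); the score is assumed given
Candidate = Tuple[str, float, float, float]


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def hybrid_rank(query_lat: float, query_lon: float, candidates: List[Candidate],
                radius_km: float, top_k: int = 5) -> List[str]:
    """Spatially filter candidates, then rank the survivors by semantic score."""
    nearby = []
    for name, lat, lon, score in candidates:
        dist = haversine_km(query_lat, query_lon, lat, lon)
        if dist <= radius_km:
            nearby.append((name, score, dist))
    # Higher semantic score first; nearer candidate wins ties (an assumed rule).
    nearby.sort(key=lambda item: (-item[1], item[2]))
    return [name for name, _, _ in nearby[:top_k]]
```

With a query at (0, 0) and a 50 km radius, a high-scoring candidate ten degrees away is dropped by the spatial filter, while the closer candidates are returned in order of semantic score. A KD-tree replaces the linear scan when the index is large, since it avoids computing the distance to every record.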
