Nepali Passport Question Answering: A Low-Resource Dataset for Public Service Applications
Funghang Limbu Begha, Praveen Acharya, Bal Krishna Bal

TL;DR
This paper introduces a Nepali question-answer dataset for passport services and demonstrates that fine-tuned multilingual embedding models outperform traditional retrieval methods in this low-resource language setting.
Contribution
It creates a novel Nepali FAQ dataset for passport services and evaluates transformer-based models, showing improved retrieval performance over baseline methods.
Findings
Multilingual E5 embeddings achieve the highest retrieval accuracy.
Fine-tuned SBERT models outperform the BM25 baseline.
Hybrid retrieval that integrates the fine-tuned models with BM25 enhances performance.
Abstract
Nepali, a low-resource language, faces significant challenges in building effective information retrieval (IR) systems due to the lack of annotated data and computational linguistic resources. In this study, we address this gap by preparing a pair-structured Nepali question-answer dataset. We focus on Frequently Asked Questions (FAQs) for passport-related services, building a dataset for training and evaluating IR models. We fine-tune transformer-based embedding models for semantic similarity in question-answer retrieval and compare them with a BM25 baseline. In addition, we implement a hybrid retrieval approach that integrates the fine-tuned models with BM25, and we evaluate its performance. Our results show that the fine-tuned SBERT-based models outperform BM25, whereas multilingual E5 embedding-based models…
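The abstract does not say how the fine-tuned models and BM25 are combined, so the following is only a minimal sketch of one common way to build a hybrid retriever: compute BM25 scores, min-max normalize both score lists, and take a weighted sum. The fusion rule, the weight `alpha`, the helper names, the English toy questions, and the placeholder dense similarities are all assumptions made for illustration; none of them come from the paper or its dataset.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def min_max(values):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_scores(sparse, dense, alpha=0.5):
    """Weighted sum of normalized sparse (BM25) and dense (embedding) scores."""
    s, d = min_max(sparse), min_max(dense)
    return [alpha * x + (1 - alpha) * y for x, y in zip(s, d)]


# Toy FAQ questions (English placeholders, NOT taken from the dataset).
faq = [
    "how do i renew my passport",
    "what documents are required for a new passport",
    "how long does passport delivery take",
]
query = "documents needed for passport"
sparse = bm25_scores(query.split(), [q.split() for q in faq])
# Placeholder cosine similarities (assumed values); in practice these would
# come from an embedding model scoring the query against each FAQ question.
dense = [0.42, 0.81, 0.37]
fused = hybrid_scores(sparse, dense, alpha=0.5)
best = max(range(len(faq)), key=fused.__getitem__)
print(faq[best])
```

Normalizing before fusion matters because BM25 scores are unbounded while cosine similarities are not; without it, one signal can dominate regardless of `alpha`. Whitespace tokenization is used here only to keep the sketch short, and Nepali text would likely need a language-appropriate tokenizer.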
Taxonomy
Topics: Topic Modeling · Expert finding and Q&A systems · Information Retrieval and Search Behavior
