TL;DR
KinyaColBERT is a novel retrieval model designed for low-resource languages, combining morphology-based tokenization and late word-level interactions to improve retrieval accuracy in retrieval-augmented generation systems.
Contribution
The paper introduces KinyaColBERT, a new retrieval model that enhances low-resource language retrieval through morphology-based tokenization and late interaction mechanisms.
Findings
KinyaColBERT outperforms existing baselines on Kinyarwanda retrieval tasks.
Morphology-based tokenization improves language coverage and retrieval accuracy.
Higher retrieval accuracy makes KinyaColBERT a cost-saving option for RAG deployments in low-resource settings.
Abstract
The recent mainstream adoption of large language model (LLM) technology is enabling novel applications in the form of chatbots and virtual assistants across many domains. With the aim of grounding LLMs in trusted domains and avoiding the problem of hallucinations, retrieval-augmented generation (RAG) has emerged as a viable solution. In order to deploy sustainable RAG systems in low-resource settings, achieving high retrieval accuracy is not only a usability requirement but also a cost-saving strategy. Through empirical evaluations on a Kinyarwanda-language dataset, we find that the most limiting factors in achieving high retrieval accuracy are limited language coverage and inadequate sub-word tokenization in pre-trained language models. We propose a new retriever model, KinyaColBERT, which integrates two key concepts: late word-level interactions between queries and documents, and a…
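To make "late word-level interactions" concrete, here is a minimal sketch of the MaxSim scoring used by ColBERT-style retrievers, applied to one embedding per word. It is an illustration only: the abstract does not spell out KinyaColBERT's exact scoring function, the function names `normalize`, `maxsim_score`, and `rank` are hypothetical, and the vectors are assumed to come from some word-level encoder that is not shown.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style late interaction score (assumed form, not taken from the paper).

    For each query word embedding, take its best cosine match among the
    document word embeddings, then sum those maxima over the query words.
    """
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    if not d:
        return 0.0  # an empty document matches nothing
    total = 0.0
    for qv in q:
        total += max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
    return total


def rank(query_vecs, docs):
    """Return document ids sorted from highest to lowest MaxSim score."""
    scores = {doc_id: maxsim_score(query_vecs, vecs) for doc_id, vecs in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Because queries and documents are encoded independently and only compared at scoring time, document embeddings can be computed once and stored, which is the usual motivation for late interaction over full cross-attention.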
