KinyaColBERT: A Lexically Grounded Retrieval Model for Low-Resource Retrieval-Augmented Generation

Antoine Nzeyimana; Andre Niyongabo Rubungo

arXiv:2507.03241 · cs.CL · July 8, 2025

TL;DR

KinyaColBERT is a novel retrieval model designed for low-resource languages, combining morphology-based tokenization and late word-level interactions to improve retrieval accuracy in retrieval-augmented generation systems.

Contribution

The paper introduces KinyaColBERT, a new retrieval model that enhances low-resource language retrieval through morphology-based tokenization and late interaction mechanisms.

Findings

1. KinyaColBERT outperforms existing baselines on Kinyarwanda retrieval tasks.

2. Morphology-based tokenization improves language coverage and retrieval accuracy.

3. The model offers a cost-effective solution for low-resource RAG applications.

Abstract

The recent mainstream adoption of large language model (LLM) technology is enabling novel applications in the form of chatbots and virtual assistants across many domains. With the aim of grounding LLMs in trusted domains and avoiding the problem of hallucinations, retrieval-augmented generation (RAG) has emerged as a viable solution. In order to deploy sustainable RAG systems in low-resource settings, achieving high retrieval accuracy is not only a usability requirement but also a cost-saving strategy. Through empirical evaluations on a Kinyarwanda-language dataset, we find that the most limiting factors in achieving high retrieval accuracy are limited language coverage and inadequate sub-word tokenization in pre-trained language models. We propose a new retriever model, KinyaColBERT, which integrates two key concepts: late word-level interactions between queries and documents, and a…
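
The late word-level interaction named in the abstract can be illustrated with a ColBERT-style MaxSim score. What follows is a minimal sketch under stated assumptions, not the authors' implementation. It assumes every query word and document word has already been encoded as a vector, and it assumes cosine similarity with a sum over query words, which the truncated abstract does not confirm. The function names `late_interaction_score` and `rank_documents` and the toy 2-dimensional vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def _normalize(vec: Vector) -> List[float]:
    """Scale a vector to unit length; an all-zero vector stays all-zero."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return [0.0] * len(vec)
    return [x / norm for x in vec]


def _dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def late_interaction_score(query_vecs: Sequence[Vector],
                           doc_vecs: Sequence[Vector]) -> float:
    """ColBERT-style MaxSim (assumed scoring rule, not taken from the paper).

    For each query word vector, take its best cosine similarity against
    all document word vectors, then sum those maxima over the query.
    """
    if not query_vecs or not doc_vecs:
        return 0.0
    q_unit = [_normalize(v) for v in query_vecs]
    d_unit = [_normalize(v) for v in doc_vecs]
    return sum(max(_dot(q, d) for d in d_unit) for q in q_unit)


def rank_documents(query_vecs: Sequence[Vector],
                   docs: Sequence[Sequence[Vector]]) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, best score first."""
    scored = [(i, late_interaction_score(query_vecs, d))
              for i, d in enumerate(docs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings: one vector per word.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # covers both query words
doc_b = [[1.0, 0.0], [1.0, 0.0]]              # covers only the first query word

ranking = rank_documents(query, [doc_a, doc_b])
```

In this toy setup `doc_a` ranks above `doc_b` because each query word finds a close match in it. Scoring at the word level like this is what makes the choice of tokenization matter: the units being matched are whatever the tokenizer produces.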
