A Comparative Study of Language Models for Khmer Retrieval-Augmented Question Answering
Sereiwathna Ros, Phannet Pov, Ratanaktepi Chhor, Kimleang Ly, Wan-Sup Cho, and Saksonita Khoeurn

TL;DR
This study evaluates retrieval-augmented question answering systems for Khmer by comparing several embedding models and generators. It finds that retriever choice significantly affects performance and that no single generator excels across all metrics.
Contribution
It provides the first comprehensive benchmark of RAG components for Khmer, highlighting the importance of retriever selection and analyzing generator performance.
Findings
The BGE-M3 embedding model outperforms the other two evaluated retrievers in Khmer document retrieval.
No single generator model dominates across all evaluation metrics.
Retriever choice is a key bottleneck in Khmer RAG systems.
Abstract
Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm for grounding large language model (LLM) outputs in retrieved evidence, thereby reducing hallucination and improving factual accuracy. Its efficacy, however, remains largely unexamined for low-resource, non-Latin-script languages such as Khmer. In this paper, we present a RAG-based question answering system for Khmer-language telecom-domain documents. We conduct a two-phase comparative evaluation. First, we benchmark three embedding models for dense retrieval over Khmer documents: BGE-M3 (567M), Jina-Embeddings-v3 (570M), and Qwen3-Embedding (597M). BGE-M3 consistently performs best, achieving a Hit Rate@3 of 0.285, File Hit Rate@3 of 0.700, MRR@3 of 0.221, and Precision@3 of 0.112, substantially outperforming the other retrievers. Second, using BGE-M3 as the selected retriever, we evaluate five generator…
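The retrieval metrics quoted in the abstract (Hit Rate@3, MRR@3, Precision@3) can be illustrated with a small sketch. This is not the authors' evaluation code: it uses the common textbook definitions, and the function names, the chunk IDs, and the toy data are all hypothetical. The paper's exact definitions (for example, how File Hit Rate@3 maps chunks to files) are not given in this excerpt, so that metric is only noted in a comment.

```python
from typing import Dict, List, Sequence, Set, Tuple


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 3) -> float:
    """1.0 if any of the top-k retrieved IDs is relevant, else 0.0."""
    return 1.0 if any(doc_id in relevant for doc_id in ranked[:k]) else 0.0


def reciprocal_rank_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 3) -> float:
    """1/rank of the first relevant ID within the top k, or 0.0 if none appears."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 3) -> float:
    """Fraction of the top-k slots filled by relevant IDs (denominator is k)."""
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / k


def evaluate_retrieval(
    runs: List[Tuple[Sequence[str], Set[str]]], k: int = 3
) -> Dict[str, float]:
    """Average each metric over all queries.

    Each run is (ranked retrieved IDs, set of relevant IDs) for one query.
    A file-level hit rate could be obtained the same way after replacing
    chunk IDs with their source-file IDs (an assumption, not the paper's
    stated procedure).
    """
    if not runs:
        raise ValueError("runs must contain at least one query")
    n = len(runs)
    return {
        "hit_rate": sum(hit_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank_at_k(r, rel, k) for r, rel in runs) / n,
        "precision": sum(precision_at_k(r, rel, k) for r, rel in runs) / n,
    }


if __name__ == "__main__":
    # Hypothetical toy data: two queries with made-up chunk IDs.
    toy_runs = [
        (["c1", "c2", "c3"], {"c2"}),  # relevant chunk found at rank 2
        (["c4", "c5", "c6"], {"c9"}),  # relevant chunk not retrieved
    ]
    print(evaluate_retrieval(toy_runs, k=3))
```

Under these definitions, a query contributes to Hit Rate@3 whenever at least one relevant item appears in the top three, while MRR@3 additionally rewards placing it earlier in the ranking.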
