Evaluation of Chunking Strategies for Effective Text Embedding in Low-Resource Language on Agricultural Documents
Sovandara Chhoun, Pichdara Po, Sereiwathna Ros, Wan-Sup Cho, Saksonita Khoeurn

TL;DR
This study compares four text chunking strategies for improving text embedding and retrieval in Khmer agricultural documents within a RAG framework, highlighting the effectiveness of recursive chunking.
Contribution
It introduces a systematic evaluation of chunking methods for low-resource languages, demonstrating the superiority of recursive chunking in this context.
Findings
Recursive chunking achieved the lowest L2 distance and highest relevance scores.
The improvement in L2 distance over sentence-based chunking is statistically significant.
Character-based recursive chunking outperforms other methods in retrieval metrics.
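As a rough illustration of what character-based recursive chunking involves, the sketch below splits text on progressively finer separators until every piece fits a character budget. The function name `recursive_chunk`, the separator list (including the Khmer sentence mark "។"), and the greedy merge policy are assumptions made for illustration and are not taken from the paper; only the 300-character chunk size appears in the abstract.

```python
def recursive_chunk(text, chunk_size=300, separators=("\n\n", "\n", "។", " ", "")):
    """Split text into pieces of at most chunk_size characters (code points).

    Hypothetical sketch: tries the coarsest separator first and recurses
    with finer ones only for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard split at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        # This separator does not occur; try the next finer one.
        return recursive_chunk(text, chunk_size, finer)

    chunks, current = [], ""
    for part in parts:
        # Greedily merge neighbouring parts while they fit the budget.
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # A single part is still too long: recurse with finer separators.
            chunks.extend(recursive_chunk(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "ក" * 250 + "។" + "ខ" * 250
print([len(c) for c in recursive_chunk(sample)])
```

Separators that fall on a chunk boundary are dropped in this sketch. Note also that Python's `len` counts code points rather than grapheme clusters, so the hard-split fallback can cut a Khmer cluster in two; a production chunker would likely need cluster-aware splitting.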
Abstract
In this study, we compare the performance of four text chunking approaches (Recursive, Khmer-Aware, Sentence-Based, and LLM-Based) within a Retrieval-Augmented Generation (RAG) framework applied to Khmer agricultural documents. The document chunks are encoded using the BGE-M3 multilingual embedding model and retrieved using the FAISS library. Performance is evaluated using four metrics: Average Retrieval Score (L2 distance), Answer Relevance, Khmer Coverage, and Khmer Intersection over Union (IoU), all measured against ground-truth question-answer pairs. For evaluation, we perform 5-fold cross-validation over 18 question-answer pairs. We observe the best performance for the character-based Recursive chunking method with a chunk size of 300 characters, achieving the lowest L2 distance (0.4295 ± 0.0461), highest Answer Relevance (0.8663 ± 0.0199), and highest Khmer IoU (0.6441 ± 0.0347). A…
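To make the evaluation setup in the abstract more concrete, here is a minimal sketch of three ingredients it mentions: an L2 distance between embedding vectors, an intersection-over-union score, and a 5-fold split of 18 question-answer pairs summarized as mean ± standard deviation. All function names are hypothetical, and several details are assumptions rather than the paper's definitions: IoU is computed over sets of already-segmented tokens, folds are contiguous and unshuffled, and the spread is the sample standard deviation. FAISS's flat L2 index reports squared distances; whether the paper's scores are squared is not stated here, so this sketch uses the plain Euclidean distance.

```python
import math
import statistics


def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def token_iou(predicted_tokens, reference_tokens):
    """Set-based IoU over tokens (assumed definition, tokens pre-segmented)."""
    predicted, reference = set(predicted_tokens), set(reference_tokens)
    union = predicted | reference
    return len(predicted & reference) / len(union) if union else 0.0


def kfold_indices(n_items, n_folds=5):
    """Contiguous folds; the first n_items % n_folds folds get one extra item."""
    base, extra = divmod(n_items, n_folds)
    folds, start = [], 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def mean_std(values):
    """Mean and sample standard deviation of per-fold scores."""
    return statistics.mean(values), statistics.stdev(values)


# 18 question-answer pairs do not divide evenly into 5 folds.
print([len(fold) for fold in kfold_indices(18)])
```

With 18 pairs and 5 folds, the folds hold 4, 4, 4, 3, and 3 pairs, so each per-fold score rests on very few questions and the reported spreads should be read with that in mind.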
