Evaluation of Chunking Strategies for Effective Text Embedding in Low-Resource Language on Agricultural Documents

Sovandara Chhoun; Pichdara Po; Sereiwathna Ros; Wan-Sup Cho; Saksonita Khoeurn

arXiv:2605.22203·cs.CL·May 22, 2026

TL;DR

This study compares four text chunking strategies for improving text embedding and retrieval in Khmer agricultural documents within a RAG framework, highlighting the effectiveness of recursive chunking.

Contribution

The paper presents a systematic evaluation of chunking methods for a low-resource language (Khmer) and reports that recursive chunking performs best in this setting.

Findings

01

Recursive chunking achieved the lowest L2 distance (0.4295 ± 0.0461) and the highest Answer Relevance (0.8663 ± 0.0199).

02

Recursive chunking showed a statistically significant improvement in L2 distance over sentence-based chunking.

03

Character-based recursive chunking with a chunk size of 300 characters gave the best L2 distance, Answer Relevance, and Khmer IoU among the four methods.
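
To make the idea of character-based recursive chunking concrete, the following is a minimal sketch, not the authors' implementation. It tries coarse separators first and falls back to finer ones only when a piece still exceeds the size limit. The separator list (paragraph break, line break, the Khmer full stop "។", space) and the function name `recursive_chunks` are assumptions made for illustration; the 300-character limit is the chunk size reported in the abstract. Length is counted in Unicode code points, which is also an assumption.

```python
from typing import List, Sequence

# Assumed separator hierarchy (not taken from the paper): paragraph break,
# line break, the Khmer full stop "។" (khan), then a space.
SEPARATORS = ("\n\n", "\n", "។", " ")


def recursive_chunks(text: str, chunk_size: int = 300,
                     separators: Sequence[str] = SEPARATORS) -> List[str]:
    """Split text into chunks of at most chunk_size code points."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: hard cut every chunk_size code points.
        # This can split a Khmer character cluster; a real system may avoid that.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_chunks(text, chunk_size, rest)

    chunks: List[str] = []
    buffer = ""
    for piece in text.split(sep):
        # Greedily merge neighbouring pieces while they still fit.
        candidate = buffer + sep + piece if buffer else piece
        if len(candidate) <= chunk_size:
            buffer = candidate
            continue
        if buffer.strip():
            chunks.append(buffer)
        if len(piece) > chunk_size:
            # The piece alone is too long: recurse with finer separators.
            chunks.extend(recursive_chunks(piece, chunk_size, rest))
            buffer = ""
        else:
            buffer = piece
    if buffer.strip():
        chunks.append(buffer)
    return chunks
```

In this sketch the separator at a chunk boundary is dropped, and no overlap between neighbouring chunks is used; whether the paper uses overlap is not stated in the abstract.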

Abstract

In this study, we compare the performance of four text chunking approaches (Recursive, Khmer-Aware, Sentence-Based, and LLM-Based) within a Retrieval-Augmented Generation (RAG) framework applied to Khmer agricultural documents. The document chunks are encoded using the BGE-M3 multilingual embedding model and retrieved using the FAISS library. Performance is evaluated using four metrics: Average Retrieval Score (L2 distance), Answer Relevance, Khmer Coverage, and Khmer Intersection over Union, all measured against ground-truth question-answer pairs. For evaluation, we perform 5-fold cross-validation over 18 question-answer pairs. We observe the best performance for the character-based Recursive chunking method with a chunk size of 300 characters, achieving the lowest L2 distance (0.4295 ± 0.0461), highest Answer Relevance (0.8663 ± 0.0199), and highest Khmer IoU (0.6441 ± 0.0347). A…
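
The evaluation described in the abstract can be illustrated with a small standard-library sketch. It is a stand-in under stated assumptions, not the paper's pipeline: brute-force nearest-neighbour search replaces FAISS, the vectors are toy values rather than BGE-M3 embeddings, and the IoU is computed over generic sets because the abstract does not say which units (characters, words, or tokens) Khmer IoU compares. The helper names `l2_distance`, `retrieve`, `iou`, and `k_fold_indices` are hypothetical.

```python
import math
from typing import List, Sequence, Set, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve(query: Sequence[float],
             chunk_vectors: Sequence[Sequence[float]],
             k: int = 1) -> List[Tuple[int, float]]:
    """Return (chunk index, distance) for the k closest chunks; lower is better."""
    scored = [(i, l2_distance(query, v)) for i, v in enumerate(chunk_vectors)]
    return sorted(scored, key=lambda pair: pair[1])[:k]


def iou(retrieved: Set[str], reference: Set[str]) -> float:
    """Intersection over union of two sets; 0.0 when both are empty."""
    union = retrieved | reference
    return len(retrieved & reference) / len(union) if union else 0.0


def k_fold_indices(n: int, k: int = 5) -> List[List[int]]:
    """Partition indices 0..n-1 into k contiguous folds of near-equal size."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for f in range(k):
        size = base + (1 if f < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


# 18 question-answer pairs split into 5 folds, as in the abstract.
folds = k_fold_indices(18, k=5)
fold_sizes = [len(fold) for fold in folds]
```

With 18 pairs and 5 folds, each fold holds only three or four questions, which is worth keeping in mind when reading the reported standard deviations. The sketch does not shuffle before splitting; whether the paper shuffles is not stated in the abstract.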
