A Comparative Analysis of Retrieval-Augmented Generation Techniques for Bengali Standard-to-Dialect Machine Translation Using LLMs
K. M. Jubair Sami, Dipto Sumit, Ariyan Hossain, Farig Sadeque

TL;DR
This paper compares two retrieval-augmented generation techniques for translating Bengali from standard to dialect forms, demonstrating that a structured sentence-pair approach significantly improves accuracy and enables smaller models to outperform larger ones.
Contribution
Introduces and evaluates two novel RAG pipelines for Bengali dialect translation, highlighting the effectiveness of structured sentence pairs over transcript-based methods.
Findings
The sentence-pair pipeline reduces Word Error Rate (WER) from 76% to 55% for the Chittagong dialect.
Smaller models with retrieval outperform larger models without retrieval.
Proposes a practical, fine-tuning-free solution for low-resource dialect translation.
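For readers unfamiliar with the metric behind the first finding, WER is conventionally the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. The sketch below shows that standard definition only; the whitespace tokenization and the function name `word_error_rate` are assumptions made for illustration, and the paper's actual evaluation code may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length.

    Assumption: words are separated by whitespace. Real evaluations may
    apply extra normalization (punctuation, Unicode forms) first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition a drop from 76% to 55% means roughly one fewer word-level edit per five reference words. Note that WER can exceed 100% when the hypothesis is much longer than the reference.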
Abstract
Translating from a standard language to its regional dialects is a significant NLP challenge due to scarce data and linguistic variation, a problem prominent in the Bengali language. This paper proposes and compares two novel RAG pipelines for standard-to-dialectal Bengali translation. The first, a Transcript-Based Pipeline, uses large dialect sentence contexts from audio transcripts. The second, a more effective Standardized Sentence-Pairs Pipeline, utilizes structured local_dialect:standard_bengali sentence pairs. We evaluated both pipelines across six Bengali dialects and multiple LLMs using BLEU, ChrF, WER, and BERTScore. Our findings show that the sentence-pair pipeline consistently outperforms the transcript-based one, reducing Word Error Rate (WER) from 76% to 55% for the Chittagong dialect. Critically, this RAG approach enables smaller models (e.g., Llama-3.1-8B) to…
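The sentence-pair idea in the abstract can be pictured as: retrieve the stored pairs whose standard-Bengali side is closest to the input, then show them to the LLM as examples. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation. The character-trigram Jaccard retriever is a stand-in (the abstract does not say how retrieval is done), the prompt wording is invented, and the names `retrieve_pairs` and `build_prompt` are hypothetical.

```python
from typing import List, Set, Tuple

# Each stored example is (local_dialect, standard_bengali), mirroring the
# local_dialect:standard_bengali pair layout named in the abstract.
Pair = Tuple[str, str]


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Character n-grams after collapsing whitespace (assumed retriever feature)."""
    text = " ".join(text.split())
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Set overlap in [0, 1]; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve_pairs(query: str, pairs: List[Pair], k: int = 2) -> List[Pair]:
    """Return the k pairs whose standard side is most similar to the query."""
    query_grams = char_ngrams(query)
    ranked = sorted(
        pairs,
        key=lambda pair: jaccard(query_grams, char_ngrams(pair[1])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, retrieved: List[Pair], dialect: str) -> str:
    """Assemble a few-shot prompt; the wording here is purely illustrative."""
    lines = [f"Translate standard Bengali into the {dialect} dialect.", "Examples:"]
    for dialect_text, standard_text in retrieved:
        lines.append(f"standard_bengali: {standard_text}")
        lines.append(f"local_dialect: {dialect_text}")
    lines.append(f"standard_bengali: {query}")
    lines.append("local_dialect:")
    return "\n".join(lines)
```

A real system would more likely use an embedding-based retriever and send the resulting prompt to a model such as Llama-3.1-8B; neither step is shown here because the abstract does not specify them.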
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
