Benchmarking Bengali Dialectal Bias: A Multi-Stage Framework Integrating RAG-Based Translation and Human-Augmented RLAIF

K. M. Jubair Sami; Dipto Sumit; Ariyan Hossain; Farig Sadeque

arXiv:2603.21359 · cs.CL · March 24, 2026

Open Access

TL;DR

This paper introduces a framework for evaluating dialectal bias in LLM question answering across nine Bengali dialects. It combines an LLM-based assessment of translation quality, a benchmark of 4,000 question sets, and a bias sensitivity metric, and it reveals significant performance disparities across dialects.

Contribution

The paper presents a multi-stage evaluation framework that integrates RAG-based translation with human-augmented RLAIF, together with a new bias sensitivity metric and a benchmark dataset for Bengali dialects.

Findings

01. Significant performance drops in dialectal question-answering accuracy.

02. Traditional translation metrics are ineffective for dialects; LLM-based evaluation correlates better with human judgment.

03. Model scale does not consistently reduce dialectal bias.
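
The second finding rests on comparing how well each automatic score tracks human ratings. The listing does not say which correlation statistic the authors used, so the following is only a minimal sketch that assumes Spearman rank correlation. The function names and the toy numbers are hypothetical and are not taken from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with order[i].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences of length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Toy, made-up scores for five translations (NOT data from the paper).
human_ratings = [1, 2, 3, 4, 5]
llm_judge_scores = [1.5, 2.0, 3.5, 4.0, 4.5]
legacy_metric_scores = [0.30, 0.10, 0.25, 0.20, 0.15]

judge_rho = spearman(human_ratings, llm_judge_scores)
legacy_rho = spearman(human_ratings, legacy_metric_scores)
```

Under this setup, an evaluator would be preferred when its correlation with the human ratings is higher. In the toy data the judge scores follow the human ordering exactly, while the legacy scores do not.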

Abstract

Large language models (LLMs) frequently exhibit performance biases against regional dialects of low-resource languages. However, frameworks to quantify these disparities remain scarce. We propose a two-phase framework to evaluate dialectal bias in LLM question-answering across nine Bengali dialects. First, we use a retrieval-augmented generation (RAG) pipeline to translate standard Bengali questions into dialectal variants and gold-label them, preparing 4,000 question sets. Since traditional translation quality metrics fail on unstandardized dialects, we evaluate fidelity using an LLM-as-a-judge, which correlates better with human judgment than legacy metrics. Second, we benchmark 19 LLMs across these gold-labeled sets, running 68,395 RLAIF evaluations validated through multi-judge agreement and human fallback. Our findings reveal severe performance drops linked to linguistic…
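
The abstract mentions evaluations "validated through multi-judge agreement and human fallback" without giving the rule. One way such a step could look is sketched below, assuming each judge returns a categorical verdict and that any item falling below an agreement threshold is routed to a human. The function name `resolve_verdict`, the `min_agreement` parameter, and the label strings are hypothetical and do not describe the authors' implementation.

```python
from collections import Counter


def resolve_verdict(judge_votes, min_agreement=1.0):
    """Return (verdict, needs_human) for one evaluated answer.

    judge_votes: list of labels, one per LLM judge (hypothetical format).
    min_agreement: fraction of judges that must share the top label;
        1.0 means unanimity is required (an assumed default).
    """
    if not judge_votes:
        raise ValueError("at least one judge vote is required")
    label, count = Counter(judge_votes).most_common(1)[0]
    agreement = count / len(judge_votes)
    if agreement >= min_agreement:
        return label, False  # judges agree enough; accept automatically
    return None, True  # disagreement; defer to a human annotator


unanimous = resolve_verdict(["correct", "correct", "correct"])
split = resolve_verdict(["correct", "incorrect", "correct"])
relaxed = resolve_verdict(["correct", "incorrect", "correct"], min_agreement=0.6)
```

With the assumed unanimity default, the split vote is flagged for human review, while a relaxed threshold of 0.6 accepts the majority label. Where the real threshold sits is a design choice the listing does not specify.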

Taxonomy

Topics: Natural Language Processing Techniques · Linguistic Variation and Morphology · Authorship Attribution and Profiling