BETA-Labeling for Multilingual Dataset Construction in Low-Resource IR
Md. Najib Hasan, Mst. Jannatun Ferdous Rain, Fyad Mohammed, and Nazmul Siddique

TL;DR
This paper introduces BETA-labeling, a framework that uses multiple LLM annotators to build a high-quality Bangla information retrieval (IR) dataset. It also investigates the challenges of reusing datasets from other low-resource languages through machine translation, highlighting biases and semantic inconsistencies.
Contribution
It presents a novel LLM-based labeling framework for low-resource IR datasets and evaluates the feasibility and limitations of cross-lingual dataset reuse via machine translation.
Findings
BETA-labeling improves dataset quality through multi-annotator agreement and human verification.
Cross-lingual reuse of IR datasets shows significant semantic and bias-related variations.
LLM-based translation impacts meaning preservation, affecting dataset reliability.
Abstract
IR in low-resource languages remains limited by the scarcity of high-quality, task-specific annotated datasets. Manual annotation is expensive and difficult to scale, while using large language models (LLMs) as automated annotators introduces concerns about label reliability, bias, and evaluation validity. This work presents a Bangla IR dataset constructed using a BETA-labeling framework involving multiple LLM annotators from diverse model families. The framework incorporates contextual alignment, consistency checks, and majority agreement, followed by human evaluation to verify label quality. Beyond dataset creation, we examine whether IR datasets from other low-resource languages can be effectively reused through one-hop machine translation. Using LLM-based translation across multiple language pairs, we ran experiments on meaning preservation and task validity between source and…
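The majority-agreement step mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the label set, the number of annotators, or the agreement threshold, so the function name `aggregate_labels`, the string labels, and the strict-majority rule below are all assumptions made for illustration. Items without a strict majority are flagged for human review, mirroring the human verification stage the abstract describes.

```python
from collections import Counter


def aggregate_labels(labels):
    """Combine labels from several LLM annotators for one query-document pair.

    Hypothetical helper (not from the paper). Returns a tuple
    (label, agreement, needs_review):
      - label: the strict-majority label, or None if there is none
      - agreement: fraction of annotators that chose the most common label
      - needs_review: True when no label has a strict majority
    """
    if not labels:
        raise ValueError("need at least one annotator label")

    counts = Counter(labels)
    top_label, top_count = counts.most_common(1)[0]
    agreement = top_count / len(labels)

    # Assumed rule: accept a label only if more than half of annotators agree.
    if top_count * 2 > len(labels):
        return top_label, agreement, False

    # No strict majority: leave the label undecided and route to a human.
    return None, agreement, True


if __name__ == "__main__":
    # Three hypothetical annotators, two of which agree.
    print(aggregate_labels(["relevant", "relevant", "not_relevant"]))
    # An even split, which this sketch sends to human review.
    print(aggregate_labels(["relevant", "not_relevant"]))
```

A strict majority is only one possible rule; a real pipeline might require unanimity or weight annotators differently, and the abstract does not say which the framework uses.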
Taxonomy
Topics: Natural Language Processing Techniques · Machine Learning and Data Classification · Mobile Crowdsensing and Crowdsourcing
