BETA-Labeling for Multilingual Dataset Construction in Low-Resource IR

Md. Najib Hasan, Mst. Jannatun Ferdous Rain, Fyad Mohammed, and Nazmul Siddique

arXiv:2602.14488 · cs.CL · February 24, 2026

Open Access

TL;DR

This paper introduces a BETA-labeling framework using multiple LLMs for creating a high-quality Bangla IR dataset and investigates the challenges of reusing low-resource language datasets through machine translation, highlighting biases and semantic inconsistencies.

Contribution

It presents a novel LLM-based labeling framework for low-resource IR datasets and evaluates the feasibility and limitations of cross-lingual dataset reuse via machine translation.

Findings

01. BETA-labeling improves dataset quality through multi-annotator agreement and human verification.

02. Cross-lingual reuse of IR datasets shows significant semantic and bias-related variations.

03. LLM-based translation impacts meaning preservation, affecting dataset reliability.

Abstract

Information retrieval (IR) in low-resource languages remains limited by the scarcity of high-quality, task-specific annotated datasets. Manual annotation is expensive and difficult to scale, while using large language models (LLMs) as automated annotators raises concerns about label reliability, bias, and evaluation validity. This work presents a Bangla IR dataset constructed using a BETA-labeling framework involving multiple LLM annotators from diverse model families. The framework incorporates contextual alignment, consistency checks, and majority agreement, followed by human evaluation to verify label quality. Beyond dataset creation, we examine whether IR datasets from other low-resource languages can be effectively reused through one-hop machine translation. Using LLM-based translation across multiple language pairs, we evaluate meaning preservation and task validity between source and…
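The majority-agreement step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `aggregate_labels`, the label strings, and the strict-majority threshold are all assumptions made for illustration, and the abstract does not specify how ties or weak agreement are actually handled. The sketch takes one label per LLM annotator for a single query-document pair, returns the majority label with its agreement ratio, and flags the item for human review when no strict majority exists.

```python
from collections import Counter
from typing import Optional


def aggregate_labels(
    labels: list[str], min_agreement: float = 0.5
) -> tuple[Optional[str], float, bool]:
    """Hypothetical helper: combine labels from several annotators.

    Returns (majority_label, agreement_ratio, needs_human_review).
    The label is None when there is a tie or when the top label's share
    does not exceed `min_agreement` (an assumed threshold).
    """
    if not labels:
        raise ValueError("at least one annotator label is required")

    ranked = Counter(labels).most_common()
    top_label, top_count = ranked[0]
    agreement = top_count / len(labels)

    # A tie means two or more labels share the highest vote count.
    tie = len(ranked) > 1 and ranked[1][1] == top_count

    if tie or agreement <= min_agreement:
        return None, agreement, True  # defer to a human annotator
    return top_label, agreement, False


if __name__ == "__main__":
    # Three hypothetical annotators, two of which agree.
    print(aggregate_labels(["relevant", "relevant", "not_relevant"]))
    # Full disagreement: no label is accepted automatically.
    print(aggregate_labels(["relevant", "not_relevant", "partial"]))
```

Routing tied or weakly agreed items to a person mirrors the abstract's description of human evaluation following automated agreement, though the actual routing criteria used in the paper may differ.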

Taxonomy

Topics: Natural Language Processing Techniques · Machine Learning and Data Classification · Mobile Crowdsensing and Crowdsourcing