BiST: A Gold Standard Bangla-English Bilingual Corpus for Sentence Structure and Tense Classification with Inter-Annotator Agreement

Abdullah Al Shafi; Swapnil Kundu Argha; M. A. Moyeen; Abdul Muntakim; Shoumik Barman Polok

arXiv:2604.04708·cs.CL·April 7, 2026

TL;DR

BiST is a high-quality, annotated Bangla-English corpus designed to advance multilingual NLP tasks by providing reliable sentence structure and tense labels with strong inter-annotator agreement.

Contribution

This work introduces a rigorously curated bilingual corpus with high-quality annotations and demonstrates its utility for grammatical modeling and multilingual research.

Findings

01

Achieved high inter-annotator agreement, with dimension-wise Fleiss' kappa values of 0.82 and 0.88.

02

Corpus contains 30,534 sentences from encyclopedic and conversational sources.

03

Baseline models using dual-encoder architectures outperform multilingual encoders.

Abstract

High-quality bilingual resources remain a critical bottleneck for advancing multilingual NLP in low-resource settings, particularly for Bangla. To mitigate this gap, we introduce BiST, a rigorously curated Bangla-English corpus for sentence-level grammatical classification, annotated across two fundamental dimensions: syntactic structure (Simple, Complex, Compound, Complex-Compound) and tense (Present, Past, Future). The corpus is compiled from open-licensed encyclopedic sources and naturally composed conversational text, followed by systematic preprocessing and automated language identification, resulting in 30,534 sentences, including 17,465 English and 13,069 Bangla instances. Annotation quality is ensured through a multi-stage framework with three independent annotators and dimension-wise Fleiss' kappa ($κ$) agreement, yielding reliable and reproducible labels with $κ$ …
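
For readers unfamiliar with the agreement statistic named in the abstract, the following is a minimal sketch of the standard Fleiss' kappa computation for a fixed number of annotators per item. It is not the authors' tooling, which the abstract does not describe. The function name `fleiss_kappa`, the `TENSES` list, and the toy labels in `toy_tense_labels` are illustrative assumptions and are not drawn from the BiST corpus.

```python
from collections import Counter

# Tense label set as listed in the abstract.
TENSES = ["Present", "Past", "Future"]


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items that were each labeled by the same number of raters.

    ratings: list of per-item label lists, e.g. [["Past", "Past", "Present"], ...]
    categories: the full set of labels raters could choose from
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    label_totals = Counter()   # how often each label was used overall
    per_item_agreement = 0.0   # running sum of per-item agreement P_i
    for item in ratings:
        counts = Counter(item)
        label_totals.update(counts)
        # P_i: fraction of rater pairs that agree on this item
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        per_item_agreement += agreeing_pairs / (n_raters * (n_raters - 1))

    observed = per_item_agreement / n_items
    # Chance agreement: sum of squared overall label proportions
    expected = sum(
        (label_totals[c] / (n_items * n_raters)) ** 2 for c in categories
    )
    if expected == 1.0:
        # Only one label was ever used; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data: four sentences, three annotators each.
toy_tense_labels = [
    ["Past", "Past", "Past"],
    ["Present", "Present", "Present"],
    ["Future", "Future", "Past"],
    ["Present", "Past", "Present"],
]
toy_kappa = fleiss_kappa(toy_tense_labels, TENSES)
```

Computing the statistic separately for the structure labels and the tense labels would give one value per dimension, which is what "dimension-wise" agreement refers to.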
