BiST: A Gold Standard Bangla-English Bilingual Corpus for Sentence Structure and Tense Classification with Inter-Annotator Agreement
Abdullah Al Shafi, Swapnil Kundu Argha, M. A. Moyeen, Abdul Muntakim, Shoumik Barman Polok

TL;DR
BiST is a high-quality, annotated Bangla-English corpus designed to advance multilingual NLP tasks by providing reliable sentence structure and tense labels with strong inter-annotator agreement.
Contribution
This work introduces a rigorously curated bilingual corpus with high-quality annotations and demonstrates its utility for grammatical modeling and multilingual research.
Findings
Inter-annotator agreement is high, with Fleiss' kappa values of 0.82 and 0.88 for the two annotation dimensions.
Corpus contains 30,534 sentences from encyclopedic and conversational sources.
Baseline models using dual-encoder architectures outperform multilingual encoders.
Abstract
High-quality bilingual resources remain a critical bottleneck for advancing multilingual NLP in low-resource settings, particularly for Bangla. To mitigate this gap, we introduce BiST, a rigorously curated Bangla-English corpus for sentence-level grammatical classification, annotated across two fundamental dimensions: syntactic structure (Simple, Complex, Compound, Complex-Compound) and tense (Present, Past, Future). The corpus is compiled from open-licensed encyclopedic sources and naturally composed conversational text, followed by systematic preprocessing and automated language identification, resulting in 30,534 sentences, including 17,465 English and 13,069 Bangla instances. Annotation quality is ensured through a multi-stage framework with three independent annotators and dimension-wise Fleiss' kappa (κ) agreement, yielding reliable and reproducible labels with …
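For readers unfamiliar with the agreement statistic named in the abstract, the following is a minimal sketch of how Fleiss' kappa could be computed for one annotation dimension when every sentence is labeled by the same number of annotators. The function names (`labels_to_counts`, `fleiss_kappa`) and the toy tense labels are illustrative assumptions; they are not the authors' code or data.

```python
from typing import List, Sequence


def labels_to_counts(labels: Sequence[Sequence[str]],
                     categories: Sequence[str]) -> List[List[int]]:
    """Turn per-item annotator labels into per-item category counts."""
    return [[row.count(c) for c in categories] for row in labels]


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for an items-by-categories count matrix.

    Each row must sum to the same number of raters n (n >= 2).
    """
    num_items = len(counts)
    n = sum(counts[0])
    if num_items == 0 or n < 2:
        raise ValueError("need at least one item and two raters")
    if any(sum(row) != n for row in counts):
        raise ValueError("every item must have the same number of ratings")

    num_categories = len(counts[0])
    # Proportion of all ratings that fall in each category.
    p_j = [
        sum(row[j] for row in counts) / (num_items * n)
        for j in range(num_categories)
    ]
    # Observed agreement per item, then averaged over items.
    p_i = [
        (sum(c * c for c in row) - n) / (n * (n - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / num_items
    # Agreement expected by chance.
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1.0 - p_e)


# Toy example (hypothetical labels): 4 sentences, 3 annotators, tense dimension.
tense_categories = ["Present", "Past", "Future"]
toy_labels = [
    ["Past", "Past", "Past"],
    ["Present", "Present", "Future"],
    ["Future", "Future", "Future"],
    ["Present", "Present", "Present"],
]
toy_kappa = fleiss_kappa(labels_to_counts(toy_labels, tense_categories))
```

Running the same function separately on the structure labels and the tense labels would give one kappa per dimension, which is one plausible reading of "dimension-wise" agreement.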
