BanglaSTEM: A Parallel Corpus for Technical Domain Bangla-English Translation
Kazi Reyazul Hasan, Mubasshira Musarrat, A. B. M. Alim Al Islam, Muhammad Abdullah Adnan

TL;DR
BanglaSTEM is a curated dataset of 5,000 high-quality Bangla-English sentence pairs from STEM fields. It is intended to improve translation of technical content so that Bangla speakers can make better use of language models that perform best in English.
Contribution
The paper introduces BanglaSTEM, a specialized dataset, and a T5-based translation model trained on it, which together significantly improve the accuracy of technical Bangla-English translation.
Findings
Improved translation accuracy for technical terms.
Effective translation of STEM content in Bangla-English.
Public release of dataset and model for research use.
Abstract
Large language models work well for technical problem solving in English but perform poorly when the same questions are asked in Bangla. A simple solution would be to translate Bangla questions into English first and then use these models. However, existing Bangla-English translation systems struggle with technical terms. They often mistranslate specialized vocabulary, which changes the meaning of the problem and leads to wrong answers. We present BanglaSTEM, a dataset of 5,000 carefully selected Bangla-English sentence pairs from STEM fields including computer science, mathematics, physics, chemistry, and biology. We generated over 12,000 translations using language models and then used human evaluators to select the highest quality pairs that preserve technical terminology correctly. We train a T5-based translation model on BanglaSTEM and test it on two tasks: generating code and…
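The selection step described in the abstract (many model-generated candidates, with human evaluators keeping only the best pairs) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Candidate` fields, the numeric rating scale, the `terms_ok` flag, and the `min_rating` threshold are hypothetical, since the abstract does not say how evaluators scored or filtered translations. The Bangla sources are shown as placeholder IDs rather than real sentences.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    source_bn: str   # Bangla source sentence (placeholder IDs in this sketch)
    target_en: str   # model-generated English translation
    rating: float    # hypothetical human quality score, higher is better
    terms_ok: bool   # hypothetical flag: evaluator judged technical terms preserved


def select_best_pairs(candidates, min_rating=4.0):
    """For each source sentence, keep the top-rated candidate that passes both checks."""
    best = {}
    for cand in candidates:
        # Discard candidates with mistranslated terminology or a low rating.
        if not cand.terms_ok or cand.rating < min_rating:
            continue
        current = best.get(cand.source_bn)
        if current is None or cand.rating > current.rating:
            best[cand.source_bn] = cand
    return [(c.source_bn, c.target_en) for c in best.values()]


# Illustrative data only; not taken from the dataset.
candidates = [
    Candidate("bn-1", "Find the derivative of the function.", 4.5, True),
    Candidate("bn-1", "Find the derived of the function.", 4.8, False),
    Candidate("bn-1", "Find the function's derivative.", 4.0, True),
    Candidate("bn-2", "Sort the array in ascending order.", 3.0, True),
]
pairs = select_best_pairs(candidates)
```

In this sketch the second candidate is rejected despite its high rating because its terminology flag is false, and `bn-2` yields no pair because its only candidate falls below the threshold. This mirrors how a pool of candidates could shrink to a smaller curated set, though the authors' actual criteria may differ.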
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling
