Too Late to Train, Too Early To Use? A Study on Necessity and Viability of Low-Resource Bengali LLMs

Tamzeed Mahfuz; Satak Kumar Dey; Ruwad Naswan; Hasnaen Adil; Khondker Salman Sayeed; Haz Sameen Shahgir

arXiv:2407.00416 · cs.CL · December 16, 2024 · 1 citation

Open Access

TL;DR

This study evaluates the necessity and viability of dedicated Bengali LLMs by comparing existing models on translation, summarization, paraphrasing, question-answering, and natural language inference, and it highlights challenges such as tokenization inefficiencies and dataset biases.

Contribution

It provides a comprehensive comparison of open-weight and closed-source LLMs on Bengali tasks and discusses the current limitations hindering the development of effective Bengali-specific LLMs.

Findings

01 · LLMs perform variably on Bengali script generation tasks

02 · Tokenization inefficiencies increase computational costs

03 · Biases in machine-translated datasets affect model performance
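
One contributing factor to the tokenization finding can be illustrated without any tokenizer at all. The sketch below is not the paper's measurement; it only assumes a byte-level tokenizer, which starts from UTF-8 bytes before any merges are learned. Bengali code points (Unicode block U+0980 to U+09FF) each take three bytes in UTF-8, while ASCII letters take one, so Bengali text starts from roughly three times as many base units per character. The helper name `utf8_bytes_per_char` and the sample words are hypothetical choices made for illustration.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Return the average number of UTF-8 bytes per Unicode code point."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(text.encode("utf-8")) / len(text)


# Illustrative samples (assumption: not taken from the paper's datasets).
english = "language"
bengali = "ভাষা"  # Bengali word for "language", 4 code points

print(utf8_bytes_per_char(english))  # 1.0
print(utf8_bytes_per_char(bengali))  # 3.0
```

How much of this gap survives into final token counts depends on the merges a given tokenizer has learned, so the ratio above is a starting point for intuition rather than a token-count estimate.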

Abstract

Each new generation of English-oriented Large Language Models (LLMs) exhibits enhanced cross-lingual transfer capabilities and significantly outperforms older LLMs on low-resource languages. This prompts the question: Is there a need for LLMs dedicated to a particular low-resource language? We aim to explore this question for Bengali, a low-to-moderate resource Indo-Aryan language native to the Bengal region of South Asia. We compare the performance of open-weight and closed-source LLMs such as LLaMA-3 and GPT-4 against fine-tuned encoder-decoder models across a diverse set of Bengali downstream tasks, including translation, summarization, paraphrasing, question-answering, and natural language inference. Our findings reveal that while LLMs generally excel in reasoning tasks, their performance in tasks requiring Bengali script generation is inconsistent. Key challenges include…

Taxonomy

Topics: Digital Rights Management and Security · Electricity Theft Detection Techniques · Open Education and E-Learning

Methods: Attention Is All You Need · Sparse Evolutionary Training · Linear Layer · Multi-Head Attention · Softmax · Byte Pair Encoding · Layer Normalization · Label Smoothing · Absolute Position Encodings · Position-Wise Feed-Forward Layer