Enhancing Assamese NLP Capabilities: Introducing a Centralized Dataset Repository

S. Tamang; D. J. Bora

arXiv:2410.11291·cs.CL·October 17, 2024

TL;DR

This paper presents a centralized, open-source dataset repository for Assamese NLP, aiming to improve language processing tasks and foster research despite resource scarcity.

Contribution

It introduces a comprehensive dataset repository for Assamese NLP, supporting multiple tasks and encouraging collaboration in low-resource language research.

Findings

01. The repository supports sentiment analysis, named entity recognition (NER), and machine translation tasks.

02. It facilitates AI applications for Assamese such as LLMs, OCR, and chatbots.

03. It highlights the need for standardized datasets in low-resource languages.

Abstract

This paper introduces a centralized, open-source dataset repository designed to advance natural language processing (NLP) and neural machine translation (NMT) for Assamese, a low-resource language. The repository, available on GitHub, supports tasks such as sentiment analysis, named entity recognition, and machine translation by providing both pre-training and fine-tuning corpora. We review existing datasets, highlighting the need for standardized resources in Assamese NLP, and discuss potential applications in AI-driven research, such as LLMs, OCR, and chatbots. Although the repository is promising, challenges such as data scarcity and linguistic diversity remain. The repository aims to foster collaboration and innovation, promoting Assamese language research in the digital age.

Code & Models

Repositories

indian-nlp/assamese-dataset (official)

Taxonomy

Topics: Natural Language Processing Techniques