Integrating Linguistics and AI: Morphological Analysis and Corpus development of Endangered Toto Language of West Bengal

Ambalika Guha; Sajal Saha; Debanjan Ballav; Soumi Mitra; Hritwick Chakraborty

arXiv:2510.22629·cs.CL·October 28, 2025

TL;DR

This paper presents a comprehensive approach combining linguistic analysis and AI technology to document, standardize, and develop digital tools for the endangered Toto language, aiding its preservation and revitalization.

Contribution

The paper introduces a morpheme-tagged trilingual (Toto-Bangla-English) corpus and uses it to train a Small Language Model (SLM) and a Transformer-based translation engine, pairing linguistic documentation with digital tools for language preservation.

Findings

01 Created a morpheme-tagged trilingual corpus

02 Developed a transformer-based translation engine

03 Enhanced script standardization and digital literacy

Abstract

Preserving linguistic diversity is necessary as every language offers a distinct perspective on the world. There have been numerous global initiatives to preserve endangered languages through documentation. This paper is a part of a project which aims to develop a trilingual (Toto-Bangla-English) language learning application to digitally archive and promote the endangered Toto language of West Bengal, India. This application, designed for both native Toto speakers and non-native learners, aims to revitalize the language by ensuring accessibility and usability through Unicode script integration and a structured language corpus. The research includes detailed linguistic documentation collected via fieldwork, followed by the creation of a morpheme-tagged, trilingual corpus used to train a Small Language Model (SLM) and a Transformer-based translation engine. The analysis covers…
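To make the idea of a morpheme-tagged, trilingual corpus more concrete, the following is a minimal sketch of how one entry might be represented. It is not the authors' schema: the class names `Morpheme` and `CorpusEntry`, the field names, the gloss labels, and the placeholder strings are all assumptions made for illustration, and no real Toto or Bangla data is shown.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical language keys; the source only says the corpus is Toto-Bangla-English.
LANGUAGES = ("toto", "bangla", "english")


@dataclass
class Morpheme:
    form: str   # surface form of the morpheme (placeholder text in this sketch)
    gloss: str  # morphological tag or meaning label (hypothetical tag set)


@dataclass
class CorpusEntry:
    toto: str
    bangla: str
    english: str
    morphemes: List[Morpheme] = field(default_factory=list)

    def gloss_line(self) -> str:
        """Join the morpheme glosses with hyphens, interlinear-gloss style."""
        return "-".join(m.gloss for m in self.morphemes)

    def translation_pair(self, source: str, target: str) -> Tuple[str, str]:
        """Return a (source, target) text pair, e.g. for translation training."""
        for lang in (source, target):
            if lang not in LANGUAGES:
                raise ValueError(f"unknown language: {lang}")
        return getattr(self, source), getattr(self, target)


# Placeholder entry: angle-bracket strings stand in for real corpus text.
entry = CorpusEntry(
    toto="<toto-word>",
    bangla="<bangla-word>",
    english="houses",
    morphemes=[Morpheme("<root>", "house"), Morpheme("<suffix>", "PL")],
)

print(entry.gloss_line())
print(entry.translation_pair("toto", "english"))
```

Keeping the three translations and the morpheme segmentation in one record means the same entry could in principle feed both a learner-facing display and a parallel-text training set, though how the project actually stores its data is not stated in the abstract.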
