Performance Evaluation of Tokenizers in Large Language Models for the Assamese Language

Sagar Tamang; Dibya Jyoti Bora

arXiv:2410.03718·cs.CL·April 8, 2025

TL;DR

This study evaluates the tokenizers of five large language models on Assamese, a low-resource language, and finds that the SUTRA tokenizer from Two AI performs best, with the lowest average Normalized Sequence Length (NSL) of 0.45.

Contribution

The paper provides a comparative analysis of how the tokenizers of five state-of-the-art LLMs handle Assamese text, emphasizing the importance of tokenizer choice for low-resource language support.

Findings

01. The SUTRA tokenizer from Two AI performs best, with an average NSL of 0.45.

02. The GPT-4o tokenizer from OpenAI follows closely, with an average NSL of 0.54.

03. The remaining tokenizers have higher average NSL values: Gemma 2 (0.82), Meta Llama 3.1 (1.4), and Mistral Large Instruct 2407 (1.48).

Abstract

Training of a tokenizer plays an important role in the performance of deep learning models. This research aims to understand the performance of the tokenizers of five state-of-the-art (SOTA) large language models (LLMs) on the Assamese language of India. The research is important for understanding multilingual support for a low-resource language such as Assamese. Our research reveals that the tokenizer of SUTRA from Two AI performs the best, with an average Normalized Sequence Length (NSL) value of 0.45, closely followed by the tokenizer of GPT-4o from OpenAI with an average NSL value of 0.54, followed by Gemma 2, Meta Llama 3.1, and Mistral Large Instruct 2407 with average NSL values of 0.82, 1.4, and 1.48, respectively.
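The abstract reports average NSL values but does not spell out how NSL is computed. As a rough illustration only, the sketch below assumes the common reading of NSL as the number of tokens a tokenizer produces divided by the number produced by a baseline tokenizer on the same text, averaged over samples. The baseline, the per-sample averaging, the two toy tokenizers, and the sample sentences are all assumptions made for this example; they are not taken from the paper and do not reproduce its numbers.

```python
from typing import Callable, List, Sequence

# A tokenizer here is any function mapping a string to a list of tokens.
Tokenizer = Callable[[str], List[str]]


def average_nsl(texts: Sequence[str], tokenizer: Tokenizer, baseline: Tokenizer) -> float:
    """Average per-sample ratio len(tokenizer(text)) / len(baseline(text)).

    Assumed definition for illustration: values below 1 mean the tokenizer
    produces shorter sequences than the baseline on these texts.
    """
    ratios = []
    for text in texts:
        baseline_length = len(baseline(text))
        if baseline_length == 0:
            # Skip samples the baseline maps to no tokens (e.g. empty strings).
            continue
        ratios.append(len(tokenizer(text)) / baseline_length)
    if not ratios:
        raise ValueError("no sample produced a non-empty baseline tokenization")
    return sum(ratios) / len(ratios)


# Hypothetical stand-ins for real LLM tokenizers.
def whitespace_tokenizer(text: str) -> List[str]:
    # One token per whitespace-separated word.
    return text.split()


def character_tokenizer(text: str) -> List[str]:
    # One token per non-whitespace code point.
    return [ch for ch in text if not ch.isspace()]


if __name__ == "__main__":
    # Illustrative Assamese samples, not the paper's evaluation data.
    samples = ["অসমীয়া ভাষা", "মই ভাত খাওঁ"]
    score = average_nsl(samples, whitespace_tokenizer, character_tokenizer)
    print(f"average NSL (toy tokenizers): {score:.2f}")
```

Under this assumed definition, a lower value means fewer tokens per text relative to the baseline, which matches the abstract's ranking of 0.45 as the best result. To compare real models, the two toy functions would be replaced by the encode functions of the actual tokenizers.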

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Linguistics and Cultural Studies

Methods: LLaMA