Is Sanskrit the most token-efficient language? A quantitative study using GPT, Gemini, and SentencePiece
Anshul Kumar

TL;DR
This study quantifies Sanskrit's high token efficiency in LLMs compared to English and Hindi, revealing potential for cost savings and faster inference, and highlights the impact of tokenizer bias on non-English languages.
Contribution
It provides the first quantitative analysis of Sanskrit's token efficiency in modern tokenizers, demonstrating its compactness and implications for LLM cost and bias.
Findings
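Under the unbiased SentencePiece baseline, Sanskrit needs roughly half as many tokens as English or Hindi for the same verses (a ~2x difference).
The latest tokenizers reduce bias against non-English text but do not fully capture Sanskrit's compactness.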
Sanskrit's morphology offers significant potential for efficient language modeling.
Abstract
Tokens are the basic units of Large Language Models (LLMs). LLMs rely on tokenizers to segment text into these tokens, and tokenization is the primary determinant of computational and inference cost. Sanskrit, one of the oldest languages, is hypothesized to express more meaning per token due to its morphology and grammar rules; however, no prior work has quantified this. We use a dataset of 701 parallel verses of the Bhagavad Gita in three languages (Sanskrit, English, and Hindi), along with a transliteration of the Sanskrit into English. We test tokenizers including SentencePiece (SPM), older GPT models, and the latest-generation tokenizers from Gemini and GPT. Our metrics are token count, characters per token (token efficiency), and tokens per character (token cost). Results show a ~2x difference in token counts between Sanskrit and English/Hindi under the unbiased SPM…
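The three metrics named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `token_metrics`, the whitespace stand-in tokenizer, and the choice to count characters as Unicode code points via `len()` are all assumptions made for illustration. The paper may define "character" differently (for example, as grapheme clusters, which matters for Devanagari).

```python
from typing import Callable, Dict, Sequence


def token_metrics(text: str, tokenize: Callable[[str], Sequence]) -> Dict[str, float]:
    """Compute token count, characters per token, and tokens per character.

    `tokenize` is any callable that maps a string to a sequence of tokens
    (strings or integer ids). Characters are counted as Unicode code points,
    which is an assumption of this sketch.
    """
    n_tokens = len(tokenize(text))
    n_chars = len(text)
    return {
        "token_count": n_tokens,
        # Token efficiency: higher means each token covers more text.
        "chars_per_token": n_chars / n_tokens if n_tokens else 0.0,
        # Token cost: higher means more tokens are spent per character.
        "tokens_per_char": n_tokens / n_chars if n_chars else 0.0,
    }


def whitespace_tokenize(text: str) -> Sequence[str]:
    """Hypothetical stand-in tokenizer: split on whitespace."""
    return text.split()


if __name__ == "__main__":
    sample = "a bc def"
    print(token_metrics(sample, whitespace_tokenize))
```

To compare real tokenizers, the same function could be given the `encode` method of a trained subword tokenizer in place of `whitespace_tokenize`, applied to each parallel verse and averaged per language.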
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling
