Benchmarking Hindi LLMs: A New Suite of Datasets and a Comparative Analysis

Anusha Kamath; Kanishk Singla; Rakesh Paul; Raviraj Joshi; Utkarsh Vaidya; Sanjay Singh Chauhan; Niranjan Wartikar

arXiv:2508.19831·cs.CL·October 16, 2025



TL;DR

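This paper introduces a suite of five Hindi evaluation datasets for instruction-tuned LLMs (IFEval-Hi, MT-Bench-Hi, GSM8K-Hi, ChatRAG-Hi, and BFCL-Hi) and uses it to benchmark open-source LLMs that support Hindi, with attention to linguistic and cultural nuances that direct translation of English datasets misses.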

Contribution

The paper presents a methodology for building Hindi benchmarks that combines from-scratch human annotation with a translate-and-verify process, along with a comparative analysis of open-source LLMs that support Hindi.

Findings

1. New Hindi evaluation datasets introduced
2. Benchmarking reveals strengths and weaknesses of existing LLMs in Hindi
3. Methodology applicable to other low-resource languages

Abstract

Evaluating instruction-tuned Large Language Models (LLMs) in Hindi is challenging due to a lack of high-quality benchmarks, as direct translation of English datasets fails to capture crucial linguistic and cultural nuances. To address this, we introduce a suite of five Hindi LLM evaluation datasets: IFEval-Hi, MT-Bench-Hi, GSM8K-Hi, ChatRAG-Hi, and BFCL-Hi. These were created using a methodology that combines from-scratch human annotation with a translate-and-verify process. We leverage this suite to conduct an extensive benchmarking of open-source LLMs supporting Hindi, providing a detailed comparative analysis of their current capabilities. Our curation process also serves as a replicable methodology for developing benchmarks in other low-resource languages.
