BhashaBench V1: A Comprehensive Benchmark for the Quadrant of Indic Domains
Vijay Devane, Mohd Nauman, Bhargav Patel, Aniket Mahendra Wakchoure, Yogeshkumar Sant, Shyam Pawar, Viraj Thakur, Ananya Godse, Sunil Patra, Neha Maurya, Suraj Racha, Nitish Kamal Singh, Ajay Nagpal, Piyush Sawarkar, Kundeshwar Vijayrao Pundalik, Rohit Saluja, Ganesh Ramakrishnan

TL;DR
BhashaBench V1 is a comprehensive, domain-specific bilingual benchmark designed to evaluate large language models on India-centric knowledge across multiple domains, highlighting significant performance gaps especially in low-resource areas.
Contribution
It introduces the first multi-task, bilingual benchmark for Indian domains, with extensive curated data, enabling detailed evaluation of LLMs' domain-specific and bilingual capabilities.
Findings
Models perform better on English than Hindi across domains.
Significant performance gaps exist in low-resource domains.
Subdomain analysis reveals areas of relative strength and weakness.
Abstract
The rapid advancement of large language models (LLMs) has intensified the need for domain- and culture-specific evaluation. Existing benchmarks are largely Anglocentric and domain-agnostic, limiting their applicability to India-centric contexts. To address this gap, we introduce BhashaBench V1, the first domain-specific, multi-task, bilingual benchmark focusing on critical Indic knowledge systems. BhashaBench V1 contains 74,166 meticulously curated question-answer pairs, with 52,494 in English and 21,672 in Hindi, sourced from authentic government and domain-specific exams. It spans four major domains: Agriculture, Legal, Finance, and Ayurveda, comprising 90+ subdomains and covering 500+ topics, enabling fine-grained evaluation. Evaluation of 29+ LLMs reveals significant domain- and language-specific performance gaps, with especially large disparities in low-resource domains. For instance,…
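The language split stated in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration: the counts come from the abstract, while the function and variable names are hypothetical and not part of the benchmark's tooling.

```python
# Counts as stated in the abstract.
TOTAL_PAIRS = 74_166
PAIRS_BY_LANGUAGE = {"English": 52_494, "Hindi": 21_672}


def language_shares(counts, total):
    """Return each language's share of the total as a percentage.

    Hypothetical helper for illustration; raises if the per-language
    counts do not add up to the stated total.
    """
    if sum(counts.values()) != total:
        raise ValueError("per-language counts do not sum to the stated total")
    return {lang: 100 * n / total for lang, n in counts.items()}


shares = language_shares(PAIRS_BY_LANGUAGE, TOTAL_PAIRS)
for lang, pct in shares.items():
    print(f"{lang}: {pct:.1f}%")
# English: 70.8%
# Hindi: 29.2%
```

The two counts sum to the stated total, and English accounts for roughly 70% of the question-answer pairs.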
Peer Reviews
Decision·ICLR 2026 Conference Desk Rejected Submission
- The dataset is a valuable contribution, as the authors curate a benchmark that addresses the underrepresentation of benchmarks that evaluate language models in Indic-centric contexts.
- The benchmark contains questions across diverse topics and subdomains and enables fine-grained evaluations across all.
- Their experiments cut across both frontier and open-source models, revealing disparities in model performance.
- The bilingual focus of their model is very relevant for evaluating models '
- The evaluation relies heavily on multiple-choice questions. MCQs are easy to evaluate and have been employed across several benchmarks. There are still questions surrounding whether they truly capture model understanding or reasoning abilities. Are the distractors sufficiently challenging, particularly on domains where performance is above 90%? What happens if you scale the options? Previous work has shown that sometimes scaling distractors beyond 4 options increases the difficulty of guessi
1. The paper introduces a new benchmark that includes questions related to less-represented domains such as Agriculture, Ayurveda, Legal and Finance in the Indian cultural context.
2. The authors use a sound multi-step data preparation pipeline which focuses on ensuring quality by using formatting pipelines and manual validation by experts.
3. The benchmark is composed of diverse question types and represents 90+ subdomains.
4. The authors evaluate open and closed source models of various siz
1. The benchmark has a bias towards exam-style questions since it is sourced from professional and government exams. Similarly, it covers a limited set of domains with 70% of the data being in English.
1. The benchmark dataset is very rich in terms of (a) the number of sample points (QA pairs), (b) language (i.e., non-English), and (c) the domains and subdomains it focuses on.
2. The number of models evaluated is large and ranges from small to larger parameter counts, so the conclusions and insights are extensive because the work covers many models.
1. I am excited to see the extensive number of LLMs being evaluated; however, it would be helpful to have some qualitative error analysis to find the patterns where the model(s) fail, which might indicate why they fail. The current results list aggregate scores across domains/subdomains, but there’s little insight into failure modes (common errors). An error taxonomy with manual audits per domain and difficulty might be useful to readers.
2. the r
