IndicIFEval: A Benchmark for Verifiable Instruction-Following Evaluation in 14 Indic Languages

Thanmay Jayakumar; Mohammed Safi Ur Rahman Khan; Raj Dabre; Ratish Puduppully; Anoop Kunchukuttan

arXiv:2602.22125·cs.CL·February 26, 2026

TL;DR

IndicIFEval is a benchmark that evaluates how well large language models follow instructions in 14 Indic languages, addressing a gap left by predominantly English-centric instruction-following evaluation.

Contribution

The paper introduces IndicIFEval, a benchmark of around 800 human-verified examples per language across 14 Indic languages, built on automatically verifiable, rule-based instructions so that LLMs' instruction-following can be evaluated systematically in these languages.

Findings

01 Models perform well on formatting but struggle with lexical and cross-lingual tasks.

02 Progress in high-resource Indic languages is notable, but overall performance lags behind English.

03 IndicIFEval and scripts are publicly released to foster further research.

Abstract

Instruction-following benchmarks remain predominantly English-centric, leaving a critical evaluation gap for the hundreds of millions of Indic language speakers. We introduce IndicIFEval, a benchmark evaluating constrained generation of LLMs across 14 Indic languages using automatically verifiable, rule-based instructions. It comprises around 800 human-verified examples per language spread across two complementary subsets: IndicIFEval-Ground, translated prompts from IFEval (Zhou et al., 2023) carefully localized for Indic contexts, and IndicIFEval-Ground, synthetically generated instructions grounded in native Indic content. We conduct a comprehensive evaluation of major open-weight and proprietary models spanning both reasoning and non-reasoning models. While models maintain strong adherence to formatting constraints, they struggle significantly with lexical and cross-lingual tasks --…
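To make "automatically verifiable, rule-based instructions" concrete, here is a minimal sketch of what such checks can look like. It is not the benchmark's released code: the function names, the whitespace-based word count, and the bullet pattern are all illustrative assumptions in the spirit of IFEval-style verifiers.

```python
import re

# Hypothetical rule-based checkers; names and rules are assumptions for
# illustration, not the verifiers shipped with IndicIFEval.

def check_min_words(response: str, min_words: int) -> bool:
    # Assumes whitespace tokenisation, which works for space-delimited
    # Indic scripts such as Devanagari.
    return len(response.split()) >= min_words

def check_keyword_count(response: str, keyword: str, min_count: int) -> bool:
    # Lexical constraint: the keyword must occur at least min_count times.
    return response.count(keyword) >= min_count

def check_bullet_count(response: str, expected: int) -> bool:
    # Formatting constraint: exactly `expected` lines starting with "*" or "-".
    bullets = [ln for ln in response.splitlines() if re.match(r"^\s*[*-]\s+", ln)]
    return len(bullets) == expected

def follows_all(response: str, checks) -> bool:
    # A response passes only if every (checker, kwargs) pair returns True.
    return all(fn(response, **kwargs) for fn, kwargs in checks)

# Example: a Hindi response with two bullet points.
response = "* पहला बिंदु\n* दूसरा बिंदु"
checks = [
    (check_bullet_count, {"expected": 2}),
    (check_keyword_count, {"keyword": "बिंदु", "min_count": 2}),
    (check_min_words, {"min_words": 4}),
]
result = follows_all(response, checks)
```

Because each rule is a deterministic function of the response text, a score can be computed without a human or model judge. A real implementation would need script-aware handling (for example, Unicode normalization before keyword matching), which this sketch omits.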

Code & Models

Datasets

ai4bharat/IndicIFEval
dataset · 98 downloads

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling