IndicSentEval: How Effectively do Multilingual Transformer Models encode Linguistic Properties for Indic Languages?

Akhilesh Aravapalli; Mounika Marreddy; Radhika Mamidi; Manish Gupta; Subba Reddy Oota

arXiv:2410.02611·cs.CL·November 4, 2025

Open Access

TL;DR

This study evaluates how well multilingual Transformer models encode linguistic properties, and how robust that encoding is to input perturbations, across 6 Indic languages using a new benchmark dataset. Indic-specific models encode Indic linguistic properties better, while universal models are more robust to perturbations.

Contribution

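Introduces IndicSentEval, a novel benchmark dataset of approximately 47K sentences for probing multilingual models on Indic languages, and analyzes encoding capability and robustness for 8 linguistic properties under 13 perturbations across 9 multilingual Transformer models (7 universal and 2 Indic-specific).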

Findings

01. Indic-specific models better encode Indic linguistic properties.

02. Universal models show greater robustness to input perturbations.

03. Multilingual models perform well on English but have mixed results on Indic languages.

Abstract

Transformer-based models have revolutionized the field of natural language processing. To understand why they perform so well and to assess their reliability, several studies have focused on questions such as: Which linguistic properties are encoded by these models, and to what extent? How robust are these models in encoding linguistic properties when faced with perturbations in the input text? However, these studies have mainly focused on BERT and the English language. In this paper, we investigate similar questions regarding encoding capability and robustness for 8 linguistic properties across 13 different perturbations in 6 Indic languages, using 9 multilingual Transformer models (7 universal and 2 Indic-specific). To conduct this study, we introduce a novel multilingual benchmark dataset, IndicSentEval, containing approximately 47K sentences. Surprisingly, our probing analysis…

Taxonomy

Topics: Natural Language Processing Techniques · Language and cultural evolution · Topic Modeling

Methods: Attention Is All You Need · Dense Connections · Adam · WordPiece · Attention Dropout · Linear Layer · Residual Connection · Weight Decay · Position-Wise Feed-Forward Layer