Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation

Bogdan Kostić; Conor Fallon; Julian Risch; Alexander Löser

arXiv:2602.17316 · cs.CL · February 20, 2026

TL;DR

This paper investigates how small, meaning-preserving lexical and syntactic changes to input prompts affect the performance and relative ranking of large language models, and argues that the models rely on surface-level patterns more than on linguistic understanding.

Contribution

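The paper introduces two controlled perturbation pipelines, one based on synonym substitution and one that uses dependency parsing to select applicable syntactic transformations, to evaluate LLM robustness systematically. The results highlight vulnerabilities in current models and the need for improved evaluation standards.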
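To make the idea of a lexical perturbation concrete, here is a minimal sketch of synonym substitution. It is not the authors' pipeline: the `SYNONYMS` table and the `substitute_synonyms` function are hypothetical names invented for illustration, and a real pipeline would need a proper lexical resource and word-sense checks to keep the rewritten prompt truth-conditionally equivalent.

```python
import re

# Hypothetical toy synonym table (an assumption for illustration only).
SYNONYMS = {
    "big": "large",
    "quick": "fast",
    "begin": "start",
}


def substitute_synonyms(text, synonyms=SYNONYMS):
    """Replace whole words found in `synonyms`, keeping initial capitalization."""

    def swap(match):
        word = match.group(0)
        replacement = synonyms.get(word.lower())
        if replacement is None:
            return word  # no known synonym: leave the word untouched
        return replacement.capitalize() if word[0].isupper() else replacement

    # Match alphabetic tokens only, so punctuation and spacing are preserved.
    return re.sub(r"[A-Za-z]+", swap, text)


original = "Which big river does the quick boat cross?"
perturbed = substitute_synonyms(original)
print(perturbed)  # Which large river does the fast boat cross?
```

A benchmark item would then be scored twice, once with `original` and once with `perturbed`, and the two scores compared.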

Findings

01 Lexical perturbations cause substantial, statistically significant performance drops across nearly all models and tasks.

02 Syntactic changes have mixed effects, sometimes improving results.

03 Model robustness does not scale consistently with model size.

Abstract

The rapid advancement of Large Language Models (LLMs) has established standardized evaluation benchmarks as the primary instrument for model comparison. Yet, their reliability is increasingly questioned due to sensitivity to shallow variations in input prompts. This paper examines how controlled, truth-conditionally equivalent lexical and syntactic perturbations affect the absolute performance and relative ranking of 23 contemporary LLMs across three benchmarks: MMLU, SQuAD, and AMEGA. We employ two linguistically principled pipelines to generate meaning-preserving variations: one performing synonym substitution for lexical changes, and another using dependency parsing to determine applicable syntactic transformations. Results show that lexical perturbations consistently induce substantial, statistically significant performance degradation across nearly all models and tasks, while…
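The abstract distinguishes absolute performance from relative ranking. One common way to quantify a ranking change is a rank correlation between scores on the original and perturbed benchmark. The sketch below computes Kendall's tau-a in plain Python; this choice of statistic is an assumption, not necessarily the measure the paper uses, and the model names and scores are made-up placeholders, not results from the paper.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts keyed by model name.

    Returns 1.0 when both score sets order the models identically and
    -1.0 when one ordering is the exact reverse of the other.
    """
    models = sorted(scores_a)
    if len(models) < 2:
        raise ValueError("need at least two models to compare rankings")
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        # Positive product: both score sets order this pair the same way.
        sign = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up illustrative accuracies (assumption), NOT results from the paper.
original = {"model_a": 0.80, "model_b": 0.75, "model_c": 0.60}
perturbed = {"model_a": 0.70, "model_b": 0.72, "model_c": 0.55}

tau = kendall_tau(original, perturbed)
mean_drop = sum(original[m] - perturbed[m] for m in original) / len(original)
print(f"tau = {tau:.3f}, mean absolute drop = {mean_drop:.3f}")
```

In this toy case every model loses accuracy, and `model_a` and `model_b` also swap places, so the ranking changes even though the drop looks uniform at first glance.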

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Artificial Intelligence in Healthcare and Education