Evaluating Robustness of Large Language Models in Enterprise Applications: Benchmarks for Perturbation Consistency Across Formats and Languages

Tara Bogavelli; Oluwanifemi Bamgbose; Gabrielle Gauthier Melançon; Fanny Riols; Roshnee Sharma

arXiv:2601.06341·cs.LG·January 13, 2026

TL;DR

This paper introduces a benchmark suite for evaluating the robustness of large language models in enterprise scenarios. It measures perturbation consistency across formats and languages and finds that robustness does not scale linearly with model size.

Contribution

The paper presents a new benchmark suite for assessing LLM robustness across diverse perturbation types and evaluates 11 models, finding that the relationship between model size and robustness is not straightforward.

Findings

01

Minor perturbations can reduce model performance by up to 40 percentage points

02

Model size does not linearly correlate with robustness

03

An 8B model outperforms some larger models in robustness

Abstract

Enterprise LLM applications require consistently high quality and reliable performance across diverse scenarios, demanding robustness to minor variations. Existing research shows that even small prompt changes can lead to substantial differences in output, but has mainly focused on a narrow set of perturbations with small academic datasets, limiting its relevance to real-world applications. To address this, we present a comprehensive benchmark suite that evaluates robustness across multiple perturbation types, including general text edits (e.g., punctuation, whitespace), formatting changes (e.g., JSON, YAML), multilingual and cross-lingual inputs, and positional variations in instructions. Evaluating 11 models ranging from 4B to 120B+ parameters, we find that minor perturbations reduce performance by up to 40 percentage points on key enterprise metrics. Critically, we demonstrate that…
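To make the perturbation types named in the abstract concrete, here is a minimal Python sketch. It is not the authors' benchmark code: the function names, the perturbation rate, and the exact-match metric are all assumptions chosen for illustration. It shows a whitespace edit, a punctuation edit, a JSON re-formatting of the same prompt, and how a drop in percentage points would be computed from two accuracy scores.

```python
import json
import random


def perturb_whitespace(text: str, rng: random.Random, rate: float = 0.3) -> str:
    """Double a random subset of spaces (hypothetical whitespace edit)."""
    return "".join(
        ch * 2 if (ch == " " and rng.random() < rate) else ch for ch in text
    )


def perturb_punctuation(text: str) -> str:
    """Strip trailing sentence punctuation (hypothetical punctuation edit)."""
    return text.rstrip(".?!")


def to_json_format(instruction: str, context: str) -> str:
    """Re-express the same prompt content as a JSON object."""
    return json.dumps({"instruction": instruction, "context": context}, indent=2)


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Exact-match accuracy in percent (an assumed metric, for illustration)."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


def percentage_point_drop(baseline_acc: float, perturbed_acc: float) -> float:
    """Absolute difference between two percentage scores."""
    return baseline_acc - perturbed_acc


if __name__ == "__main__":
    rng = random.Random(0)
    prompt = "Summarize the ticket below. What is the root cause?"
    print(repr(perturb_whitespace(prompt, rng)))
    print(repr(perturb_punctuation(prompt)))
    print(to_json_format("Summarize the ticket below.", "Printer offline since Monday."))
```

Note the distinction between percentage points and relative percent: with made-up scores of 90% on original prompts and 50% on perturbed prompts, `percentage_point_drop(90.0, 50.0)` gives 40 points, while the relative decrease would be about 44%.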

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Software Engineering Research