Evaluating Robustness of Large Language Models in Enterprise Applications: Benchmarks for Perturbation Consistency Across Formats and Languages
Tara Bogavelli, Oluwanifemi Bamgbose, Gabrielle Gauthier Melançon, Fanny Riols, Roshnee Sharma

TL;DR
This paper introduces a comprehensive benchmark suite for evaluating the robustness of large language models in enterprise scenarios. It measures consistency under perturbations across formats and languages and finds that robustness does not simply scale with model size.
Contribution
It presents a new benchmark suite for assessing LLM robustness across diverse perturbations and evaluates 11 models, uncovering complex relationships between size and robustness.
Findings
Minor perturbations can reduce model performance by up to 40 percentage points
Model size does not linearly correlate with robustness
An 8B model outperforms some larger models in robustness
Abstract
Enterprise LLM applications require consistently high quality and reliable performance across diverse scenarios, demanding robustness to minor variations. Existing research shows that even small prompt changes can lead to substantial differences in output, but has mainly focused on a narrow set of perturbations with small academic datasets, limiting their relevance to real-world applications. To address this, we present a comprehensive benchmark suite that evaluates robustness across multiple perturbation types, including general text edits (e.g., punctuation, whitespace), formatting changes (e.g., JSON, YAML), multilingual and cross-lingual inputs, and positional variations in instructions. Evaluating 11 models ranging from 4B to 120B+ parameters, we find that minor perturbations reduce performance by up to 40 percentage points on key enterprise metrics. Critically, we demonstrate that…
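
As a rough illustration of the perturbation types named in the abstract, the sketch below applies a whitespace edit, a punctuation edit, and a JSON formatting change to a prompt, then reports the performance drop in percentage points. All function names here are hypothetical and are not taken from the paper's benchmark suite; exact-match accuracy is used as a stand-in metric, since the enterprise metrics themselves are not specified in this summary.

```python
import json


def perturb_whitespace(prompt: str) -> str:
    """Double every space: a minimal whitespace edit (illustrative only)."""
    return prompt.replace(" ", "  ")


def perturb_punctuation(prompt: str) -> str:
    """Strip trailing sentence punctuation: a minimal punctuation edit."""
    return prompt.rstrip(".?!")


def perturb_json(prompt: str) -> str:
    """Wrap the instruction in a JSON object: a formatting change."""
    return json.dumps({"instruction": prompt})


def accuracy(predictions, references) -> float:
    """Exact-match accuracy in percent (assumed stand-in metric)."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


def drop_in_points(baseline_acc: float, perturbed_acc: float) -> float:
    """Absolute drop in percentage points, not a relative percentage."""
    return baseline_acc - perturbed_acc
```

Note the unit: a fall from 80% to 40% accuracy is a drop of 40 percentage points, which is a 50% relative decrease. With these helpers, `drop_in_points(80.0, 40.0)` gives the former.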
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Software Engineering Research
