Semantic Invariance in Agentic AI
I. de Zarzà, J. de Curtò, Jordi Cabot, Pietro Manzoni, Carlos T. Calafate

TL;DR
This paper introduces a metamorphic testing framework to evaluate the semantic invariance of large language models acting as reasoning agents, revealing that larger models are not necessarily more robust to input variations.
Contribution
We develop a systematic testing method using semantic-preserving transformations to assess LLM reasoning robustness across multiple models and domains.
Findings
Smaller models like Qwen3-30B-A3B show higher robustness (79.6%) than larger models.
Model scale does not correlate with increased robustness.
Semantic invariance varies significantly across models and transformations.
Abstract
Large Language Models (LLMs) increasingly serve as autonomous reasoning agents in decision support, scientific problem-solving, and multi-agent coordination systems. However, deploying LLM agents in consequential applications requires assurance that their reasoning remains stable under semantically equivalent input variations, a property we term semantic invariance. Standard benchmark evaluations, which assess accuracy on fixed, canonical problem formulations, fail to capture this critical reliability dimension. To address this shortcoming, in this paper we present a metamorphic testing framework for systematically assessing the robustness of LLM reasoning agents, applying eight semantic-preserving transformations (identity, paraphrase, fact reordering, expansion, contraction, academic context, business context, and contrastive formulation) across seven foundation models spanning four…
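To make the idea of a metamorphic test concrete, the following is a minimal sketch, not the authors' implementation. It assumes a problem can be represented as a list of facts plus a question, and it implements only the transformations that can be done deterministically without a language model (identity, fact reordering, academic context, business context). The names `Problem`, `invariance_report`, `invariance_rate`, and the two toy agents are hypothetical, and the invariance rate computed here is an illustrative metric rather than the paper's exact scoring.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Problem:
    """Hypothetical problem representation: independent facts plus a question."""
    facts: Tuple[str, ...]
    question: str

    def render(self) -> str:
        return " ".join(self.facts) + " " + self.question


# --- Semantic-preserving transformations (illustrative subset) ---

def identity(p: Problem) -> Problem:
    return p


def fact_reordering(p: Problem) -> Problem:
    # Reversing is a deterministic stand-in for an arbitrary permutation.
    return Problem(tuple(reversed(p.facts)), p.question)


def academic_context(p: Problem) -> Problem:
    return Problem(("Consider the following exam exercise.",) + p.facts, p.question)


def business_context(p: Problem) -> Problem:
    return Problem(("A client has sent the following request.",) + p.facts, p.question)


TRANSFORMATIONS: Dict[str, Callable[[Problem], Problem]] = {
    "identity": identity,
    "fact_reordering": fact_reordering,
    "academic_context": academic_context,
    "business_context": business_context,
}

Agent = Callable[[str], str]


def invariance_report(agent: Agent, problem: Problem) -> Dict[str, bool]:
    """For each transformation, check whether the agent's answer matches
    its answer on the original (untransformed) prompt."""
    baseline = agent(problem.render())
    return {
        name: agent(transform(problem).render()) == baseline
        for name, transform in TRANSFORMATIONS.items()
    }


def invariance_rate(agent: Agent, problem: Problem) -> float:
    report = invariance_report(agent, problem)
    return sum(report.values()) / len(report)


# --- Toy agents standing in for LLM calls (assumptions for illustration) ---

def robust_agent(prompt: str) -> str:
    # Sums every integer in the prompt, so fact order and framing do not matter.
    return str(sum(int(n) for n in re.findall(r"-?\d+", prompt)))


def brittle_agent(prompt: str) -> str:
    # Returns the first integer it sees, so it is sensitive to fact order.
    return re.findall(r"-?\d+", prompt)[0]
```

In practice the agent would wrap a model call, and answer comparison would need normalization (for example, numeric tolerance or an equivalence judge) rather than exact string equality. Transformations such as paraphrase, expansion, and contraction typically require a generator model and are omitted here for that reason.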
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Multimodal Machine Learning Applications
