Metamorphic Testing of Large Language Models for Natural Language Processing

Steven Cho; Stefano Ruberto; Valerio Terragni

arXiv:2511.02108·cs.SE·January 12, 2026

Open Access

TL;DR

This paper examines metamorphic testing (MT) as a way to expose faulty behaviors in large language models performing NLP tasks. It contributes a literature review, a collection of 191 metamorphic relations, and about 560,000 metamorphic tests run across three LLMs.

Contribution

The paper offers the most extensive study of metamorphic testing for LLMs, including a large collection of metamorphic relations and an empirical evaluation on three models.

Findings

01. MT can effectively expose faulty behaviors in LLMs

02. Collected 191 metamorphic relations for NLP tasks

03. Conducted 560,000 metamorphic tests across three LLMs

Abstract

Using large language models (LLMs) to perform natural language processing (NLP) tasks has become increasingly pervasive in recent times. The versatile nature of LLMs makes them applicable to a wide range of such tasks. While the performance of recent LLMs is generally outstanding, several studies have shown that they can often produce incorrect results. Automatically identifying these faulty behaviors is extremely useful for improving the effectiveness of LLMs. One obstacle to this is the limited availability of labeled datasets, which necessitates an oracle to determine the correctness of LLM behaviors. Metamorphic testing (MT) is a popular testing approach that alleviates this oracle problem. At the core of MT are metamorphic relations (MRs), which define relationships between the outputs of related inputs. MT can expose faulty behaviors without the need for explicit oracles (e.g.,…
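To make the idea of a metamorphic relation concrete, here is a minimal Python sketch. It is not taken from the paper: the relation (replacing a word with a synonym should leave a sentiment label unchanged), the `toy_sentiment_model` stand-in for an LLM, the `SYNONYMS` table, and the helper names are all hypothetical and chosen only for illustration. The point is that a violation is detected by comparing two outputs with each other, with no ground-truth label involved.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for the model under test. A real harness would
# send the text to an LLM and parse its answer instead.
def toy_sentiment_model(text: str) -> str:
    positive = {"good", "great", "excellent"}
    negative = {"bad", "awful", "terrible"}
    words = text.lower().split()
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Hypothetical synonym table used to derive follow-up inputs.
SYNONYMS = {"great": "superb", "bad": "awful"}

def synonym_replacement(text: str) -> str:
    """Input transformation: swap known words for a synonym."""
    return " ".join(SYNONYMS.get(w, w) for w in text.split())

def find_violations(
    model: Callable[[str], str],
    transform: Callable[[str], str],
    source_inputs: List[str],
) -> List[Tuple[str, str, str, str]]:
    """Check the relation 'the output must not change under transform'.

    Returns one tuple (source, follow_up, source_output, follow_up_output)
    for every input pair whose outputs differ.
    """
    violations = []
    for source in source_inputs:
        follow_up = transform(source)
        out_source, out_follow_up = model(source), model(follow_up)
        if out_source != out_follow_up:
            violations.append((source, follow_up, out_source, out_follow_up))
    return violations

if __name__ == "__main__":
    inputs = ["The movie was great", "The food was bad"]
    for violation in find_violations(toy_sentiment_model, synonym_replacement, inputs):
        print(violation)
```

In this toy setup the first input violates the relation, because the stand-in model does not recognize "superb", while the second input satisfies it. Neither input needed a labeled expected answer, which is how MT sidesteps the oracle problem described above.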

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)