LLMORPH: Automated Metamorphic Testing of Large Language Models

Steven Cho, Stefano Ruberto, and Valerio Terragni

arXiv:2603.23611 · cs.SE · March 26, 2026 · ASE

Open Access

TL;DR

LLMORPH is an automated testing tool that uses Metamorphic Testing to evaluate the robustness of large language models on NLP tasks without relying on labeled data, exposing faulty behaviors as inconsistencies between model outputs.

Contribution

This paper introduces LLMORPH, an automated testing framework that applies Metamorphic Testing to LLMs. It can be adapted to different models and NLP tasks and requires no human-labeled data.

Findings

01 Tested three state-of-the-art LLMs using 36 Metamorphic Relations (MRs).

02 Ran over 561,000 test executions.

03 Exposed inconsistencies in LLM outputs.

Abstract

Automated testing is essential for evaluating and improving the reliability of Large Language Models (LLMs), yet the lack of automated oracles for verifying output correctness remains a key challenge. We present LLMORPH, an automated testing tool specifically designed for LLMs performing NLP tasks, which leverages Metamorphic Testing (MT) to uncover faulty behaviors without relying on human-labeled data. MT uses Metamorphic Relations (MRs) to generate follow-up inputs from source test inputs, enabling detection of inconsistencies in model outputs without the need for expensive labeled data. LLMORPH is aimed at researchers and developers who want to evaluate the robustness of LLM-based NLP systems. In this paper, we detail the design, implementation, and practical usage of LLMORPH, demonstrating how it can be easily extended to any LLM, NLP task, and set of MRs. In our evaluation, we…
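The metamorphic testing loop described in the abstract can be illustrated with a minimal Python sketch. This is not LLMORPH's actual API: the `MetamorphicRelation` class, the `run_mr` helper, the uppercase-invariance relation, and the `toy_sentiment_model` stub (a stand-in for a real LLM call) are all hypothetical names chosen for illustration. The sketch only shows the general idea: an MR transforms a source input into a follow-up input, and a violation is reported when the two outputs break the expected relation, with no labeled data involved.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class MetamorphicRelation:
    """Hypothetical MR: an input transformation plus an output consistency check."""
    name: str
    transform: Callable[[str], str]    # source input -> follow-up input
    holds: Callable[[str, str], bool]  # (source output, follow-up output) -> consistent?


def toy_sentiment_model(text: str) -> str:
    """Stand-in for an LLM call (assumption). Deliberately case-sensitive,
    so that the MR below can expose an inconsistency."""
    return "positive" if "good" in text else "negative"


def run_mr(
    model: Callable[[str], str],
    mr: MetamorphicRelation,
    sources: List[str],
) -> List[Tuple[str, str, str, str]]:
    """Run one MR over all source inputs and collect violations.

    No ground-truth labels are used: only the two model outputs are compared.
    """
    violations = []
    for src in sources:
        follow_up = mr.transform(src)
        out_src, out_follow_up = model(src), model(follow_up)
        if not mr.holds(out_src, out_follow_up):
            violations.append((src, follow_up, out_src, out_follow_up))
    return violations


# Illustrative MR: upper-casing the text should not change the predicted sentiment.
uppercase_mr = MetamorphicRelation(
    name="uppercase-invariance",
    transform=str.upper,
    holds=lambda source_out, follow_up_out: source_out == follow_up_out,
)

sources = ["The food was good.", "The service was slow."]
violations = run_mr(toy_sentiment_model, uppercase_mr, sources)

for src, follow_up, out_src, out_follow_up in violations:
    print(f"[{uppercase_mr.name}] {src!r} -> {out_src}, {follow_up!r} -> {out_follow_up}")
```

In this toy setup the first source input triggers a violation, because the stub model changes its answer once the text is upper-cased. Swapping in a real model would only require replacing `toy_sentiment_model` with a function that sends the prompt to the LLM and returns its answer.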

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Software Testing and Debugging Techniques