Metamorphic Testing of Large Language Models for Natural Language Processing
Steven Cho, Stefano Ruberto, Valerio Terragni

TL;DR
This paper explores metamorphic testing as a way to identify faulty behaviors in large language models on NLP tasks. It contributes a literature review, a collection of 191 metamorphic relations, and roughly 560,000 metamorphic tests run across three LLMs.
Contribution
The paper presents what it describes as the most extensive study of metamorphic testing for LLMs, including a large collection of metamorphic relations and an empirical evaluation on three LLMs.
Findings
Metamorphic testing (MT) can effectively expose faulty behaviors in LLMs
Collected 191 metamorphic relations for NLP tasks
Conducted 560,000 metamorphic tests across three LLMs
Abstract
Using large language models (LLMs) to perform natural language processing (NLP) tasks has become increasingly pervasive in recent times. The versatile nature of LLMs makes them applicable to a wide range of such tasks. While the performance of recent LLMs is generally outstanding, several studies have shown that they can often produce incorrect results. Automatically identifying these faulty behaviors is extremely useful for improving the effectiveness of LLMs. One obstacle to this is the limited availability of labeled datasets, which necessitates an oracle to determine the correctness of LLM behaviors. Metamorphic testing (MT) is a popular testing approach that alleviates this oracle problem. At the core of MT are metamorphic relations (MRs), which define relationships between the outputs of related inputs. MT can expose faulty behaviors without the need for explicit oracles (e.g.,…
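The idea of a metamorphic relation can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: `toy_sentiment` is a hypothetical stand-in for an LLM call (a case-sensitive keyword lookup with a deliberate fault), and the case-invariance relation is a generic example, not necessarily one of the relations collected in the paper. The point is that the check compares the outputs of a source input and a transformed follow-up input, so no labeled ground truth is needed.

```python
from typing import Callable, List


def toy_sentiment(text: str) -> str:
    # Hypothetical stand-in for an LLM call: a case-sensitive keyword lookup.
    # The case sensitivity is a deliberate fault for the relation to expose.
    positive = {"good", "great"}
    negative = {"bad", "terrible"}
    words = [w.strip(".,!?") for w in text.split()]
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def to_upper(text: str) -> str:
    # Input transformation: changing letter case should preserve sentiment.
    return text.upper()


def find_violations(
    model: Callable[[str], str],
    transform: Callable[[str], str],
    inputs: List[str],
) -> List[str]:
    # Return the source inputs whose follow-up output breaks the relation.
    violations = []
    for source in inputs:
        follow_up = transform(source)
        # Invariance relation: source and follow-up outputs must be equal.
        if model(source) != model(follow_up):
            violations.append(source)
    return violations


inputs = [
    "The food was great.",
    "The service was terrible.",
    "We arrived at noon.",
]
violations = find_violations(toy_sentiment, to_upper, inputs)
print(violations)
```

In this sketch the first two inputs are flagged, because the toy model changes its answer when the text is uppercased, while the third is not. A violation signals inconsistent behavior on related inputs; it does not by itself say which of the two outputs is wrong.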
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)
