Metamorphic Evaluation of ChatGPT as a Recommender System
Madhurima Khirbat, Yongli Ren, Pablo Castells, Mark Sanderson

TL;DR
This paper introduces a metamorphic testing framework to evaluate ChatGPT-based recommender systems, revealing the need for specialized evaluation methods due to their probabilistic and black-box nature.
Contribution
It proposes a novel metamorphic testing approach for LLM-based recommenders, addressing the limitations of traditional evaluation metrics for these models.
Findings
Lower similarity scores indicate inconsistencies in GPT-based recommendations
Traditional metrics are insufficient for evaluating LLM-based recommender systems
Metamorphic testing reveals the need for comprehensive evaluation methods
Abstract
With the rise of Large Language Models (LLMs) such as ChatGPT, researchers have been exploring how to use LLMs for better recommendations. However, although LLMs are black-box (their internal workings are not visible) and probabilistic, the evaluation frameworks used to assess LLM-based recommender systems (RS) are the same as those used for traditional recommender systems. To address this gap, we introduce metamorphic testing for the evaluation of GPT-based RS. This testing technique involves defining metamorphic relations (MRs) between the inputs and checking whether those relations are satisfied in the outputs. Specifically, we examined MRs from both the RS and LLM perspectives, including rating multiplication/shifting in RS and adding spaces/randomness to the LLM prompt via prompt perturbation. Similarity metrics (e.g. Kendall…
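The rating multiplication and shifting relations described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption for illustration and not the authors' implementation: `toy_recommender` is a hypothetical deterministic stand-in for the GPT-based system under test, and Jaccard overlap is used as a simple stand-in similarity measure, not necessarily one of the metrics used in the paper.

```python
from typing import Callable, Dict, List

Ratings = Dict[str, float]


def toy_recommender(ratings: Ratings, k: int = 3) -> List[str]:
    """Hypothetical stand-in for the system under test: top-k items by rating."""
    ranked = sorted(ratings, key=lambda item: (-ratings[item], item))
    return ranked[:k]


def multiply_ratings(ratings: Ratings, factor: float) -> Ratings:
    """Follow-up input: every rating scaled by the same positive factor."""
    return {item: value * factor for item, value in ratings.items()}


def shift_ratings(ratings: Ratings, offset: float) -> Ratings:
    """Follow-up input: every rating shifted by the same offset."""
    return {item: value + offset for item, value in ratings.items()}


def jaccard(a: List[str], b: List[str]) -> float:
    """Set overlap between two recommendation lists (assumed similarity measure)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def check_relation(
    recommend: Callable[[Ratings], List[str]],
    ratings: Ratings,
    transform: Callable[[Ratings], Ratings],
) -> float:
    """Compare the output on the source input with the output on the follow-up input."""
    source_output = recommend(ratings)
    follow_up_output = recommend(transform(ratings))
    return jaccard(source_output, follow_up_output)


ratings = {"a": 5.0, "b": 3.0, "c": 4.0, "d": 1.0}
multiplication_score = check_relation(
    toy_recommender, ratings, lambda r: multiply_ratings(r, 2.0)
)
shifting_score = check_relation(
    toy_recommender, ratings, lambda r: shift_ratings(r, 1.0)
)
```

Both transformations preserve the relative order of the ratings, so a recommender that satisfies the relation should return the same items for the source and follow-up inputs, giving a similarity of 1.0. A score below 1.0 would flag an inconsistency. To test an LLM-based recommender, `toy_recommender` would be replaced by a function that builds a prompt and parses the model's reply.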
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare
