Can LLM Assist in the Evaluation of the Quality of Machine Learning Explanations?
Bo Wang, Yiqiao Li, Jianlong Zhou, Fang Chen

TL;DR
This paper explores the potential of large language models to evaluate machine learning explanations, comparing their assessment capabilities with human judges in an iris classification context.
Contribution
It introduces a workflow combining LLM-based and human evaluation for explanations and assesses LLMs' effectiveness as explanation judges.
Findings
LLM-based judges can effectively evaluate the quality of explanations on subjective metrics.
LLM-based judges are not yet ready to replace human judges in explanation evaluation.
The proposed workflow supports evaluation that combines LLM-based and human judges.
Abstract
EXplainable machine learning (XML) has recently emerged to address the mystery mechanisms of machine learning (ML) systems by interpreting their 'black box' results. Despite the development of various explanation methods, determining the most suitable XML method for specific ML contexts remains unclear, highlighting the need for effective evaluation of explanations. The evaluating capabilities of the Transformer-based large language model (LLM) present an opportunity to adopt LLM-as-a-Judge for assessing explanations. In this paper, we propose a workflow that integrates both LLM-based and human judges for evaluating explanations. We examine how LLM-based judges evaluate the quality of various explanation methods and compare their evaluation capabilities to those of human judges within an iris classification scenario, employing both subjective and objective metrics. We conclude that…
Taxonomy
Methods: ADaptive gradient method with the OPTimal convergence rate
