Can GPT-4o Evaluate Usability Like Human Experts? A Comparative Study on Issue Identification in Heuristic Evaluation
Guilherme Guerino, Luiz Rodrigues, Bruna Capeleti, Rafael Ferreira Mello, André Freire, and Luciana Zaina

TL;DR
This study compares heuristic evaluations of web interfaces performed by GPT-4o and by human experts, revealing that GPT-4o is stronger on aesthetic heuristics but has difficulty with flexibility and efficiency heuristics.
Contribution
It provides the first comparative analysis of GPT-4o's heuristic evaluation performance against human experts in web-based systems.
Findings
GPT-4o identified 27 new issues not found by humans.
GPT-4o performed better on aesthetic and minimalist heuristics.
GPT-4o struggled with heuristics related to flexibility and user control.
Abstract
Heuristic evaluation is a widely used method in Human-Computer Interaction (HCI) to inspect interfaces and identify issues based on heuristics. Recently, Large Language Models (LLMs), such as GPT-4o, have been applied in HCI to assist in persona creation, the ideation process, and the analysis of semi-structured interviews. However, given the need to understand heuristics and the high degree of abstraction required to evaluate them, LLMs may have difficulty conducting heuristic evaluation. Moreover, prior research has not investigated how GPT-4o's performance in heuristic evaluation of web-based systems compares to that of HCI experts. In this context, this study aims to compare the results of a heuristic evaluation performed by GPT-4o and human experts. To this end, we selected a set of screenshots from a web system and asked GPT-4o to perform a heuristic evaluation based on Nielsen's…
