Can GPT-4o Evaluate Usability Like Human Experts? A Comparative Study on Issue Identification in Heuristic Evaluation

Guilherme Guerino, Luiz Rodrigues, Bruna Capeleti, Rafael Ferreira Mello, André Freire, and Luciana Zaina

arXiv:2506.16345·cs.HC·May 12, 2026

TL;DR

This study compares heuristic evaluations of a web system performed by GPT-4o and by human experts, finding that GPT-4o is stronger on aesthetic heuristics but struggles with issues related to flexibility and efficiency.

Contribution

The paper provides the first comparative analysis of heuristic evaluation by GPT-4o and by human experts on web-based systems.

Findings

01. GPT-4o identified 27 new issues that the human experts did not find.

02. GPT-4o performed better on heuristics related to aesthetic and minimalist design.

03. GPT-4o struggled with heuristics related to flexibility and user control.

Abstract

Heuristic evaluation is a widely used method in Human-Computer Interaction (HCI) to inspect interfaces and identify issues based on heuristics. Recently, Large Language Models (LLMs), such as GPT-4o, have been applied in HCI to assist in persona creation, the ideation process, and the analysis of semi-structured interviews. However, given the need to understand heuristics and the high degree of abstraction required to evaluate them, LLMs may have difficulty conducting heuristic evaluation. Prior research has not yet investigated how GPT-4o's performance in heuristic evaluation of web-based systems compares to that of HCI experts. In this context, this study aims to compare the results of a heuristic evaluation performed by GPT-4o and human experts. To this end, we selected a set of screenshots from a web system and asked GPT-4o to perform a heuristic evaluation based on Nielsen's…
