Synthetic Heuristic Evaluation: A Comparison between AI- and Human-Powered Usability Evaluation
Ruican Zhong, David W. McDonald, Gary Hsieh

TL;DR
This paper introduces a synthetic heuristic evaluation method that uses multimodal LLMs to identify usability issues in app designs. Across two apps it identified 73% and 77% of usability issues, exceeding the performance of 5 experienced human evaluators (57% and 63%), and it offers a cost-effective alternative.
Contribution
We developed a novel synthetic heuristic evaluation approach leveraging multimodal LLMs, demonstrating its effectiveness compared to experienced human evaluators.
Findings
Synthetic evaluation identified 73% and 77% of usability issues across two apps, compared with 57% and 63% for 5 experienced human evaluators.
Synthetic evaluation outperformed human evaluators in detecting layout issues.
Performance of synthetic evaluation remained stable over time and across accounts.
Abstract
Usability evaluation is crucial in human-centered design but can be costly, requiring expert time and user compensation. In this work, we developed a method for synthetic heuristic evaluation using multimodal LLMs' ability to analyze images and provide design feedback. Comparing our synthetic evaluations to those by experienced UX practitioners across two apps, we found our evaluation identified 73% and 77% of usability issues, which exceeded the performance of 5 experienced human evaluators (57% and 63%). Compared to human evaluators, the synthetic evaluation performed consistently across tasks and excelled in detecting layout issues, highlighting potential attentional and perceptual strengths of synthetic evaluation. However, synthetic evaluation struggled with recognizing some UI components and design conventions, as well as identifying across screen…
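The percentages reported above are shares of usability issues that an evaluation identified. A minimal sketch of that arithmetic follows, assuming each app has a fixed reference set of issues and that an evaluator's findings can be matched to it by ID. This excerpt does not describe the paper's actual matching procedure, and the function name `detection_rate` and all issue IDs below are hypothetical, not data from the paper.

```python
def detection_rate(found, reference):
    """Return the fraction of reference issues that appear in `found`.

    Findings outside the reference set are ignored, so they do not
    raise the rate.
    """
    reference = set(reference)
    if not reference:
        raise ValueError("reference issue set must not be empty")
    return len(reference & set(found)) / len(reference)


# Hypothetical issue IDs for illustration only; not data from the paper.
reference_issues = {"I1", "I2", "I3", "I4"}
synthetic_found = {"I1", "I2", "I3", "X9"}  # X9 is not a reference issue
human_found = {"I1", "I4"}

print(detection_rate(synthetic_found, reference_issues))  # 0.75
print(detection_rate(human_found, reference_issues))      # 0.5
```

Comparing rates this way only makes sense when both evaluations are scored against the same reference set for the same app.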
