Catching UX Flaws in Code: Leveraging LLMs to Identify Usability Flaws at the Development Stage
Nolan Platt, Ethan Luchs, Sehrish Nizamani

TL;DR
This study uses GPT-4o to run early-stage heuristic usability evaluations of websites. Across repeated runs, the model was moderately consistent at detecting issues but far less consistent at rating their severity, which shows both the potential and the limits of the approach.
Contribution
Introduces a pipeline using GPT-4o for automated heuristic evaluation, providing one of the first quantitative analyses of inter-rater reliability in automated UX assessment.
Findings
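Across three independent evaluations per site, GPT-4o achieved 84% exact agreement in issue detection.
Issue detection was moderately consistent, with an average pairwise Cohen's Kappa of 0.50.
Severity judgments showed lower agreement: weighted Cohen's Kappa averaged 0.63, exact agreement was 56%, and Krippendorff's Alpha was near zero.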
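The agreement figures above can be illustrated with a minimal sketch of average pairwise exact agreement and Cohen's Kappa over repeated runs. This is not the authors' pipeline: the function names and the binary ratings are hypothetical and chosen only to show how the two statistics are computed. It also shows why high exact agreement can coexist with a moderate Kappa when one label dominates.

```python
from collections import Counter
from itertools import combinations


def exact_agreement(a, b):
    """Fraction of items on which two runs gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Unweighted Cohen's Kappa: agreement corrected for chance."""
    n = len(a)
    p_o = exact_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each run's marginal label frequencies.
    p_e = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both runs used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


def average_pairwise(metric, runs):
    """Mean of a pairwise metric over all unordered pairs of runs."""
    pairs = list(combinations(runs, 2))
    return sum(metric(a, b) for a, b in pairs) / len(pairs)


# Hypothetical data (an assumption, not from the paper):
# 1 = issue flagged, 0 = no issue, for ten items across three runs.
runs = [
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
]

print("exact agreement:", round(average_pairwise(exact_agreement, runs), 3))
print("Cohen's Kappa:  ", round(average_pairwise(cohen_kappa, runs), 3))
```

Because most items in this toy data are flagged, chance agreement is already high, so Kappa comes out noticeably lower than raw agreement.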
Abstract
Usability evaluations are essential for ensuring that modern interfaces meet user needs, yet traditional heuristic evaluations by human experts can be time-consuming and subjective, especially early in development. This paper investigates whether large language models (LLMs) can provide reliable and consistent heuristic assessments at the development stage. By applying Jakob Nielsen's ten usability heuristics to thirty open-source websites, we generated over 850 heuristic evaluations in three independent evaluations per site using a pipeline of OpenAI's GPT-4o. For issue detection, the model demonstrated moderate consistency, with an average pairwise Cohen's Kappa of 0.50 and an exact agreement of 84%. Severity judgments showed more variability: weighted Cohen's Kappa averaged 0.63, but exact agreement was just 56%, and Krippendorff's Alpha was near zero. These results suggest that…
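To make the gap between weighted Kappa and exact agreement on severity concrete, here is a minimal sketch of a weighted Cohen's Kappa for ordinal ratings. Several things are assumptions rather than details from the paper: the linear weighting scheme, the 0 to 4 severity scale, the function name, and the ratings themselves.

```python
from collections import Counter


def weighted_kappa(a, b, categories):
    """Linear-weighted Cohen's Kappa for two runs of ordinal ratings.

    Disagreements are penalized in proportion to their distance on the
    scale, so near-misses cost less than large jumps.
    """
    n = len(a)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}

    def weight(x, y):
        # 0 for identical ratings, 1 for opposite ends of the scale.
        return abs(index[x] - index[y]) / (k - 1)

    count_a, count_b = Counter(a), Counter(b)
    observed = sum(weight(x, y) for x, y in zip(a, b)) / n
    expected = sum(
        weight(x, y) * count_a[x] * count_b[y]
        for x in count_a
        for y in count_b
    ) / (n * n)
    if expected == 0:
        return 1.0  # both runs used one identical rating throughout
    return 1 - observed / expected


# Hypothetical severity ratings on an assumed 0-4 scale.
severity_scale = [0, 1, 2, 3, 4]
run_a = [0, 1, 2, 3, 4, 2, 3, 1]
run_b = [0, 2, 2, 3, 3, 1, 3, 1]

exact = sum(x == y for x, y in zip(run_a, run_b)) / len(run_a)
print("exact agreement:", exact)
print("weighted Kappa: ", round(weighted_kappa(run_a, run_b, severity_scale), 3))
```

In this toy data every disagreement is off by a single severity level, so the weighted statistic stays fairly high even though only five of eight ratings match exactly.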
Taxonomy
Topics: Usability and User Interface Design · Software Engineering Research · Digital Accessibility for Disabilities
