Non-programmers Assessing AI-Generated Code: A Case Study of Business Users Analyzing Data
Yuvraj Virk, Dongyu Liu

TL;DR
This study investigates whether non-technical business users can effectively evaluate AI-generated data analyses, revealing significant challenges in identifying errors and emphasizing the need for improved AI explanation and oversight methods.
Contribution
It provides empirical evidence that non-programmers struggle to critically assess AI-generated code and analyses, highlighting the necessity for better evaluation tools and explanations.
Findings
Participants often failed to detect critical AI errors.
Reformatting AI responses into clearly delineated steps, with alternative approaches for each, improved critical evaluation somewhat.
Business users cannot reliably verify AI-generated analyses without additional support.
Abstract
Non-technical end-users increasingly rely on AI code generation to perform technical tasks like data analysis. However, large language models (LLMs) remain unreliable, and it is unclear whether end-users can effectively identify model errors, especially in realistic and domain-specific scenarios. We surveyed marketing and sales professionals to assess their ability to critically evaluate LLM-generated analyses of marketing data. Participants were shown natural language explanations of the AI's code, repeatedly informed that the AI often makes mistakes, and explicitly prompted to identify them. Yet participants frequently failed to detect critical flaws that could compromise decision-making, many of which required no technical knowledge to recognize. To investigate why, we reformatted AI responses into clearly delineated steps and provided alternative approaches for each…
Taxonomy
Topics: Ethics and Social Impacts of AI
