A Case Study of Web App Coding with OpenAI Reasoning Models

Yi Cui

arXiv:2409.13773·cs.SE·September 24, 2024

TL;DR

This study evaluates OpenAI's reasoning models, o1-preview and o1-mini, on web app coding tasks. They achieve SOTA results on the WebApp1K benchmark but fall behind Claude 3.5 on the harder WebApp1K-Duo benchmark, which points to instruction comprehension as a key factor in coding success.

Contribution

The paper introduces WebApp1K-Duo, a harder web app coding benchmark that doubles the number of tasks and test cases, and analyzes how the performance of reasoning models varies under different conditions.

Findings

01. o1 models achieve SOTA on WebApp1K

02. Performance declines on the new WebApp1K-Duo benchmark

03. Models struggle with atypical yet correct test cases

Abstract

This paper presents a case study of coding tasks performed by OpenAI's latest reasoning models, i.e., o1-preview and o1-mini, in comparison with other frontier models. The o1 models deliver SOTA results on WebApp1K, a single-task benchmark. We then introduce WebApp1K-Duo, a harder benchmark that doubles the number of tasks and test cases. On the new benchmark, the o1 models' performance declines significantly, falling behind Claude 3.5. Moreover, they consistently fail when confronted with atypical yet correct test cases, a trap that non-reasoning models occasionally avoid. We hypothesize that the performance variability is due to instruction comprehension. Specifically, the reasoning mechanism boosts performance when all expectations are captured but exacerbates errors when key expectations are missed, an effect potentially influenced by input length. As such, we argue that the coding success of…

Code & Models

Repositories

onekq/webapp1k
Official

Datasets

onekq-ai/WebApp1K-Duo-React
dataset · 57 downloads

Taxonomy

Topics: Multimedia Communication and Technology

Methods: Shrink and Fine-Tune · Balanced Selection