WebApp1K: A Practical Code-Generation Benchmark for Web App Development

Yi Cui

arXiv:2408.00019 · cs.SE · August 2, 2024 · 1 citation

TL;DR

WebApp1K is a practical benchmark designed to evaluate and improve large language models' ability to generate correct and functional web application code, providing insights into model performance and limitations.

Contribution

The paper introduces WebApp1K, a lightweight, easy-to-run benchmark for assessing LLMs in web app development, and provides initial performance analysis of various models.

Findings

1. Open-source LLMs perform close to GPT-4o and Claude 3.5, trailing them only slightly.

2. Model size correlates strongly with code correctness.

3. No prompting technique was found to improve performance across all models, or to improve it significantly for any single model.

Abstract

We introduce WebApp1K, a practical code-generation benchmark that measures the ability of LLMs to develop web apps. The benchmark aims to calibrate LLM output and help models progressively improve code correctness and functionality. It is lightweight and easy to run. We present the initial version of WebApp1K and share our findings from running the benchmark against the latest frontier LLMs. First, open-source LLMs deliver impressive performance, closely trailing GPT-4o and Claude 3.5. Second, model size has a strong correlation with code correctness. Third, no prompting technique has been found to lift performance either universally across all models or significantly for any single model.
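To make the idea of comparing models by code correctness concrete, here is a minimal sketch of how per-model pass rates could be aggregated. It assumes each generated solution is judged pass or fail per problem. That scoring scheme, the function name `pass_rate_by_model`, the record layout, and the demo records are all illustrative assumptions and are not taken from the paper or its repository.

```python
from collections import defaultdict


def pass_rate_by_model(results):
    """Return {model: fraction of problems passed}.

    `results` is an iterable of (model, problem_id, passed) tuples.
    This layout is hypothetical, chosen only for illustration.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for model, _problem_id, passed in results:
        totals[model] += 1
        if passed:
            passes[model] += 1
    return {model: passes[model] / totals[model] for model in totals}


# Placeholder records for illustration only; these are NOT results from the paper.
demo = [
    ("model-a", "task-1", True),
    ("model-a", "task-2", False),
    ("model-b", "task-1", True),
    ("model-b", "task-2", True),
]
rates = pass_rate_by_model(demo)
```

Under these assumptions, the resulting per-model rates could be set against model size or prompting variant to examine the kinds of trends the abstract reports.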

Code & Models

Repositories

onekq/webapp1k (Official)

Taxonomy

Topics: Mobile and Web Applications · Web Applications and Data Management