WebApp1K: A Practical Code-Generation Benchmark for Web App Development
Yi Cui

TL;DR
WebApp1K is a practical benchmark designed to evaluate and improve large language models' ability to generate correct and functional web application code, providing insights into model performance and limitations.
Contribution
The paper introduces WebApp1K, a lightweight, easy-to-run benchmark for assessing LLMs in web app development, and provides initial performance analysis of various models.
Findings
Open-source LLMs closely trail GPT-4o and Claude 3.5.
Model size correlates strongly with code correctness.
No prompting technique was found to lift performance universally across models or significantly for any single model.
Abstract
We introduce WebApp1K, a practical code-generation benchmark that measures the ability of LLMs to develop web apps. The benchmark aims to calibrate LLM output and help models progressively improve code correctness and functionality. It is lightweight and easy to run. We present the initial version of WebApp1K and share our findings from running the benchmark against the latest frontier LLMs. First, open-source LLMs deliver impressive performance, closely trailing GPT-4o and Claude 3.5. Second, model size correlates strongly with code correctness. Third, no prompting technique has been found to lift performance either universally across all models or significantly for any single model.
Taxonomy
Topics: Mobile and Web Applications · Web Applications and Data Management
