FullStack Bench: Evaluating LLMs as Full Stack Coders

Bytedance-Seed-Foundation-Code-Team: Yao Cheng; Jianfeng Chen; Jie Chen; Li Chen; Liyu Chen; Wentao Chen; Zhengyu Chen; Shijie Geng; Aoyan Li; Bo Li; Bowen Li; Linyi Li; Boyi Liu; Jiaheng Liu; Kaibo Liu; Qi Liu; Shukai Liu; Siyao Liu; Tianyi Liu; Tingkai Liu; Yongfei Liu; Rui Long; Jing Mai; Guanghan Ning; Z.Y. Peng; Kai Shen; Jiahao Su; Jing Su; Tao Sun; Yifan Sun; Yunzhe Tao; Guoyin Wang; Siwei Wang; Xuwu Wang; Yite Wang; Zihan Wang; Jinxiang Xia; Liang Xiang; Xia Xiao; Yongsheng Xiao; Chenguang Xi; Shulin Xin; Jingjing Xu; Shikun Xu; Hongxia Yang; Jack Yang; Yingxiang Yang; Jianbo Yuan; Jun Zhang; Yufeng Zhang; Yuyu Zhang; Shen Zheng; He Zhu; Ming Zhu

arXiv:2412.00535·cs.AI·May 13, 2025·2 cites

Open Access

TL;DR

This paper introduces FullStack Bench, a comprehensive dataset and evaluation framework for assessing large language models' capabilities across full-stack programming tasks in multiple languages, with a supporting sandbox tool.

Contribution

The paper presents a new multi-domain, multilingual code evaluation dataset and an execution sandbox, addressing limitations of existing benchmarks and enabling more realistic assessments.

Findings

01. FullStack Bench covers diverse application domains and 16 programming languages.

02. Experimental results show the effectiveness of the FullStack Bench dataset and the SandboxFusion tool.

03. The evaluation highlights the strengths and limitations of current LLMs in full-stack coding.

Abstract

As the capabilities of code large language models (LLMs) continue to expand, their applications across diverse code intelligence domains are rapidly increasing. However, most existing datasets evaluate only a limited set of application domains. To address this gap, we have developed a comprehensive code evaluation dataset, FullStack Bench, focusing on full-stack programming, which encompasses a wide range of application domains (e.g., basic programming, data analysis, software engineering, mathematics, and machine learning). In addition, to assess multilingual programming capabilities, FullStack Bench provides real-world instructions and corresponding unit test cases in 16 widely used programming languages, designed to reflect real-world usage scenarios rather than simple translations. Moreover, we also release an effective code sandbox execution tool (i.e., SandboxFusion) supporting various programming…
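The abstract pairs each instruction with unit test cases that are run in an execution sandbox. The Python sketch below illustrates that general idea only: it is not SandboxFusion's API, and the names `passes_unit_tests` and `pass_rate` are hypothetical helpers introduced for illustration. It assumes the tests are plain `assert` statements appended to the candidate solution, and it uses an ordinary subprocess with a timeout, which does not provide real isolation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_unit_tests(solution: str, tests: str, timeout_s: float = 10.0) -> bool:
    """Hypothetical helper: True if the tests run against the solution exit with status 0.

    Assumption: tests are plain assert statements appended to the solution source.
    A subprocess is NOT a security sandbox; this only illustrates the workflow.
    """
    program = solution + "\n\n" + tests + "\n"
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(program, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # A candidate that hangs counts as a failure.
            return False
    return result.returncode == 0


def pass_rate(outcomes: list[bool]) -> float:
    """Fraction of problems whose tests passed (0.0 for an empty list)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


if __name__ == "__main__":
    tests = "assert add(2, 3) == 5"
    good = "def add(a, b):\n    return a + b"
    bad = "def add(a, b):\n    return a - b"
    outcomes = [passes_unit_tests(good, tests), passes_unit_tests(bad, tests)]
    print(outcomes, pass_rate(outcomes))
```

A multilingual benchmark would additionally need a per-language compile-and-run step and stronger isolation than a subprocess, which is the kind of work a dedicated sandbox tool takes on.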

Code & Models

Datasets

ByteDance/FullStackBench
dataset · 195 dl

Taxonomy

Topics: Library Science and Information Systems · Biomedical Text Mining and Ontologies