CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery

Xiaoshuai Song; Muxi Diao; Guanting Dong; Zhengyang Wang; Yujia Fu; Runqi Qiao; Zhexu Wang; Dayuan Fu; Huangxuan Wu; Bin Liang; Weihao Zeng; Yejie Wang; Zhuoma GongQue; Jianing Yu; Qiuna Tan; Weiran Xu

arXiv:2406.08587·cs.CL·March 3, 2025

TL;DR

CS-Bench is a multilingual, comprehensive benchmark designed to evaluate large language models across diverse computer science subfields, revealing insights into their strengths, weaknesses, and potential for CS mastery.

Contribution

Introduces CS-Bench, the first extensive multilingual benchmark for evaluating LLMs in computer science, covering 26 subfields and enabling comprehensive performance analysis.

Findings

01 Larger models perform better on CS tasks.

02 Knowledge supplementation improves LLM performance.

03 Math and coding skills correlate with CS capabilities.

Abstract

Large language models (LLMs) have demonstrated significant potential in advancing various fields of research and society. However, the current community of LLMs overly focuses on benchmarks for analyzing specific foundational skills (e.g. mathematics and code generation), neglecting an all-round evaluation of the computer science field. To bridge this gap, we introduce CS-Bench, the first multilingual (English, Chinese, French, German) benchmark dedicated to evaluating the performance of LLMs in computer science. CS-Bench comprises approximately 10K meticulously curated test samples, covering 26 subfields across 4 key areas of computer science, encompassing various task forms and divisions of knowledge and reasoning. Utilizing CS-Bench, we conduct a comprehensive evaluation of over 30 mainstream LLMs, revealing the relationship between CS performance and model scales. We also…

Code & Models

Repositories

csbench/csbench (PyTorch, official)

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Natural Language Processing Techniques · Topic Modeling