Are LLMs Ready for Computer Science Education? A Cross-Domain, Cross-Lingual and Cognitive-Level Evaluation Using Professional Certification Exams

Chen Gao; Chi Liu; Zhengquan Luo; Dongfu Xiao; Maiying Sui; Sheng Shen; Congcong Zhu; Huajie Chen; Xuhan Zuo; Zongyuan Ge; Tianqing Zhu; Wanlei Zhou; Xiaotong Han

arXiv:2604.06898·cs.CY·April 9, 2026

TL;DR

This study systematically evaluates four recent large language models across multiple domains, languages, and cognitive levels using certification exam questions to assess their suitability for computer science education.

Contribution

It provides a comprehensive cross-domain, cross-lingual, and cognitive-level benchmark of LLMs, highlighting their strengths and limitations for educational applications.

Findings

01

GPT-5 excels in English-language certifications.

02

Qwen-Plus performs better in Chinese contexts.

03

All models struggle with higher-order reasoning and complex tasks.

Abstract

Large language models (LLMs) are increasingly applied in computer science education for tasks such as tutoring, content generation, and code assessment. However, systematic evaluations aligned with formal curricula and certification standards remain limited. This study benchmarked four recent models (GPT-5, DeepSeek-R1, Qwen-Plus, and Llama-3.3-70B-Instruct) using a dataset of 1,068 questions derived from six certification exams covering networking, office applications, and Java programming. We evaluated performance along five dimensions: language (Chinese vs. English), cognitive level based on Bloom's Taxonomy, domain knowledge, confidence-accuracy alignment, and robustness to input masking. Results showed that GPT-5 performed best on English-language certifications, while Qwen-Plus performed better in Chinese contexts. DeepSeek-R1 achieved the most balanced cross-lingual performance,…
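
The abstract does not say how confidence-accuracy alignment was measured. One common way to quantify it is expected calibration error (ECE), which compares a model's stated confidence with its observed accuracy inside confidence bins. The sketch below is an illustration under that assumption, not the authors' method; the function name, the ten equal-width bins, and the sample values are all hypothetical.

```python
from typing import List


def expected_calibration_error(
    confidences: List[float], correct: List[bool], n_bins: int = 10
) -> float:
    """Weighted average of |accuracy - mean confidence| over equal-width bins.

    Assumption: each confidence is a self-reported probability in [0, 1].
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Group (confidence, is_correct) pairs by confidence bin.
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin contributes in proportion to the share of questions it holds.
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Hypothetical example: four answers given at 90% confidence, three correct.
sample_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

An ECE of 0 means stated confidence matches accuracy in every bin; larger values indicate over- or under-confidence. In the example, confidence is 0.90 while accuracy is 0.75, so the gap is about 0.15.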
