SimpleDevQA: Benchmarking Large Language Models on Development Knowledge QA

Jing Zhang; Lianghong Guo; Yanlin Wang; Mingwei Liu; Jiachi Chen; Yuchi Ma; Ensheng Shi; Terry Yue Zhuo; Hongyu Zhang; Zibin Zheng

arXiv:2512.08867 · cs.SE · December 10, 2025

TL;DR

This paper introduces SimpleDevQA, a multilingual benchmark derived from real user dialogues, to evaluate large language models on development knowledge questions, highlighting the limitations of existing benchmarks and demonstrating the impact of knowledge injection strategies.

Contribution

The paper builds a new benchmark from real user dialogues that covers a broader range of development knowledge, evaluates LLMs on it, and reports insights on knowledge injection and model confidence.

Findings

01

Code LLMs outperform general LLMs of similar scale.

02

Knowledge injection via RAG improves accuracy by 11.3%.

03

LLMs tend to be overconfident in their answers.

Abstract

The Development Knowledge Question Answering (Dev Knowledge QA) task aims to provide natural language answers to knowledge-seeking questions during software development. To investigate its importance and to what extent it has been explored, we analyze real user-LLM dialogues from WildChat and find that: (1) The Dev Knowledge QA task accounts for 39.6% of interactions (highest among all tasks), revealing broad knowledge needs beyond code generation (32.3%). (2) Only 27.5% of real Dev Knowledge QA dialogues focus on code understanding, leaving out development knowledge-seeking. (3) Only 17.1% of real-world Dev Knowledge QA dialogues can be used for constructing a benchmark. Existing benchmarks have two primary limitations for evaluating the Dev Knowledge QA capability of LLMs. First, existing benchmarks offer a limited development knowledge scope, mainly focusing on code understanding and…

Taxonomy

Topics: Topic Modeling · Software Engineering Research · Expert finding and Q&A systems