CONCUR: Benchmarking LLMs for Concurrent Code Generation
Jue Huang, Tarek Mahmud, Corina Pasareanu, Guowei Yang

TL;DR
CONCUR is a new benchmark designed to evaluate large language models' ability to generate concurrent code, addressing the gap left by existing benchmarks that focus only on sequential code, and highlighting current models' limitations.
Contribution
We created CONCUR, a specialized benchmark with 115 concurrency problems, to evaluate LLMs' performance on concurrent code generation, a previously underexplored area.
Findings
Current LLMs show limitations in generating correct concurrent code.
CONCUR provides a diverse set of concurrency problems for evaluation.
The benchmark highlights the need for improved models for concurrent code generation.
Abstract
Leveraging Large Language Models (LLMs) for code generation has increasingly emerged as a common practice in the domain of software engineering. Relevant benchmarks have been established to evaluate the code generation capabilities of LLMs. However, existing benchmarks focus primarily on sequential code, lacking the ability to effectively evaluate LLMs on concurrent code generation. Compared to sequential code, concurrent code exhibits greater complexity and possesses unique types of bugs, such as deadlocks and race conditions, that do not occur in sequential code. Therefore, a benchmark for evaluating sequential code generation cannot be useful for evaluating concurrent code generation with LLMs. To address this gap, we designed a benchmark CONCUR specifically aimed at evaluating the capability of LLMs to generate concurrent code. CONCUR consists of a base set of 43 concurrency…
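To make the abstract's point about concurrency-specific bugs concrete, here is a minimal Java sketch of a race condition. It is not taken from CONCUR; the class name `RaceConditionSketch` and the helper `run` are hypothetical names chosen for illustration. Several threads increment a plain `int` and an `AtomicInteger` side by side. The plain counter can lose updates because `count++` is a separate read, add, and write, while the atomic counter always reaches the expected total.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only; not part of the CONCUR benchmark.
public class RaceConditionSketch {
    // Plain field: unsafeCount++ is a read-modify-write, so concurrent updates can be lost.
    static int unsafeCount = 0;
    // Atomic counter: incrementAndGet() applies each update as one indivisible step.
    static final AtomicInteger safeCount = new AtomicInteger();

    // Runs `threads` workers, each incrementing both counters `incrementsPerThread` times.
    // Returns {unsafe total, safe total}.
    static int[] run(int threads, int incrementsPerThread) throws InterruptedException {
        unsafeCount = 0;
        safeCount.set(0);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    unsafeCount++;               // racy
                    safeCount.incrementAndGet(); // thread-safe
                }
            });
            workers[i].start();
        }
        // join() guarantees the workers' writes are visible to this thread afterwards.
        for (Thread w : workers) {
            w.join();
        }
        return new int[] { unsafeCount, safeCount.get() };
    }

    public static void main(String[] args) throws InterruptedException {
        int[] result = run(4, 100_000);
        System.out.println("expected: " + (4 * 100_000));
        System.out.println("unsafe:   " + result[0] + " (may be lower, varies per run)");
        System.out.println("safe:     " + result[1]);
    }
}
```

The unsafe total differs from run to run and may even happen to be correct, which is why ordinary test cases tend to miss this class of bug. A sequential version of the same loop has no such failure mode.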
Peer Reviews
Decision · Submitted to ICLR 2026
1. Novelty in Benchmark Design: The paper introduces CONCUR, the first benchmark specifically targeting multi-threaded code generation, filling a gap left by prior benchmarks that focus only on sequential programs.
2. Concurrency-Aware Problem Construction: The benchmark is carefully designed to enforce multi-threading features and concurrency-specific requirements, ensuring that generated programs must exhibit correct thread behavior and handle potential concurrency issues.
3. Clear Presentat
1. The benchmark includes only 43 Java programs, which is a relatively small number and may limit its coverage of diverse concurrent programming scenarios.
2. Although prompts for each program are provided in the public repository, the programs themselves are simple in functionality and description, which may not effectively evaluate LLMs’ ability to generate complex multi-threaded code.
- Paper is well-written and motivations are sound (most code benchmarks focus on single proc. code).
- Thoughtful methodology. Steps are taken to ensure solutions can be verified without excessive computation resources (e.g. by limiting num threads, etc)
- Paper shows benefit of going beyond CodeBLEU and provides error analysis of top models.
- Benchmark is simple and problems are sourced from a textbook published in 2006. The proposed test set is only 43 problems, curated by the authors, which is extremely small. Additionally there is a risk of contamination as models may have trained on this textbook. The authors do not provide any insight into the contamination risk.
- Evaluation metric is based on compilation and verification success, not test case passing. Due to the nature of the problems (which are presented without completed
The benchmark seems to be well-designed, and the evaluation of the 22 LLMs is thorough, covering all of the important models. By far the greatest strength of this benchmark is its use of model-checking to catch concurrency bugs. As the use of coding LLMs continues to proliferate, so will LLM-introduced bugs, and concurrency-related bugs like race conditions are notoriously difficult to catch. Formal static or dynamic analysis is currently severely underused as a way to evaluate code quality,
The benchmark only contains 43 problems. Perhaps most importantly the problems are all drawn from a textbook, which was published almost 20 years ago. This means that the textbook, or similar problems, are likely in the training data of SOTA LLMs. The benchmark would be strengthened by having more problems, and including not-previously published problems. It would be interesting to include problems in a language other than Java -- e.g. the Rust type system also protects against concurren
Taxonomy
Topics: Software Engineering Research · Software Engineering Techniques and Practices · Model-Driven Software Engineering Techniques
