Can LLMs Generate Reliable Test Case Generators? A Study on Competition-Level Programming Problems

Yuhan Cao; Zian Chen; Kun Quan; Ziliang Zhang; Yu Wang; Xiaoning Dong; Yeqi Feng; Guanzhong He; Jingcheng Huang; Jianhao Li; Yixuan Tan; Jiafu Tang; Yilin Tang; Junlei Wu; Qianyu Xiao; Can Zheng; Shouchen Zhou; Yuxiang Zhu; Yiming Huang; Tianxing He

arXiv:2506.06821 · cs.CL · January 15, 2026

TL;DR

This paper evaluates the ability of large language models to generate reliable test case generators for competitive programming problems, introducing a benchmark and analyzing their strengths and limitations in generating both valid and targeted test cases.

Contribution

The paper introduces TCGBench, a benchmark for evaluating how well LLMs generate test case generators for competitive programming problems, and analyzes their capabilities and limitations on this task.

Findings

01. LLMs can generate valid test case generators in most cases.

02. LLMs struggle to generate targeted test cases that expose bugs.

03. Performance can be improved with curated instruction datasets.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, capable of tackling complex tasks during inference. However, the extent to which LLMs can be utilized for code checking or debugging through test case generation remains largely unexplored. We investigate this problem from the perspective of competition-level programming (CP) programs and propose TCGBench, a Benchmark for (LLM generation of) Test Case Generators. This benchmark comprises two tasks, aimed at studying the capabilities of LLMs in (1) generating valid test case generators for a given CP problem, and further (2) generating targeted test case generators that expose bugs in human-written code. Experimental results indicate that while state-of-the-art LLMs can generate valid test case generators in most cases, most LLMs struggle to generate targeted test cases that reveal flaws in human…
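
To make the two tasks concrete, the sketch below contrasts a valid generator with a targeted one on a toy problem. Everything in it is an illustrative assumption and not taken from TCGBench: the problem ("print the maximum of n integers", with 1 ≤ n ≤ 1000 and |a_i| ≤ 10^9), the buggy solution, and all function names are hypothetical.

```python
import random

# Hypothetical toy problem (not from TCGBench): read n, then n integers,
# and print their maximum. Assumed constraints: 1 <= n <= 1000, |a_i| <= 10**9.
MAX_N = 1000
MAX_A = 10**9


def format_test(values):
    """Serialize a list of integers into the assumed input format."""
    return f"{len(values)}\n{' '.join(map(str, values))}\n"


def parse_test(test_input):
    """Parse the assumed input format back into (n, values)."""
    first, second = test_input.strip().split("\n")
    return int(first), [int(tok) for tok in second.split()]


def is_valid(test_input):
    """Check that a test input satisfies the assumed constraints."""
    n, values = parse_test(test_input)
    return (1 <= n <= MAX_N and len(values) == n
            and all(abs(v) <= MAX_A for v in values))


def generate_valid(rng):
    """Task 1 style: any input that respects the problem constraints."""
    n = rng.randint(1, MAX_N)
    return format_test([rng.randint(-MAX_A, MAX_A) for _ in range(n)])


def generate_targeted(rng, n=5):
    """Task 2 style: aim at one specific bug. Here every value is
    negative, which breaks a solution that starts its maximum at 0."""
    return format_test([rng.randint(-MAX_A, -1) for _ in range(n)])


def correct_solution(test_input):
    _, values = parse_test(test_input)
    return max(values)


def buggy_solution(test_input):
    """Hypothetical erroneous submission: initializes the answer to 0."""
    _, values = parse_test(test_input)
    best = 0
    for v in values:
        if v > best:
            best = v
    return best


def exposes_bug(test_input):
    """A test exposes the bug if it is valid and the outputs differ."""
    return (is_valid(test_input)
            and correct_solution(test_input) != buggy_solution(test_input))


if __name__ == "__main__":
    rng = random.Random(0)
    print("random valid test exposes bug:", exposes_bug(generate_valid(rng)))
    print("targeted test exposes bug:", exposes_bug(generate_targeted(rng)))
```

In this toy setting a random valid test rarely consists only of negative values, so it will usually miss the bug, whereas the targeted generator triggers it by construction. That gap is the kind of difference the second task is designed to probe.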

Taxonomy

Topics: Software Testing and Debugging Techniques · Topic Modeling · Machine Learning and Algorithms