Tricky$^2$: Towards a Benchmark for Evaluating Human and LLM Error Interactions
Cole Granger, Dipin Khati, Daniel Rodriguez-Cardenas, Denys Poshyvanyk

TL;DR
This paper introduces Tricky$^2$, a comprehensive benchmark dataset combining human and LLM-generated errors in code, to analyze error interactions, repair robustness, and hybrid code reliability.
Contribution
It presents a hybrid dataset that combines human-written and LLM-injected errors across C++, Python, and Java programs, enabling detailed analysis of mixed-error behaviors and repair strategies.
Findings
Dataset includes human-only, LLM-only, and mixed errors.
Baseline evaluations demonstrate the dataset's utility for classification, localization, and repair tasks.
Framework supports analysis of error interaction and repair robustness in hybrid code.
Abstract
Large language models (LLMs) are increasingly integrated into software development workflows, yet they often introduce subtle logic or data-misuse errors that differ from human bugs. To study how these two error types interact, we construct Tricky$^2$, a hybrid dataset that augments the existing TrickyBugs corpus of human-written defects with errors injected by both GPT-5 and OpenAI-oss-20b across C++, Python, and Java programs. Our approach uses a taxonomy-guided prompting framework to generate machine-originated bugs while preserving original human defects and program structure. The resulting corpus spans human-only, LLM-only, and human+LLM splits, enabling analysis of mixed-origin error behavior, multi-bug repair robustness, and reliability in hybrid human-machine code. This paper outlines the dataset construction pipeline and illustrates its use through small-scale baseline…
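As a rough illustration of the three splits named in the abstract, the sketch below assigns each sample to human-only, LLM-only, or human+LLM based on which error origins it contains. The `Sample` record, its field names, and the category labels are hypothetical and chosen only for illustration; they are not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sample:
    """Hypothetical record for one buggy program (not the real schema)."""
    program_id: str
    language: str                      # e.g. "cpp", "python", "java"
    has_human_bug: bool                # original human-written defect is present
    llm_bug_categories: List[str] = field(default_factory=list)  # injected bug labels


def split_of(sample: Sample) -> str:
    """Return the split name implied by the error origins in the sample."""
    has_llm_bug = bool(sample.llm_bug_categories)
    if sample.has_human_bug and has_llm_bug:
        return "human+llm"
    if sample.has_human_bug:
        return "human-only"
    if has_llm_bug:
        return "llm-only"
    return "clean"  # assumption: a sample with no bugs falls outside the three splits


def group_by_split(samples: List[Sample]) -> Dict[str, List[str]]:
    """Map each split name to the program ids that belong to it."""
    groups: Dict[str, List[str]] = {}
    for s in samples:
        groups.setdefault(split_of(s), []).append(s.program_id)
    return groups


# Illustrative samples; ids and labels are made up.
samples = [
    Sample("p1", "python", has_human_bug=True),
    Sample("p2", "java", has_human_bug=False, llm_bug_categories=["logic"]),
    Sample("p3", "cpp", has_human_bug=True, llm_bug_categories=["data-misuse"]),
]
groups = group_by_split(samples)
```

Keeping the origin of each bug explicit on the record, rather than storing only a split label, would let the same sample be reused for classification, localization, and repair baselines.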
Taxonomy
Topics: Software Engineering Research · Scientific Computing and Data Management · Software Testing and Debugging Techniques
