Tricky$^2$: Towards a Benchmark for Evaluating Human and LLM Error Interactions

Cole Granger; Dipin Khati; Daniel Rodriguez-Cardenas; Denys Poshyvanyk

arXiv:2601.18949 · cs.SE · January 28, 2026

TL;DR

This paper introduces Tricky$^2$, a benchmark dataset that combines human-written and LLM-injected errors in code to support analysis of error interactions, repair robustness, and hybrid code reliability.

Contribution

It presents a hybrid dataset of human and LLM errors across C++, Python, and Java programs, enabling analysis of mixed-origin error behavior and multi-bug repair.

Findings

01. The dataset includes human-only, LLM-only, and mixed (human+LLM) errors.

02. Baseline evaluations demonstrate the dataset's utility for classification, localization, and repair tasks.

03. The framework supports analysis of error interaction and repair robustness in hybrid code.

Abstract

Large language models (LLMs) are increasingly integrated into software development workflows, yet they often introduce subtle logic or data-misuse errors that differ from human bugs. To study how these two error types interact, we construct Tricky$^2$, a hybrid dataset that augments the existing TrickyBugs corpus of human-written defects with errors injected by both GPT-5 and OpenAI-oss-20b across C++, Python, and Java programs. Our approach uses a taxonomy-guided prompting framework to generate machine-originated bugs while preserving original human defects and program structure. The resulting corpus spans human-only, LLM-only, and human+LLM splits, enabling analysis of mixed-origin error behavior, multi-bug repair robustness, and reliability in hybrid human-machine code. This paper outlines the dataset construction pipeline and illustrates its use through small-scale baseline…
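The three splits named in the abstract can be illustrated with a minimal sketch. It assumes each program carries a list of bug annotations tagged by origin. The names `Origin`, `Bug`, `Sample`, and `split_of`, and the example taxonomy labels, are hypothetical and are not taken from the paper or from the released dataset schema.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    HUMAN = "human"  # defect already present in the TrickyBugs program
    LLM = "llm"      # defect injected by a language model


@dataclass(frozen=True)
class Bug:
    origin: Origin
    category: str  # hypothetical taxonomy label, e.g. "logic" or "data-misuse"


@dataclass(frozen=True)
class Sample:
    language: str            # "C++", "Python", or "Java"
    bugs: tuple[Bug, ...]    # all defects annotated on this program


def split_of(sample: Sample) -> str:
    """Assign a sample to a split based on where its bugs came from."""
    origins = {bug.origin for bug in sample.bugs}
    if origins == {Origin.HUMAN}:
        return "human-only"
    if origins == {Origin.LLM}:
        return "LLM-only"
    if origins == {Origin.HUMAN, Origin.LLM}:
        return "human+LLM"
    raise ValueError("sample has no recorded bugs")
```

Under these assumptions, a program that keeps its original human defect and also receives an injected bug lands in the human+LLM split, which is the case the dataset is meant to make available for multi-bug repair experiments.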

Taxonomy

Topics: Software Engineering Research · Scientific Computing and Data Management · Software Testing and Debugging Techniques