Success is in the Details: Evaluate and Enhance Details Sensitivity of Code LLMs through Counterfactuals

Xianzhen Luo; Qingfu Zhu; Zhiming Zhang; Mingzheng Xu; Tianhao Cheng; Yixuan Wang; Zheng Chu; Shijie Xuyang; Zhiyuan Ma; YuanTao Fan; Wanxiang Che

arXiv:2505.14597·cs.CL·May 21, 2025

TL;DR

This paper introduces a benchmark (CTF-Code) and an instruction fine-tuning framework (CTF-Instruct) to evaluate and improve code LLMs' sensitivity to detail changes in problem descriptions, with reported gains on CTF-Code and LiveCodeBench.

Contribution

The paper presents CTF-Code, a benchmark built from counterfactual perturbations to measure code sensitivity, and CTF-Instruct, an incremental instruction fine-tuning framework that enhances LLMs' sensitivity to details in code tasks.

Findings

01

Many LLMs show a performance drop of more than 10% on CTF-Code compared to the original, unperturbed problems.

02

Fine-tuning with CTF-Instruct data improves LLM performance by over 2% on CTF-Code.

03

Fine-tuning with CTF-Instruct data also yields a performance boost of more than 10% on LiveCodeBench.

Abstract

Code Sensitivity refers to the ability of Code LLMs to recognize and respond to detail changes in problem descriptions. While current code benchmarks and instruction data focus on difficulty and diversity, sensitivity is overlooked. We first introduce the CTF-Code benchmark, constructed using counterfactual perturbations that minimize input changes while maximizing output changes. The evaluation shows that many LLMs have a performance drop of more than 10% compared to the original problems. To fully utilize sensitivity, CTF-Instruct, an incremental instruction fine-tuning framework, builds on existing data and uses a selection mechanism to meet the three dimensions of difficulty, diversity, and sensitivity. Experiments show that LLMs fine-tuned with CTF-Instruct data achieve an improvement of over 2% on CTF-Code and a performance boost of more than 10% on LiveCodeBench, validating the…
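The idea of a counterfactual perturbation (a small change to the problem description that forces a large change in the correct output) can be illustrated with a toy pair. The sketch below is not the paper's construction pipeline: the two problem statements, the reference solutions, the test inputs, and both metrics (`input_similarity`, `output_divergence`) are hypothetical and chosen only for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical counterfactual pair (not taken from CTF-Code): a single word
# of the description changes, and the correct program changes with it.
ORIGINAL = "Return the sum of the even numbers in the list."
COUNTERFACTUAL = "Return the sum of the odd numbers in the list."


def solve_original(nums):
    """Reference solution for the original description."""
    return sum(n for n in nums if n % 2 == 0)


def solve_counterfactual(nums):
    """Reference solution for the perturbed description."""
    return sum(n for n in nums if n % 2 != 0)


def input_similarity(a, b):
    """Word-level similarity in [0, 1]; higher means a smaller edit."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def output_divergence(f, g, test_inputs):
    """Fraction of test inputs on which the two solutions disagree."""
    disagreements = sum(1 for x in test_inputs if f(x) != g(x))
    return disagreements / len(test_inputs)


# Illustrative test inputs (an assumption, not benchmark data).
TEST_INPUTS = [[1, 2, 3], [2, 4], [1, 3, 5], [10, 11]]

similarity = input_similarity(ORIGINAL, COUNTERFACTUAL)
divergence = output_divergence(solve_original, solve_counterfactual, TEST_INPUTS)
print(f"input similarity: {similarity:.2f}, output divergence: {divergence:.2f}")
```

For this toy pair, nine of the ten words are shared, yet the two reference solutions disagree on every test input. A model that answers the perturbed description with the original solution would fail all of these tests, which is the kind of insensitivity the benchmark is described as probing.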

Code & Models

Repositories

luowaterbi/ctf-instruct
none · Official

Taxonomy

Topics: Software Engineering Research · Software Testing and Debugging Techniques · Topic Modeling

Methods: Focus