CODE-DITING: A Reasoning-Based Metric for Functional Alignment in Code Evaluation
Guang Yang, Yu Zhou, Xiang Chen, Wei Zheng, Xing Hu, Xin Zhou, David Lo, Taolue Chen

TL;DR
This paper introduces CODE-DITING, a reasoning-based metric for evaluating code that balances accuracy, efficiency, and explainability; its 7B variant surpasses much larger models such as GPT-4o and DeepSeek-V3 671B.
Contribution
The paper proposes CODE-DITING, a novel, scalable, and explainable code evaluation method that transfers reasoning capabilities from large models to smaller ones, improving performance and reducing costs.
Findings
CODE-DITING 1.5B outperforms similarly sized models.
CODE-DITING 7B surpasses GPT-4o and DeepSeek-V3 671B.
The method is robust to preference leakage.
Abstract
Trustworthy evaluation methods for code snippets play a crucial role in neural code generation. Traditional methods, which either rely on reference solutions or require executable test cases, have inherent limitations in flexibility and scalability. The recent LLM-as-Judge methodology offers a promising alternative by directly evaluating functional consistency between the problem description and the generated code. To systematically understand the landscape of these LLM-as-Judge methods, we conduct a comprehensive empirical study across three diverse datasets. Our investigation reveals the pros and cons of two categories of LLM-as-Judge methods: methods based on general foundation models can achieve good performance but require complex prompts and lack explainability, while methods based on reasoning foundation models provide better explainability with simpler prompts but demand…
Taxonomy
Topics: Software Engineering Research · Model-Driven Software Engineering Techniques · Software Testing and Debugging Techniques
