TREAT: A Code LLMs Trustworthiness / Reliability Evaluation and Testing Framework
Shuzheng Gao, Eric John Li, Man Ho Lam, Jingyu Xiao, Yuxuan Wan, Chaozheng Wang, Ng Man Tik, Michael R. Lyu

TL;DR
TREAT is a comprehensive evaluation framework designed to assess the trustworthiness and reliability of code language models across multiple tasks, languages, and modalities, addressing limitations of existing benchmarks.
Contribution
The paper introduces TREAT, a holistic and multi-faceted evaluation framework for code LLMs, incorporating robustness, multi-task, multi-language, and multi-modality assessments.
Findings
Model performance varies significantly across tasks
Multi-modal models have limitations in UI code generation
Robustness evaluation reveals model vulnerabilities
Abstract
Large foundation models are fundamentally transforming the software engineering landscape, demonstrating exceptional capabilities across diverse tasks such as code generation, debugging, and testing. Despite this rapid progress, a significant gap remains in how to comprehensively evaluate these models' trustworthiness in real-world software engineering scenarios. Existing benchmarks suffer from limited task scope and fail to incorporate critical evaluation aspects such as the robustness and reliability of models. To bridge this gap, we present an evaluation framework called TREAT (Code LLMs Trustworthiness / Reliability Evaluation And Testing) that provides a holistic assessment of model performance in code intelligence tasks. Our evaluation framework addresses key limitations in existing approaches with four main improvements: (1) Multi-Task Holistic Evaluation that spans diverse…
Peer Reviews
Decision: Submitted to ICLR 2026
- Holistic benchmark with coverage across different tasks and languages, including multimodal and robustness assessments.
- Sections 5.3 and 5.4 present interesting findings: models with thinking exhibit better robustness to code perturbations, and models' evaluation results are sensitive to changes in the prompt.
- Using GPT-4o as the only LLM judge may bias the scores towards GPT* models for tasks beyond code correctness. Why did the authors not consider an ensemble-based ranking?
- On page 2, the authors mention "Current state-of-the-art models exhibit substantial performance variation and specialization across different programming tasks" as one of the novel findings. However, the LiveCodeBench paper also discusses similar findings in Figure 4 (https://arxiv.org/pdf/2403.07974). This seems to be a
- The framework is comprehensive, covering diverse coding-related tasks, multiple programming languages, and multimodal settings. This provides a broad and unified view of code LLM capabilities.
- The inclusion of robustness evaluation through semantically-preserving perturbations is practical and relevant. The finding that existing models degrade significantly under prompt perturbations is insightful and highlights an important real-world limitation.
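To make the idea of a semantically-preserving perturbation concrete, here is a minimal sketch of one such transformation: renaming arguments and assigned variables to neutral identifiers. This is not TREAT's perturbation pipeline; the function name `rename_identifiers` and the `var_N` naming scheme are assumptions made for illustration, and the sketch only targets small self-contained functions called with positional arguments.

```python
import ast


def rename_identifiers(source: str) -> str:
    """Rename arguments and assigned variables to var_0, var_1, ...

    Hypothetical helper for illustration only. Limitations: calls that use
    keyword arguments, class attributes, and sources that already contain
    names like var_0 are not handled.
    """
    tree = ast.parse(source)

    # Pass 1: collect names bound in the source (arguments and assignment targets).
    bound = []
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            name = node.arg
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            name = node.id
        else:
            continue
        if name not in bound:
            bound.append(name)
    mapping = {name: f"var_{i}" for i, name in enumerate(bound)}

    # Pass 2: rewrite every occurrence of a collected name.
    # Builtins and imported names are never collected, so they stay unchanged.
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            node.arg = mapping[node.arg]
        elif isinstance(node, ast.Name) and node.id in mapping:
            node.id = mapping[node.id]

    return ast.unparse(tree)  # requires Python 3.9+


original = "def add(a, b):\n    total = a + b\n    return total\n"
perturbed = rename_identifiers(original)
print(perturbed)
```

A robustness check in this style would run the same test cases against the model's output for both the original and the perturbed input and compare the pass rates; the behavior of the code itself should be identical by construction.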
- The size and scale of each sub-benchmark are not clearly reported, making it difficult to assess coverage and statistical significance.
- The prompt diversification process relies on manual validation, which limits scalability and reproducibility.
- On certain tasks such as Code Review, the evaluation results fail to effectively distinguish between models of different sizes or architectures, suggesting limited sensitivity of the metric or dataset.
- The
- The paper evaluates 26 state-of-the-art models across diverse tasks, languages, and modalities, offering insights and extensive empirical data that could be useful for the community.
- The use of three prompts per task demonstrates prompt sensitivity, though the magnitude and practical implications remain unclear (see notes below).
- The paper provides detailed appendices with the experimental setup and promises a code/data release, supporting reproducibility.
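One way the magnitude of prompt sensitivity could be quantified is by summarizing each model's scores across its prompt variants. The sketch below is an assumption about how such a summary might look, not the paper's method; `prompt_sensitivity`, the model names, and the scores are all hypothetical placeholders rather than results from TREAT.

```python
from statistics import mean, pstdev


def prompt_sensitivity(scores):
    """Summarize one model's scores across prompt variants of the same task.

    Hypothetical helper: `scores` holds one score per prompt variant.
    """
    return {
        "mean": mean(scores),
        "spread": max(scores) - min(scores),  # best prompt minus worst prompt
        "std": pstdev(scores),                # population standard deviation
    }


# Placeholder scores for three prompts per task (illustrative, not real data).
example_scores = {
    "model_a": [0.5, 0.6, 0.7],
    "model_b": [0.6, 0.6, 0.6],
}

summary = {name: prompt_sensitivity(s) for name, s in example_scores.items()}
for name, stats in summary.items():
    print(name, stats)
```

Reporting the spread next to the mean shows whether a ranking gap between two models is larger than the variation caused by prompt wording alone.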
The paper lacks both novelty and practicality. First, the paper is fundamentally an ensemble of existing benchmarks and methods with minimal innovation. The benchmark directly samples from or reuses PolyHumanEval, HumanEval+, MBPP, PrimeVul, SymPrompt/CodaMosa, DesignBench, and CodeCrash without significant modification. For instance, code generation uses problems from GeeksforGeeks and HackerRank with EvalPlus-style test augmentation, code translation uses PolyHumanEval and GeeksforGeeks, vul
Taxonomy
Topics: Software Engineering Research · Software Testing and Debugging Techniques · Software Engineering Techniques and Practices
