Collu-Bench: A Benchmark for Predicting Language Model Hallucinations in Code

Nan Jiang; Qi Li; Lin Tan; Tianyi Zhang

arXiv:2410.09997·cs.SE·October 15, 2024

TL;DR

This paper introduces Collu-Bench, a comprehensive benchmark dataset for predicting hallucinations in code generated by large language models, enabling better understanding and mitigation of such errors.

Contribution

We present Collu-Bench, the first benchmark for code hallucination prediction in LLMs, with detailed features and extensive experiments across multiple models and tasks.

Findings

01 Accuracy in predicting code hallucinations ranges from 22.03% to 33.15%.

02 Code hallucination patterns are complex and challenging to localize.

03 More advanced techniques are needed for effective hallucination detection.

Abstract

Despite their success, large language models (LLMs) face the critical challenge of hallucinations, generating plausible but incorrect content. While much research has focused on hallucinations in multiple modalities including images and natural language text, less attention has been given to hallucinations in source code, which leads to incorrect and vulnerable code that causes significant financial loss. To pave the way for research in LLMs' hallucinations in code, we introduce Collu-Bench, a benchmark for predicting code hallucinations of LLMs across code generation (CG) and automated program repair (APR) tasks. Collu-Bench includes 13,234 code hallucination instances collected from five datasets and 11 diverse LLMs, ranging from open-source models to commercial ones. To better understand and predict code hallucinations, Collu-Bench provides detailed features such as the per-step log…

Taxonomy

Topics: Biomedical Text Mining and Ontologies · Text Readability and Simplification · Machine Learning in Healthcare

Methods: Softmax · Attention Is All You Need