Defects4Log: Benchmarking LLMs for Logging Code Defect Detection and Reasoning
Xin Wang, Zhenhao Li, Zishuo Ding

TL;DR
This paper introduces Defects4Log, a benchmark and evaluation framework for measuring how well large language models (LLMs) detect and reason about defects in logging code, and uses it to identify current limitations and possible improvements.
Contribution
The paper derives a taxonomy of logging code defects, constructs a dataset of real-world defects, and evaluates LLM performance on that dataset, drawing out insights for future improvement.
Findings
LLMs struggle with defect detection using source code alone.
Incorporating detailed defect scenarios improves detection accuracy by 10.9%.
The study provides guidance for practitioners and a foundation for future LLM-based defect detection.
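The difference between the two settings in the findings above (source code alone versus source code plus defect scenarios) can be pictured as a difference in prompt construction. The following is a minimal sketch under stated assumptions, not the authors' prompts: the function `build_prompt`, the prompt wording, the scenario descriptions, and the example snippet are all hypothetical and written only for illustration.

```python
from typing import Optional

# Hypothetical scenario descriptions, invented for illustration only;
# they are not taken from the paper's taxonomy.
EXAMPLE_SCENARIOS = [
    "The log message names a different variable than the one it prints.",
    "The log level does not match the severity of the event.",
]


def build_prompt(code: str, scenarios: Optional[list[str]] = None) -> str:
    """Return a detection prompt, optionally enriched with defect scenarios."""
    parts = [
        "Does the following code contain a logging code defect? "
        "Explain your reasoning.",
        "",
        "Code:",
        code,
    ]
    if scenarios:
        # Enriched setting: append the scenario descriptions as a checklist.
        parts += ["", "Known defect scenarios to check:"]
        parts += [f"- {s}" for s in scenarios]
    return "\n".join(parts)


# Hypothetical logging statement used as input in both settings.
snippet = 'logger.info("Failed to open file: " + user_id)'

baseline = build_prompt(snippet)                     # source code alone
enriched = build_prompt(snippet, EXAMPLE_SCENARIOS)  # code plus scenarios
```

In this sketch the enriched prompt is the baseline prompt with a scenario checklist appended, so any change in model output can be attributed to the added scenario text rather than to rewording.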
Abstract
Logging code is written by developers to capture system runtime behavior and plays a vital role in debugging, performance analysis, and system monitoring. However, defects in logging code can undermine the usefulness of logs and lead to misinterpretations. Although prior work has identified several logging defect patterns and provided valuable insights into logging practices, these studies often focus on a narrow range of defect patterns derived from limited sources (e.g., commit histories) and lack a systematic and comprehensive analysis. Moreover, large language models (LLMs) have demonstrated promising generalization and reasoning capabilities across a variety of code-related tasks, yet their potential for detecting logging code defects remains largely unexplored. In this paper, we derive a comprehensive taxonomy of logging code defects, which encompasses seven logging code defect…
