Code Linting using Language Models
Darren Holden, Nafiseh Kahani

TL;DR
This paper explores using large language models to create a versatile, language-independent code linter capable of detecting various issues with high accuracy, aiming to improve over traditional, language-specific linters.
Contribution
The paper introduces an approach that trains language models to detect multiple types of issues in code, and reports high accuracy together with broader versatility than conventional, language-specific linters.
Findings
Achieved 84.9% accuracy in binary issue detection.
Achieved 83.6% accuracy in multi-issue classification.
Demonstrated the potential for language models to replace traditional linters.
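The summary does not say how the two accuracy figures were computed. As a rough illustration only, the sketch below shows one standard way to score a binary classifier and two common ways to score a multi-label one (exact match and per-label). The function names and the sample labels are hypothetical, and which definition the paper used for its multi-issue figure is not stated here.

```python
def binary_accuracy(y_true, y_pred):
    """Fraction of snippets whose issue / no-issue prediction is correct."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def exact_match_accuracy(true_sets, pred_sets):
    """Fraction of snippets whose predicted label set equals the true set."""
    correct = sum(set(t) == set(p) for t, p in zip(true_sets, pred_sets))
    return correct / len(true_sets)


def per_label_accuracy(true_sets, pred_sets, labels):
    """Fraction of (snippet, label) decisions that are correct."""
    correct = sum(
        (label in t) == (label in p)
        for t, p in zip(true_sets, pred_sets)
        for label in labels
    )
    return correct / (len(true_sets) * len(labels))


# Hypothetical labels, used only to exercise the functions.
labels = ["memory_leak", "unused_variable"]
truth = [{"memory_leak"}, {"memory_leak", "unused_variable"}]
preds = [{"memory_leak"}, {"memory_leak"}]

print(binary_accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
print(exact_match_accuracy(truth, preds))
print(per_label_accuracy(truth, preds, labels))
```

Exact match is the stricter of the two multi-label scores, since a single missed label makes the whole snippet count as wrong.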
Abstract
Code linters play a crucial role in developing high-quality software systems by detecting potential problems (e.g., memory leaks) in the source code of systems. Despite their benefits, code linters are often language-specific, focused on certain types of issues, and prone to false positives in the interest of speed. This paper investigates whether large language models can be used to develop a more versatile code linter. Such a linter is expected to be language-independent, cover a variety of issue types, and maintain high speed. To achieve this, we collected a large dataset of code snippets and their associated issues. We then selected a language model and trained two classifiers based on the collected datasets. The first is a binary classifier that detects if the code has issues, and the second is a multi-label classifier that identifies the types of issues. Through extensive…
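To make the difference between the two classifiers concrete, here is a minimal sketch of how their outputs might be turned into lint results. It assumes each classifier emits raw logits, that the multi-label classifier scores every issue type independently with a sigmoid, and that the two are chained so the second runs only when the first flags a snippet. None of these details are confirmed by the abstract, and the issue labels and the 0.5 threshold are placeholders.

```python
import math

# Hypothetical issue types; the abstract does not list the paper's categories.
ISSUE_LABELS = ["memory_leak", "null_dereference", "unused_variable"]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def has_issue(binary_logit: float, threshold: float = 0.5) -> bool:
    """Binary classifier decision: does the snippet contain any issue?"""
    return sigmoid(binary_logit) >= threshold


def issue_types(label_logits, threshold: float = 0.5):
    """Multi-label decision: each label is accepted or rejected independently."""
    return [
        label
        for label, logit in zip(ISSUE_LABELS, label_logits)
        if sigmoid(logit) >= threshold
    ]


def lint(binary_logit: float, label_logits):
    """Assumed chaining: classify issue types only for flagged snippets."""
    if not has_issue(binary_logit):
        return []
    return issue_types(label_logits)


print(lint(2.0, [1.5, -3.0, 0.2]))   # flagged snippet
print(lint(-2.0, [5.0, 5.0, 5.0]))   # snippet the binary classifier clears
```

Because the labels are scored independently, a single snippet can receive zero, one, or several issue types, which is what distinguishes multi-label from ordinary multi-class classification.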
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Digital Communication and Language · Natural Language Processing Techniques
