TL;DR
This paper demonstrates that Sparse Autoencoders (SAEs) can detect bugs in Java functions using features extracted from pretrained LLMs, outperforming fine-tuned transformer baselines without fine-tuning the underlying model.
Contribution
It provides the first empirical evidence that SAEs can detect software bugs directly from pretrained LLM representations without fine-tuning.
Findings
SAEs achieve an F1 score of up to 89% in bug detection.
SAEs outperform fine-tuned transformer baselines.
SAEs provide an interpretable, lightweight alternative for vulnerability detection.
Abstract
Software vulnerabilities such as buffer overflows and SQL injections are a major source of security breaches. Traditional methods for vulnerability detection remain essential but are limited by high false positive rates, scalability issues, and reliance on manual effort. These constraints have driven interest in AI-based approaches to automated vulnerability detection and secure code generation. While Large Language Models (LLMs) have opened new avenues for classification tasks, their complexity and opacity pose challenges for interpretability and deployment. Sparse Autoencoders (SAEs) offer a promising solution to this problem. We explore whether SAEs can serve as a lightweight, interpretable alternative for bug detection in Java functions. We evaluate the effectiveness of SAEs when applied to representations from GPT-2 Small and Gemma 2B, examining their capacity to highlight buggy behaviour…
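To make the idea of classifying code from SAE features more concrete, here is a minimal sketch of the general technique. It is not the paper's pipeline, which the abstract does not specify. It assumes a standard ReLU SAE encoder, max pooling over tokens to get one feature vector per function, and random weights standing in for a pretrained SAE and for real LLM activations. The names sae_encode, pool_function, and f1_score are hypothetical helpers introduced only for illustration.

```python
import numpy as np

def sae_encode(acts, w_enc, b_enc, b_dec):
    """ReLU SAE encoder (assumed form): f = ReLU((x - b_dec) @ W_enc + b_enc)."""
    return np.maximum((acts - b_dec) @ w_enc + b_enc, 0.0)

def pool_function(token_feats):
    """Collapse per-token features (tokens, n_feats) into one vector per function.
    Max pooling is an assumption; other poolings are equally plausible."""
    return token_feats.max(axis=0)

def f1_score(y_true, y_pred):
    """Binary F1 = 2*TP / (2*TP + FP + FN), with label 1 meaning 'buggy'."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy stand-ins: random weights and activations, not a trained SAE or a real LLM.
rng = np.random.default_rng(0)
d_model, n_feats, n_tokens = 8, 32, 5
w_enc = rng.normal(size=(d_model, n_feats))
b_enc = np.zeros(n_feats)
b_dec = np.zeros(d_model)
acts = rng.normal(size=(n_tokens, d_model))  # one function's token activations

# One non-negative feature vector per function, ready for a simple classifier.
feats = pool_function(sae_encode(acts, w_enc, b_enc, b_dec))
```

In a setup like this, the pooled feature vectors for many functions would be passed to a small downstream classifier, and f1_score would be computed on its held-out predictions. Which features and which classifier work best is an empirical question that this sketch does not answer.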
