SAGE: Signal-Amplified Guided Embeddings for LLM-based Vulnerability Detection
Zhengyang Shan, Xu Qian, Jiayun Xin, Minghui Xu, Yue Zhang, Zhen Yang, Hao Wu, Xiuzhen Cheng

TL;DR
SAGE introduces a framework that amplifies vulnerability signals in LLMs, significantly improving detection accuracy and robustness across diverse datasets and languages by addressing the Signal Submersion problem.
Contribution
The paper proposes SAGE, a method that uses task-conditional Sparse Autoencoders (SAEs) to recover and amplify faint vulnerability signals in LLM-based detection, surpassing existing approaches.
Findings
SAGE increases the internal Signal-to-Noise Ratio (SNR) by 12.7×.
It achieves up to a 318% improvement in Matthews Correlation Coefficient (MCC) on unseen data.
It maintains performance across 13 programming languages.
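For readers unfamiliar with the metric, the following is a minimal sketch of how MCC is computed from a binary confusion matrix, and of how a percentage improvement over a baseline could be read. It assumes the 318% figure is a relative improvement, which the summary does not state explicitly. The function names and the example counts are hypothetical and are not taken from the paper.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: return 0 when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative change in percent (assumed reading of an 'X% improvement')."""
    return (improved - baseline) / abs(baseline) * 100.0


if __name__ == "__main__":
    # Made-up counts for illustration only; these are not results from the paper.
    score = mcc(tp=40, tn=45, fp=5, fn=10)
    print(f"MCC = {score:.4f}")
    print(f"0.20 -> 0.50 is a {relative_improvement(0.20, 0.50):.0f}% improvement")
```

Under this reading, a 318% improvement means the improved MCC is about 4.18 times the baseline value. MCC ranges from -1 to 1 and stays informative on imbalanced data, which is typical of vulnerability datasets where vulnerable samples are rare.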
Abstract
Software vulnerabilities are a primary threat to modern infrastructure. While static analysis and Graph Neural Networks have long served as the foundation for vulnerability detection, the emergence of Large Language Models (LLMs) has introduced a transformative paradigm driven by superior semantic reasoning and cross-environment generalization. However, in the context of LLM-based vulnerability detection, we identify a fundamental bottleneck in these models termed Signal Submersion: a state where features related to vulnerability are activated internally but numerically overwhelmed by dominant functional semantics. To address this, we propose SAGE (Signal-Amplified Guided Embeddings), a framework that shifts from passive signal submersion to active signal recovery. SAGE integrates task-conditional Sparse Autoencoders (SAEs) to…
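The idea of a feature being "numerically overwhelmed" can be illustrated with a toy calculation. The sketch below is not the paper's SNR definition and not the SAGE mechanism, which relies on Sparse Autoencoders and is cut off in the abstract above. It assumes a hypothetical activation vector in which one known direction carries the vulnerability signal and everything orthogonal to it counts as noise. All names and numbers are invented for illustration.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def snr(activation, signal_dir):
    """Energy along signal_dir divided by energy in the orthogonal remainder."""
    proj = dot(activation, unit(signal_dir))
    signal_energy = proj ** 2
    noise_energy = dot(activation, activation) - signal_energy
    return signal_energy / noise_energy


def amplify(activation, signal_dir, gain):
    """Scale only the component along signal_dir by `gain`; leave the rest as is."""
    u = unit(signal_dir)
    proj = dot(activation, u)
    return [a + (gain - 1.0) * proj * ui for a, ui in zip(activation, u)]


if __name__ == "__main__":
    # Hypothetical values: two large "functional" coordinates, one faint signal.
    activation = [10.0, 8.0, 0.5]
    signal_dir = [0.0, 0.0, 1.0]
    before = snr(activation, signal_dir)
    after = snr(amplify(activation, signal_dir, gain=4.0), signal_dir)
    print(f"SNR before: {before:.5f}, after: {after:.5f}, ratio: {after / before:.1f}")
```

In this toy setting the signal is present but contributes only a tiny share of the vector's energy, and scaling its direction by a gain g raises the energy-based SNR by g squared while leaving the dominant components untouched.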
