SparseCoder: Advancing Source Code Analysis with Sparse Attention and Learned Token Pruning
Xueqi Yang, Mariusz Jakubowski, Li Kang, Haojie Yu, Tim Menzies

TL;DR
SparseCoder introduces sparse attention and learned token pruning to enable Transformer-based source code analysis to handle longer sequences efficiently, with faster runtime and improved interpretability.
Contribution
It combines sparse attention with learned token pruning so that a Transformer-based source code analysis model can handle longer input sequences and run faster than existing models.
Findings
Handles sequences at least twice as long as previous models.
Runs four times faster and achieves a 50% reduction in FLOPs.
Scales linearly with token length, unlike quadratic scaling of other methods.
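To make the linear-versus-quadratic claim concrete, here is a minimal sketch that counts query-key pairs under full self-attention and under a sliding-window pattern. The sliding window is only one common form of sparse attention and is an assumption here; this summary does not say which sparse pattern SparseCoder uses, and the function names and the window size of 64 are hypothetical.

```python
def full_attention_pairs(n: int) -> int:
    """Query-key pairs scored by dense self-attention over n tokens."""
    return n * n


def windowed_attention_pairs(n: int, window: int) -> int:
    """Query-key pairs when each token attends only to neighbours within
    `window` positions on either side (clipped at the sequence edges).

    Assumption: a sliding-window pattern stands in for "sparse attention".
    """
    total = 0
    for i in range(n):
        lo = max(0, i - window)
        hi = min(n - 1, i + window)
        total += hi - lo + 1
    return total


if __name__ == "__main__":
    # Doubling n quadruples the dense count but roughly doubles the windowed one.
    for n in (512, 1024, 2048):
        print(n, full_attention_pairs(n), windowed_attention_pairs(n, window=64))
```

With a fixed window, each extra token adds a constant number of pairs, which is where the linear growth comes from; the dense count grows with the square of the sequence length.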
Abstract
As software projects rapidly evolve, software artifacts become more complex and the defects behind them get harder to identify. The emerging Transformer-based approaches, though achieving remarkable performance, struggle with long code sequences due to their self-attention mechanism, which scales quadratically with the sequence length. This paper introduces SparseCoder, an innovative approach that incorporates sparse attention and a learned token pruning (LTP) method (adapted from natural language processing) to address this limitation. Compared to the previous state-of-the-art models CodeBERT, RoBERTa, and CodeT5, our experiments demonstrate that SparseCoder can handle significantly longer input sequences--at least twice as long, within the limits of our hardware resources and data statistics. Additionally, SparseCoder is four times faster than other methods measured in runtime, achieving a 50% reduction…
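The token pruning idea mentioned in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: it assumes a token's importance is the mean attention probability it receives across heads and queries, and it uses a fixed hypothetical `threshold` where LTP would learn one. The names `token_importance` and `prune_tokens` are made up for this example.

```python
from typing import List, Sequence


def token_importance(attn: Sequence[Sequence[Sequence[float]]]) -> List[float]:
    """Mean attention each token receives.

    attn[h][i][j] is the attention probability from query i to key j in head h.
    Assumption: importance of token j = average of attn[h][i][j] over h and i.
    """
    heads = len(attn)
    n = len(attn[0])
    scores = [0.0] * n
    for head in attn:
        for row in head:
            for j, p in enumerate(row):
                scores[j] += p
    return [s / (heads * n) for s in scores]


def prune_tokens(tokens: Sequence[str],
                 attn: Sequence[Sequence[Sequence[float]]],
                 threshold: float) -> List[str]:
    """Keep only tokens whose importance reaches the threshold.

    Assumption: `threshold` is a fixed constant here; in LTP it is learned.
    """
    scores = token_importance(attn)
    return [t for t, s in zip(tokens, scores) if s >= threshold]


if __name__ == "__main__":
    # One head, three tokens; the last token receives little attention.
    attn = [[[0.6, 0.3, 0.1],
             [0.5, 0.4, 0.1],
             [0.7, 0.2, 0.1]]]
    print(prune_tokens(["if", "x", ";"], attn, threshold=0.2))
```

Dropping low-importance tokens shortens the sequence that later layers must process, which is how pruning can reduce both runtime and FLOPs.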
Taxonomy
Topics: Software Engineering Research · Advanced Malware Detection Techniques · Software Reliability and Analysis Research
