Exploring Software Naturalness through Neural Language Models
Luca Buratti, Saurabh Pujar, Mihaela Bornea, Scott McCarley, Yunhui Zheng, Gaetano Rossiello, Alessandro Morari, Jim Laredo, Veronika Thost, Yufan Zhuang, Giacomo Domeniconi

TL;DR
This paper investigates whether transformer-based language models can understand source code similarly to natural language, demonstrating their effectiveness in AST feature discovery and vulnerability detection without relying on traditional AST-based features.
Contribution
It introduces a novel sequence labeling task to probe transformer models' understanding of AST features directly from raw source code.
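To make the idea of AST tagging as sequence labeling concrete, here is a minimal sketch of how per-token AST labels could be derived. It is an illustration under stated assumptions, not the authors' pipeline: the summary does not say which programming language or tag set the paper uses, so this sketch uses Python's standard `ast` and `tokenize` modules as a stand-in, and the function name `ast_tags` and the fallback label `"O"` are hypothetical.

```python
import ast
import io
import tokenize

# Token types that carry no AST label in this sketch.
_SKIP = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT, tokenize.DEDENT,
    tokenize.ENDMARKER, tokenize.COMMENT, tokenize.ENCODING,
}


def ast_tags(source: str) -> list[tuple[str, str]]:
    """Pair each token with the type of the innermost AST node covering it.

    Assumption: the source is ASCII, so the byte offsets reported by `ast`
    match the character columns reported by `tokenize`.
    """
    tree = ast.parse(source)

    # Collect (start, end, depth, node type) for every node with a position.
    spans = []

    def visit(node: ast.AST, depth: int) -> None:
        if getattr(node, "end_lineno", None) is not None:
            spans.append((
                (node.lineno, node.col_offset),
                (node.end_lineno, node.end_col_offset),
                depth,
                type(node).__name__,
            ))
        for child in ast.iter_child_nodes(node):
            visit(child, depth + 1)

    visit(tree, 0)

    tagged = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIP:
            continue
        best = None  # (depth, label) of the deepest enclosing node so far
        for start, end, depth, name in spans:
            if start <= tok.start and tok.end <= end:
                if best is None or depth > best[0]:
                    best = (depth, name)
        # "O" (outside) is a hypothetical fallback for uncovered tokens.
        tagged.append((tok.string, best[1] if best else "O"))
    return tagged


if __name__ == "__main__":
    for token, label in ast_tags("x = a + 1\n"):
        print(f"{token!r:6} {label}")
```

A probing setup in this spirit would feed only the raw token sequence to the language model and train it to predict the label column, so that high tagging accuracy suggests the model has recovered structure that is normally read off the AST.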
Findings
Transformer models achieve high accuracy in AST tagging.
Models perform comparably to graph-based approaches in vulnerability detection.
Raw source code models can discover structural features traditionally extracted via ASTs.
Abstract
The Software Naturalness hypothesis argues that programming languages can be understood through the same techniques used in natural language processing. We explore this hypothesis through the use of a pre-trained transformer-based language model to perform code analysis tasks. Present approaches to code analysis depend heavily on features derived from the Abstract Syntax Tree (AST), while our transformer-based language models work on raw source code. This work is the first to investigate whether such language models can discover AST features automatically. To achieve this, we introduce a sequence labeling task that directly probes the language model's understanding of the AST. Our results show that transformer-based language models achieve high accuracy in the AST tagging task. Furthermore, we evaluate our model on a software vulnerability identification task. Importantly, we show that our…
Taxonomy
Topics: Software Engineering Research · Advanced Malware Detection Techniques · Software Reliability and Analysis Research
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Multi-Head Attention · Adam · Dropout · Byte Pair Encoding
