CodeBERT-nt: code naturalness via CodeBERT
Ahmed Khanfir, Matthieu Jimenez, Mike Papadakis, Yves Le Traon

TL;DR
CodeBERT-nt uses a pre-trained language model (CodeBERT) to measure code naturalness by how well the model predicts masked tokens. Ranking lines by this score prioritizes buggy lines better than random and complexity-based ranking, and comparably to or slightly better than n-gram models.
Contribution
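The paper proposes estimating code naturalness with pre-trained models instead of statistical models such as n-grams, which are tedious to train on large corpora and sensitive to the code patterns seen during training. It shows that the resulting naturalness scores are useful for bug prioritization.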
Findings
CodeBERT-nt outperforms random and complexity-based ranking techniques.
It achieves comparable or slightly better results than n-gram models.
The approach effectively prioritizes buggy code lines based on naturalness.
Abstract
Much of software-engineering research relies on the naturalness of code, the fact that code, in small code snippets, is repetitive and can be predicted using statistical language models such as n-grams. Although powerful, training such models on a large code corpus is tedious, time-consuming and sensitive to code patterns (and practices) encountered during training. Consequently, these models are often trained on small corpora and estimate a language naturalness that is relative to a specific style of programming or type of project. To overcome these issues, we propose using pre-trained language models to infer code naturalness. Pre-trained models are often built on big data, are easy to use in an out-of-the-box way and include powerful association-learning mechanisms. Our key idea is to quantify code naturalness through its predictability, by using state-of-the-art generative…
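The idea of scoring naturalness through the predictability of masked tokens can be illustrated with a minimal sketch. The excerpt above does not give the paper's exact scoring formula, so this example assumes one plausible choice: the mean negative log-probability of each original token when it is masked. The predictor here is a toy smoothed unigram model standing in for CodeBERT's masked-token predictions, and every name (`make_unigram_predictor`, `line_surprisal`, `rank_lines`) is hypothetical.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence

MASK = "<mask>"

# Hypothetical predictor interface: given the tokens with one position masked,
# return the probability assigned to the original token at that position.
Predictor = Callable[[Sequence[str], int, str], float]


def make_unigram_predictor(corpus_tokens: Sequence[str], smoothing: float = 1.0) -> Predictor:
    """Toy stand-in for a masked language model (assumption, not CodeBERT).

    It ignores the context and returns an add-one smoothed unigram probability.
    """
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot for unseen tokens

    def predict(masked_tokens: Sequence[str], position: int, target: str) -> float:
        return (counts[target] + smoothing) / (total + smoothing * vocab_size)

    return predict


def line_surprisal(tokens: Sequence[str], predict: Predictor) -> float:
    """Mask each token in turn and average -log p(original token)."""
    if not tokens:
        return 0.0
    total = 0.0
    for i, original in enumerate(tokens):
        masked = list(tokens)
        masked[i] = MASK
        total += -math.log(predict(masked, i, original))
    return total / len(tokens)


def rank_lines(lines: Sequence[str], predict: Predictor) -> List[int]:
    """Return line indices ordered from least natural (most surprising) to most natural."""
    scores = [line_surprisal(line.split(), predict) for line in lines]
    return sorted(range(len(lines)), key=lambda i: scores[i], reverse=True)


# Usage: "train" the toy predictor on repetitive code, then rank three lines.
corpus = "i = i + 1 ; j = j + 1 ; k = k + 1 ;".split()
predict = make_unigram_predictor(corpus)
lines = ["i = i + 1 ;", "j = k + 1 ;", "q ^= zz >> 7 ;"]
order = rank_lines(lines, predict)
print(order[0])  # index of the line the toy model finds least natural
```

To use a real masked language model, only the `predict` callable would change; the masking loop and the ranking step stay the same. Whitespace tokenization is also a simplification made for this sketch.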
Taxonomy
Topics: Software Engineering Research · Software Reliability and Analysis Research · Software System Performance and Reliability
