DeepCodeProbe: Towards Understanding What Models Trained on Code Learn

Vahid Majdinasab; Amin Nikanjam; Foutse Khomh

arXiv:2407.08890·cs.SE·July 15, 2024

TL;DR

DeepCodeProbe is a probing method for examining what syntax and representations machine learning models trained on code actually learn. It exposes their capabilities and limitations and yields practical recommendations for improving interpretability and performance.

Contribution

We introduce DeepCodeProbe, a novel probing approach to understand syntax learning in ML models for code, and provide insights and best practices for training more interpretable models.

Findings

01. Small models capture some syntactic abstractions
02. Larger models improve syntax learning but risk overfitting
03. Models learn specific code patterns from training data

Abstract

Machine learning models trained on code and related artifacts offer valuable support for software maintenance but suffer from interpretability issues due to their complex internal variables. These concerns are particularly significant in safety-critical applications where the models' decision-making processes must be reliable. The specific features and representations learned by these models remain unclear, adding to the hesitancy in adopting them widely. To address these challenges, we introduce DeepCodeProbe, a probing approach that examines the syntax and representation learning abilities of ML models designed for software maintenance tasks. Our study applies DeepCodeProbe to state-of-the-art models for code clone detection, code summarization, and comment generation. Findings reveal that while small models capture abstract syntactic representations, their ability to fully grasp…

Taxonomy

Topics: Machine Learning and Data Classification