Probing Semantic Grounding in Language Models of Code with Representational Similarity Analysis
Shounak Naik, Rajaswa Patil, Swati Agarwal, Veeky Baths

TL;DR
This paper uses Representational Similarity Analysis to evaluate semantic grounding in CodeBERT. It finds that pre-training alone does not induce semantic grounding, whereas fine-tuning on semantically relevant tasks and bimodal (code plus natural language) inputs improve it significantly.
Contribution
The paper proposes Representational Similarity Analysis as a probe for semantic grounding in language models of code, and shows how fine-tuning and bimodal inputs affect that grounding.
Findings
Pre-training does not induce semantic grounding in CodeBERT.
Fine-tuning on semantic tasks enhances semantic grounding.
Bimodal inputs improve semantic understanding and sample efficiency.
Abstract
Representational Similarity Analysis is a method from cognitive neuroscience, which helps in comparing representations from two different sources of data. In this paper, we propose using Representational Similarity Analysis to probe the semantic grounding in language models of code. We probe representations from the CodeBERT model for semantic grounding by using the data from the IBM CodeNet dataset. Through our experiments, we show that current pre-training methods do not induce semantic grounding in language models of code, and instead focus on optimizing form-based patterns. We also show that even a small amount of fine-tuning on semantically relevant tasks increases the semantic grounding in CodeBERT significantly. Our ablations with the input modality to the CodeBERT model show that using bimodal inputs (code and natural language) over unimodal inputs (only code) gives better…
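As a rough illustration of the general idea behind Representational Similarity Analysis (not the authors' implementation), the sketch below compares two sets of representations of the same items. It builds a pairwise dissimilarity structure for each source and then correlates the two. The choice of cosine distance and Spearman correlation, and all function names (`cosine_distance`, `rdm_upper_triangle`, `rsa_score`), are assumptions made for this example. The summary above does not state which metrics the paper uses.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity between two vectors (assumed dissimilarity metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def rdm_upper_triangle(reps):
    """Flattened upper triangle of the representational dissimilarity matrix."""
    return [cosine_distance(reps[i], reps[j])
            for i, j in combinations(range(len(reps)), 2)]


def ranks(values):
    """Ranks starting at 1, with tied values given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def rsa_score(reps_a, reps_b):
    """Spearman correlation between the two dissimilarity structures."""
    return pearson(ranks(rdm_upper_triangle(reps_a)),
                   ranks(rdm_upper_triangle(reps_b)))


# Toy data: four items represented in two hypothetical sources.
source_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.5]]
source_b = [[2.0 * x for x in vec] for vec in source_a]  # same geometry, rescaled
score = rsa_score(source_a, source_b)
```

In a probing setup like the one described in the abstract, one source would be model representations (for example, CodeBERT embeddings of code samples) and the other a reference representation of the same samples. A higher score indicates that the two sources organize the items more similarly.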
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Machine Learning in Bioinformatics
Methods: CodeBERT
