A Critical Study of What Code-LLMs (Do Not) Learn

Abhinav Anand; Shweta Verma; Krishna Narasimhan; Mira Mezini

arXiv:2406.11930·cs.SE·June 19, 2024

TL;DR

A fine-grained analysis of attention maps and hidden representations shows that code-LLMs encode relations among syntactic tokens and among identifiers, but fail to encode relations between the two groups. Larger models encode less information about code.

Contribution

The paper offers a fine-grained analysis of what code-LLMs encode, highlighting the code relations they fail to capture and the effects of model size and fine-tuning.

Findings

01 Models encode relations among syntactic tokens and relations among identifiers.

02 Models fail to encode relations between syntactic tokens and identifiers.

03 Larger models encode less information about code.

Abstract

Large Language Models trained on code corpora (code-LLMs) have demonstrated impressive performance in various coding assistance tasks. However, despite increases in model size and training data, code-LLMs still have limitations, such as suggesting code with syntactic errors or variable misuse. Some studies argue that code-LLMs perform well on coding tasks because they use self-attention and hidden representations to encode relations among input tokens. However, previous work has not studied which code properties code-LLMs fail to encode. In this paper, we conduct a fine-grained analysis of attention maps and hidden representations of code-LLMs. Our study indicates that code-LLMs only encode relations among specific subsets of input tokens. Specifically, by categorizing input tokens into syntactic tokens and identifiers, we found that models encode relations among syntactic tokens…
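
The token categorization described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the repository listed under Code & Models for that); it assumes Python source, uses the standard-library tokenizer in place of a model's subword tokenizer, and treats keywords and operators as "syntactic tokens". The function names `categorize` and `attention_mass_by_category` are hypothetical, and the attention matrix passed in is supplied by the caller, for example from one attention head of a model.

```python
import io
import keyword
import tokenize


def categorize(source):
    """Return (token_text, category) pairs for a Python snippet.

    Assumption: keywords and operators/punctuation count as "syntactic",
    non-keyword names count as "identifier". Literals, comments, newlines
    and indentation tokens are skipped.
    """
    pairs = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            cat = "syntactic" if keyword.iskeyword(tok.string) else "identifier"
        elif tok.type == tokenize.OP:
            cat = "syntactic"
        else:
            continue
        pairs.append((tok.string, cat))
    return pairs


def attention_mass_by_category(attn, cats):
    """Average attention a token of one category sends to each category.

    attn[i][j] is the weight token i places on token j (rows sum to 1).
    Returns {(source_category, target_category): mean mass per source token}.
    """
    totals = {}
    counts = {}
    for i, row in enumerate(attn):
        counts[cats[i]] = counts.get(cats[i], 0) + 1
        for j, weight in enumerate(row):
            key = (cats[i], cats[j])
            totals[key] = totals.get(key, 0.0) + weight
    return {key: total / counts[key[0]] for key, total in totals.items()}
```

A low value for the `("syntactic", "identifier")` and `("identifier", "syntactic")` entries relative to the within-category entries would be the kind of pattern the abstract describes. Real code-LLMs operate on subword tokens, so an actual analysis would also need to align subwords with these source-level tokens.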

Code & Models

Repositories

stg-tud/code-llm-critical-evaluation (PyTorch, official)

Taxonomy

Topics: Comparative and International Law Studies · Artificial Intelligence in Law · Legal Language and Interpretation

Methods: Softmax · Attention Is All You Need