Language-Agnostic Representation Learning of Source Code from Structure and Context
Daniel Zügner, Tobias Kirschstein, Michele Catasta, Jure Leskovec, Stephan Günnemann

TL;DR
This paper introduces a language-agnostic model that learns jointly from source code (Context) and its abstract syntax tree (Structure). It achieves state-of-the-art monolingual code summarization and presents the first multilingual code summarization model, with the strongest gains on low-resource languages.
Contribution
It presents a novel joint learning approach using only language-agnostic features from code and AST, enabling effective multilingual code summarization without parallel data.
Findings
State-of-the-art monolingual code summarization on all five programming languages considered.
First multilingual code summarization model.
Joint training on non-parallel data from multiple programming languages improves results on every individual language, with the strongest gains on low-resource languages.
Abstract
Source code (Context) and its parsed abstract syntax tree (AST; Structure) are two complementary representations of the same computer program. Traditionally, designers of machine learning models have relied predominantly either on Structure or Context. We propose a new model, which jointly learns on Context and Structure of source code. In contrast to previous approaches, our model uses only language-agnostic features, i.e., source code and features that can be computed directly from the AST. Besides obtaining state-of-the-art on monolingual code summarization on all five programming languages considered in this work, we propose the first multilingual code summarization model. We show that jointly training on non-parallel data from multiple programming languages improves results on all individual languages, where the strongest gains are on low-resource languages. Remarkably,…
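To make "features that can be computed directly from the AST" more concrete, the following is a minimal sketch, not the authors' implementation. It assumes, purely for illustration, that one such feature is the shortest-path distance between pairs of AST nodes. Python's standard `ast` module stands in for a parser here; a language-agnostic pipeline would need an equivalent parser per language. The helper names `build_ast_graph` and `pairwise_distances` are hypothetical.

```python
import ast
from collections import deque


def build_ast_graph(source):
    """Parse Python source; return node type labels (pre-order) and an
    undirected adjacency list over the AST nodes."""
    tree = ast.parse(source)
    labels, adj = [], []

    def visit(node, parent):
        idx = len(labels)
        labels.append(type(node).__name__)
        adj.append([])
        if parent is not None:
            adj[parent].append(idx)
            adj[idx].append(parent)
        for child in ast.iter_child_nodes(node):
            # Skip Load/Store context markers: they carry no structure and
            # may be shared between nodes, which would break the tree shape.
            if isinstance(child, ast.expr_context):
                continue
            visit(child, idx)

    visit(tree, None)
    return labels, adj


def pairwise_distances(adj):
    """Shortest-path length between every pair of nodes, via BFS per node.
    Assumption: this stands in for a structural feature; the abstract does
    not say which AST features the model actually uses."""
    n = len(adj)
    matrix = []
    for start in range(n):
        dist = [-1] * n
        dist[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        matrix.append(dist)
    return matrix


labels, adj = build_ast_graph("x = 1")
distances = pairwise_distances(adj)
```

Only node types and tree topology enter the computation, so the same feature could in principle be derived from any language's AST without language-specific rules.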
Taxonomy
Topics: Software Engineering Research · Advanced Malware Detection Techniques · Topic Modeling
