Bridging the Programming Language Gap: Constructing a Multilingual Shared Semantic Space through AST Unification and Graph Matching
Junhao Chen, Jingxuan Zhang, Jian He, Yixuan Tang, Weiqin Zou

TL;DR
This paper introduces a novel method to create a shared semantic space for multiple programming languages by unifying AST representations and using graph matching, significantly improving cross-language code clone detection and retrieval.
Contribution
The approach unifies AST node labels and employs graph matching to encode functional equivalence across languages, outperforming existing methods in key tasks.
Findings
Outperforms state-of-the-art methods in cross-language clone detection, with near-perfect scores.
Achieves a 13% relative improvement in cross-language code retrieval MRR.
Effectively bridges syntax differences between programming languages.
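For readers unfamiliar with the retrieval metric, the following minimal sketch shows how Mean Reciprocal Rank (MRR) and a relative improvement over a baseline are typically computed. The function names and the numbers in the usage note are illustrative assumptions, not values or code from the paper.

```python
from typing import List, Optional


def mean_reciprocal_rank(ranks: List[Optional[int]]) -> float:
    """Compute MRR from the 1-based rank of the first correct result per query.

    A rank of None means the correct result was not retrieved at all,
    which contributes 0 to the sum.
    """
    if not ranks:
        raise ValueError("ranks must contain at least one query")
    total = sum(1.0 / r for r in ranks if r is not None)
    return total / len(ranks)


def relative_improvement(new_score: float, baseline_score: float) -> float:
    """Return the relative gain of new_score over baseline_score (0.13 means 13%)."""
    if baseline_score == 0:
        raise ValueError("baseline_score must be non-zero")
    return (new_score - baseline_score) / baseline_score
```

As a made-up illustration, three queries whose correct answers appear at ranks 1, 2, and 4 give an MRR of (1 + 1/2 + 1/4) / 3, and moving from a hypothetical baseline MRR of 0.500 to 0.565 corresponds to a 13% relative improvement.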
Abstract
The lexical and syntactic disparities among programming languages (e.g., Java and Python) pose significant challenges for multi-language software engineering tasks such as cross-language code clone detection and code retrieval, since queries or code snippets written in one programming language often fail to match equivalent artifacts in another. To bridge this gap, we propose a novel approach to construct a multi-language shared semantic space, in which functionally equivalent source code written in different programming languages lies close together. In this approach, we first map the Abstract Syntax Tree (AST) node labels of code snippets written in different programming languages into a unified label set, thus compressing high-dimensional language-specific tokens into a common embedding space. Then, we employ a Graph Matching…
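The label-unification step described in the abstract can be sketched as follows. This is only an illustration of the idea under stated assumptions: the unified label names, the two mapping tables, and the helper functions are hypothetical and are not the paper's actual label set or implementation. The Python side uses the standard `ast` module; the Java labels are example strings of the kind a Java parser might emit.

```python
import ast
from typing import Dict, List

# Hypothetical per-language tables mapping language-specific AST node
# labels to a shared label set. The paper's real mapping is not shown here.
UNIFIED_LABELS: Dict[str, Dict[str, str]] = {
    "python": {
        "FunctionDef": "FunctionDeclaration",
        "For": "LoopStatement",
        "While": "LoopStatement",
        "If": "ConditionalStatement",
        "Return": "ReturnStatement",
        "Assign": "Assignment",
        "Call": "Invocation",
        "Name": "Identifier",
        "Constant": "Literal",
        "BinOp": "BinaryExpression",
    },
    "java": {
        "MethodDeclaration": "FunctionDeclaration",
        "ForStatement": "LoopStatement",
        "WhileStatement": "LoopStatement",
        "IfStatement": "ConditionalStatement",
        "ReturnStatement": "ReturnStatement",
    },
}


def unify_label(language: str, label: str) -> str:
    """Map a language-specific node label to the shared set; unknown labels become 'Other'."""
    return UNIFIED_LABELS.get(language, {}).get(label, "Other")


def unified_python_labels(source: str) -> List[str]:
    """Parse Python source and return the unified label of every AST node."""
    tree = ast.parse(source)
    return [unify_label("python", type(node).__name__) for node in ast.walk(tree)]
```

With such a mapping, a Python `For` node and a Java `ForStatement` node both receive the label `LoopStatement`, so a downstream model sees the same vocabulary regardless of the source language.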
