Unified Abstract Syntax Tree Representation Learning for Cross-Language Program Classification
Kesu Wang, Meng Yan, He Zhang, Haibo Hu

TL;DR
This paper introduces a unified AST neural network model for cross-language program classification, effectively capturing semantic features across different programming languages and outperforming existing methods.
Contribution
The paper proposes a novel UAST neural network with unified mechanisms for AST representation and vocabulary, enabling effective cross-language program classification.
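To make the idea of a unified vocabulary concrete, here is a minimal sketch, not the paper's implementation. It assumes a hand-written table that maps language-specific AST node type names to shared tokens. The table contents, the shared token names (`FUNC_DECL`, `LOOP`, `BRANCH`, `UNK`), and the `unify` helper are all hypothetical and chosen only for illustration.

```python
# Hypothetical lookup table: language-specific AST node type -> shared token.
# These names are illustrative assumptions, not taken from the paper.
UNIFIED_VOCAB = {
    "python": {
        "FunctionDef": "FUNC_DECL",
        "For": "LOOP",
        "While": "LOOP",
        "If": "BRANCH",
    },
    "java": {
        "MethodDeclaration": "FUNC_DECL",
        "ForStatement": "LOOP",
        "WhileStatement": "LOOP",
        "IfStatement": "BRANCH",
    },
}

UNKNOWN = "UNK"  # fallback token for node types missing from the table


def unify(language, node_types):
    """Map a sequence of language-specific node type names to shared tokens."""
    table = UNIFIED_VOCAB.get(language, {})
    return [table.get(node_type, UNKNOWN) for node_type in node_types]


# Structurally similar programs in two languages yield the same token sequence.
python_seq = unify("python", ["FunctionDef", "For", "If"])
java_seq = unify("java", ["MethodDeclaration", "ForStatement", "IfStatement"])
```

Under these assumptions, a downstream classifier would see identical inputs for structurally similar programs regardless of source language, which is one way the feature gap between languages could be narrowed.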
Findings
UAST outperforms state-of-the-art baselines in accuracy and F1-score.
A new benchmark dataset with 20,000 files across five languages was created.
The approach effectively reduces feature gaps between programming languages.
Abstract
Program classification can be regarded as a high-level abstraction of code. It lays a foundation for various source code comprehension tasks and has a wide range of applications in software engineering, such as code clone detection, code smell classification, and defect classification. Cross-language program classification can support code transfer between programming languages and promote cross-language code reuse, helping developers write code quickly and reducing the development time of code transfer. Most existing studies focus on semantic learning of code, while few are devoted to cross-language tasks. The main challenge of cross-language program classification is how to extract semantic features from different programming languages. To cope with this difficulty, we propose a Unified Abstract…
