Cobol2Vec: Learning Representations of Cobol code
Ankit Kulshrestha, Vishwas Lele

TL;DR
This paper introduces Cobol2Vec, an unsupervised method to generate fixed-dimensional vector representations of COBOL code, facilitating tasks like code retrieval and aiding modernization of legacy systems.
Contribution
The paper presents the first unsupervised embedding approach specifically for COBOL, enabling effective representation learning for legacy mainframe languages.
Findings
Effective code retrieval on COBOL corpus
Unsupervised embeddings capture semantic code features
Potential to improve legacy code modernization
Abstract
There has been steadily growing interest in developing novel methods that learn a representation of the input data and then use it for several downstream tasks. The field of natural language processing has seen significant improvements on a range of tasks by incorporating pre-trained embeddings into its pipelines. Recently, these methods have been applied to programming languages with a view to improving developer productivity. In this paper, we present an unsupervised learning approach to encode old mainframe languages into a fixed-dimensional vector space. We use COBOL as our motivating example, create a corpus, and demonstrate the efficacy of our approach on a code-retrieval task over that corpus.
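To make the code-retrieval setting concrete, here is a minimal sketch of retrieval over fixed-dimensional vectors. It is not the paper's model: the abstract does not describe the encoder, so `embed` below is a stand-in (a hashed bag-of-tokens), and `DIM`, `tokenize`, `retrieve`, and the two-snippet corpus are all hypothetical names and values chosen for illustration. Only the overall shape (encode every snippet to a vector of the same size, then rank by cosine similarity) reflects what the abstract states.

```python
import hashlib
import math
import re

DIM = 64  # assumed embedding size; the abstract does not give the real one


def tokenize(source: str) -> list[str]:
    # COBOL is case-insensitive and allows hyphens inside identifiers.
    return re.findall(r"[A-Za-z0-9][A-Za-z0-9-]*", source.upper())


def embed(source: str, dim: int = DIM) -> list[float]:
    """Stand-in encoder (NOT the paper's): hashed bag-of-tokens, L2-normalised."""
    vec = [0.0] * dim
    for tok in tokenize(source):
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    # Inputs are already unit-length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


def retrieve(query: str, corpus: dict[str, str], k: int = 1) -> list[str]:
    """Return the names of the k snippets most similar to the query."""
    q = embed(query)
    scored = sorted(
        ((cosine(q, embed(src)), name) for name, src in corpus.items()),
        reverse=True,
    )
    return [name for _, name in scored[:k]]


# Hypothetical two-snippet corpus, for illustration only.
corpus = {
    "add-totals": "ADD WS-AMOUNT TO WS-TOTAL.",
    "read-file": "READ CUSTOMER-FILE AT END MOVE 'Y' TO WS-EOF.",
}
best = retrieve("add ws-amount to ws-total.", corpus)
```

A learned encoder would replace `embed`; the ranking step stays the same as long as every snippet maps to a vector of the same dimension.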
Taxonomy
Topics: Software Engineering Research · Machine Learning and Data Classification · Advanced Malware Detection Techniques
