Cobol2Vec: Learning Representations of Cobol code

Ankit Kulshrestha; Vishwas Lele

arXiv:2201.09448·cs.PL·January 25, 2022·1 cites

Open Access

TL;DR

This paper introduces Cobol2Vec, an unsupervised method to generate fixed-dimensional vector representations of COBOL code, facilitating tasks like code retrieval and aiding modernization of legacy systems.

Contribution

The paper presents the first unsupervised embedding approach specifically for COBOL, enabling effective representation learning for legacy mainframe languages.

Findings

01 Effective code retrieval on COBOL corpus

02 Unsupervised embeddings capture semantic code features

03 Potential to improve legacy code modernization

Abstract

There has been steadily growing interest in developing novel methods that learn a representation of given input data and then use it in downstream tasks. Natural language processing has seen significant improvements across tasks from incorporating pre-trained embeddings into its pipelines. Recently, these methods have been applied to programming languages with a view to improving developer productivity. In this paper, we present an unsupervised learning approach that encodes old mainframe languages into a fixed-dimensional vector space. Using COBOL as our motivating example, we create a corpus and demonstrate the efficacy of our approach on a code-retrieval task over that corpus.
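To make the retrieval setting concrete, here is a minimal sketch of what "encode code into a fixed-dimensional vector, then retrieve by similarity" can look like. It is not the paper's method: the learned unsupervised model is replaced by a simple hashed bag-of-tokens embedding. The dimension `DIM`, the helper names (`tokenize`, `embed`, `cosine`, `retrieve`), and the three-snippet `CORPUS` are all assumptions made up for illustration.

```python
import hashlib
import math
import re

DIM = 64  # assumed embedding size, chosen only for illustration


def tokenize(source: str) -> list[str]:
    # COBOL words may contain hyphens (e.g. WS-TOTAL), so keep them whole.
    return re.findall(r"[A-Za-z0-9][A-Za-z0-9-]*", source.upper())


def embed(source: str, dim: int = DIM) -> list[float]:
    # Stand-in for a learned encoder: hash each token into one of `dim`
    # buckets, count occurrences, then L2-normalize the count vector.
    vec = [0.0] * dim
    for tok in tokenize(source):
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    # Inputs are already unit-length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


def retrieve(query: str, corpus: dict[str, str], k: int = 1) -> list[str]:
    # Rank corpus entries by similarity to the query embedding.
    q = embed(query)
    ranked = sorted(corpus, key=lambda name: cosine(q, embed(corpus[name])),
                    reverse=True)
    return ranked[:k]


# Hypothetical toy corpus of COBOL statements.
CORPUS = {
    "add-totals": "ADD WS-AMOUNT TO WS-TOTAL.",
    "read-file": "READ CUSTOMER-FILE AT END MOVE 'Y' TO WS-EOF.",
    "show-name": "DISPLAY CUSTOMER-NAME.",
}

print(retrieve("ADD WS-AMOUNT TO WS-TOTAL.", CORPUS))
```

A learned encoder would replace `embed`. The retrieval step (embed the query, rank the corpus by cosine similarity) stays the same whatever model produces the vectors.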

Taxonomy

Topics: Software Engineering Research · Machine Learning and Data Classification · Advanced Malware Detection Techniques