A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code

Nadezhda Chirkova; Sergey Troshin

arXiv:2010.12663·cs.SE·April 28, 2021

TL;DR

This paper introduces identifier anonymization as a simple preprocessing step for handling out-of-vocabulary (OOV) identifiers in source code, and shows that it improves Transformer performance on code completion and bug fixing.

Contribution

The paper presents an easy-to-implement anonymization technique for OOV identifiers that can be applied as a preprocessing step before training deep learning models on source code.

Findings

01 Improves Transformer performance in code completion

02 Improves Transformer performance in bug fixing

03 Handles OOV identifiers with a simple preprocessing step

Abstract

There is an emerging interest in the application of natural language processing models to source code processing tasks. One of the major problems in applying deep learning to software engineering is that source code often contains a lot of rare identifiers, resulting in huge vocabularies. We propose a simple, yet effective method, based on identifier anonymization, to handle out-of-vocabulary (OOV) identifiers. Our method can be treated as a preprocessing step and, therefore, allows for easy implementation. We show that the proposed OOV anonymization method significantly improves the performance of the Transformer in two code processing tasks: code completion and bug fixing.
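
The preprocessing idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `anonymize_oov`, the `VAR` placeholder prefix, the per-snippet numbering, and the use of Python's `str.isidentifier` to decide what counts as an identifier are all assumptions made for illustration. The sketch keeps in-vocabulary tokens unchanged and maps each distinct OOV identifier in a snippet to its own placeholder, so repeated occurrences of the same identifier stay linked.

```python
import keyword


def anonymize_oov(tokens, vocab, prefix="VAR"):
    """Replace OOV identifiers in one snippet with numbered placeholders.

    Hypothetical helper: names and placeholder format are assumptions,
    not taken from the paper or its repository.
    """
    mapping = {}  # original OOV identifier -> placeholder, per snippet
    out = []
    for tok in tokens:
        # Treat a token as an identifier if it is a valid Python name
        # and not a reserved keyword (an assumption for this sketch).
        is_ident = tok.isidentifier() and not keyword.iskeyword(tok)
        if tok in vocab or not is_ident:
            out.append(tok)  # in-vocabulary or non-identifier: keep as-is
            continue
        if tok not in mapping:
            # First occurrence of this OOV identifier: assign the next index.
            mapping[tok] = f"{prefix}{len(mapping) + 1}"
        out.append(mapping[tok])
    return out, mapping


# Toy vocabulary and token sequence, invented for this example.
vocab = {"def", "(", ")", ",", ":", "return", "+", "x", "y"}
tokens = ["def", "add_totals", "(", "x", ",", "subtotal", ")", ":",
          "return", "x", "+", "subtotal"]

anon, mapping = anonymize_oov(tokens, vocab)
print(anon)
print(mapping)
```

Because the mapping is rebuilt for every snippet, the number of placeholders the model must learn is bounded by the largest number of distinct OOV identifiers in a single snippet rather than by the size of the full identifier set.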

Code & Models

Repositories

bayesgroup/code_transformers
pytorch

Taxonomy

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Multi-Head Attention · Layer Normalization · Byte Pair Encoding · Softmax · Adam · Dense Connections