# A Language-Agnostic Model for Semantic Source Code Labeling

**Authors:** Ben Gelman, Bryan Hoyle, Jessica Moore, Joshua Saxe, David Slater

arXiv: 1906.01032 · 2019-06-05

## TL;DR

This paper introduces a language-agnostic deep convolutional neural network, trained on Stack Overflow code snippets and their tags, that automatically predicts semantic labels for source code documents to support code search and high-level comprehension.

## Contribution

The authors develop a novel deep convolutional neural network that generalizes to multiple programming languages for automatic code labeling, trained on Stack Overflow data.

## Key findings

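- Mean ROC AUC of 0.957 on Stack Overflow code snippets, averaged over a long-tailed list of 4,508 tags
- Top-1 accuracy of 86.6% on a diverse set of unlabeled GitHub source code documents, validated manually
- The results indicate that the model transfers its knowledge from Stack Overflow snippets to arbitrary source code documents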
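
To make the first metric concrete, here is a minimal sketch of a mean per-tag ROC AUC for a multi-label tagger. It assumes an unweighted (macro) average over tags, which this summary does not confirm. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC for one tag: the probability that a randomly chosen positive
    example scores higher than a randomly chosen negative one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined without both positives and negatives")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc(y_true: List[List[int]], y_score: List[List[float]]) -> float:
    """Unweighted mean of per-tag AUCs (assumption: macro averaging).
    Rows are snippets, columns are tags."""
    n_tags = len(y_true[0])
    per_tag = [
        roc_auc([row[t] for row in y_true], [row[t] for row in y_score])
        for t in range(n_tags)
    ]
    return sum(per_tag) / n_tags


# Hypothetical toy data: 4 snippets, 2 tags.
y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_score = [[0.9, 0.2], [0.3, 0.8], [0.6, 0.4], [0.1, 0.5]]

print(mean_auc(y_true, y_score))  # tag 0 scores 1.0, tag 1 scores 0.75, mean 0.875
```

The pairwise comparison is quadratic in the number of examples, so it only suits small illustrations. At the scale of thousands of tags, a rank-based implementation from an established library would be the practical choice.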

## Abstract

Code search and comprehension have become more difficult in recent years due to the rapid expansion of available source code. Current tools lack a way to label arbitrary code at scale while maintaining up-to-date representations of new programming languages, libraries, and functionalities. Comprehensive labeling of source code enables users to search for documents of interest and obtain a high-level understanding of their contents. We use Stack Overflow code snippets and their tags to train a language-agnostic, deep convolutional neural network to automatically predict semantic labels for source code documents. On Stack Overflow code snippets, we demonstrate a mean area under ROC of 0.957 over a long-tailed list of 4,508 tags. We also manually validate the model outputs on a diverse set of unlabeled source code documents retrieved from GitHub, and we obtain a top-1 accuracy of 86.6%. This strongly indicates that the model successfully transfers its knowledge from Stack Overflow snippets to arbitrary source code documents.
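
As an illustration of the manually validated top-1 accuracy, the sketch below scores a document as a hit when its single highest-scoring tag appears in the set of tags a reviewer accepted. The exact validation protocol is not described in this summary, so this scoring rule, the function name, and the sample data are all assumptions.

```python
from typing import Dict, List, Set


def top1_accuracy(scores: List[Dict[str, float]], accepted: List[Set[str]]) -> float:
    """Fraction of documents whose top-scoring predicted tag was judged correct.

    scores:   one dict per document mapping tag -> model score
    accepted: one set per document with the tags a reviewer accepted (hypothetical)
    """
    hits = 0
    for doc_scores, ok_tags in zip(scores, accepted):
        best_tag = max(doc_scores, key=doc_scores.get)  # highest-scoring tag
        hits += best_tag in ok_tags
    return hits / len(scores)


# Hypothetical example: three documents with made-up scores and judgments.
scores = [
    {"python": 0.9, "regex": 0.4},
    {"java": 0.7, "sql": 0.8},
    {"c++": 0.6, "pointers": 0.5},
]
accepted = [{"python", "regex"}, {"java"}, {"c++"}]

print(top1_accuracy(scores, accepted))  # 2 of 3 top predictions accepted
```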

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1906.01032/full.md

## Figures

12 figures with captions in the complete paper: https://tomesphere.com/paper/1906.01032/full.md

## References

38 references — full list in the complete paper: https://tomesphere.com/paper/1906.01032/full.md

---
Source: https://tomesphere.com/paper/1906.01032