An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags

Christian D. Newman; Michael J. Decker; Reem S. AlSuhaibani; Anthony Peruma; Satyajit Mohapatra; Tejal Vishnoi; Marcos Zampieri; Mohamed W. Mkaouer; Timothy J. Sheldon; Emily Hill

arXiv:2109.00629·cs.SE·September 6, 2021

TL;DR

This paper introduces an ensemble method combining multiple POS taggers to improve the accuracy of annotating source code identifiers, achieving significant performance gains over individual taggers.

Contribution

The paper presents an ensemble approach that combines three state-of-the-art part-of-speech taggers (SWUM, POSSE, and Stanford) to annotate source code identifiers more accurately than any of them does alone.

Findings

01. Ensemble achieves 75% accuracy at the identifier level

02. Ensemble achieves 84-86% accuracy at the word level

03. Improves accuracy by 17 percentage points over the best individual tagger
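
The gap between the identifier-level and word-level figures follows from how the two are counted. The sketch below is a minimal illustration, assuming that word-level accuracy scores each word on its own while identifier-level accuracy counts an identifier as correct only when every word in it is tagged correctly. The tag names (`V`, `N`, `NM`) and the sample data are made up for illustration and are not taken from the paper's dataset.

```python
from typing import List, Tuple

# One identifier = a list of (predicted_tag, gold_tag) pairs, one pair per word.
TaggedIdentifier = List[Tuple[str, str]]


def word_level_accuracy(identifiers: List[TaggedIdentifier]) -> float:
    """Fraction of individual words whose predicted tag matches the gold tag."""
    total = sum(len(ident) for ident in identifiers)
    if total == 0:
        return 0.0
    correct = sum(pred == gold for ident in identifiers for pred, gold in ident)
    return correct / total


def identifier_level_accuracy(identifiers: List[TaggedIdentifier]) -> float:
    """Fraction of identifiers in which every word is tagged correctly."""
    if not identifiers:
        return 0.0
    correct = sum(all(pred == gold for pred, gold in ident) for ident in identifiers)
    return correct / len(identifiers)


# Hypothetical sample: four identifiers, eight words, one mis-tagged word.
sample = [
    [("V", "V"), ("N", "N")],                # e.g. getName: both words correct
    [("N", "NM"), ("N", "N")],               # e.g. userCount: first word wrong
    [("V", "V"), ("NM", "NM"), ("N", "N")],  # e.g. setFileName: all correct
    [("N", "N")],                            # e.g. index: correct
]

print(word_level_accuracy(sample))        # 7 of 8 words correct
print(identifier_level_accuracy(sample))  # 3 of 4 identifiers fully correct
```

A single wrong word costs one word at the word level but a whole identifier at the identifier level, so the identifier-level score can never exceed the word-level score on the same data.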

Abstract

This paper presents an ensemble part-of-speech tagging approach for source code identifiers. Ensemble tagging is a technique that uses machine learning and the output from multiple part-of-speech taggers to annotate natural language text at a higher quality than the part-of-speech taggers are able to obtain independently. Our ensemble uses three state-of-the-art part-of-speech taggers: SWUM, POSSE, and Stanford. We study the quality of the ensemble's annotations on five different types of identifier names: function, class, attribute, parameter, and declaration statement, at the level of both individual words and full identifier names. We also study and discuss the weaknesses of our tagger to promote the future amelioration of these problems through further research. Our results show that the ensemble achieves 75% accuracy at the identifier level and 84-86% accuracy at the word level.…
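
To make the idea of combining tagger outputs concrete, here is a minimal sketch. It is not the paper's method: the abstract describes a machine-learning ensemble, whereas this example swaps in plain majority voting with a fallback tagger for ties. The function name `majority_vote`, the tag names (`V`, `N`, `NM`), and the per-tagger outputs are all hypothetical; no real SWUM, POSSE, or Stanford API is called.

```python
from collections import Counter
from typing import Dict, List


def majority_vote(tags_by_tagger: Dict[str, List[str]], fallback: str) -> List[str]:
    """Combine per-word tags from several taggers by strict majority.

    tags_by_tagger maps a tagger name to its tag sequence for one identifier.
    When no tag wins a strict majority for a word, the fallback tagger's tag
    is used for that word.
    """
    lengths = {len(tags) for tags in tags_by_tagger.values()}
    if len(lengths) != 1:
        raise ValueError("all taggers must tag the same number of words")
    n_words = lengths.pop()
    n_taggers = len(tags_by_tagger)

    combined = []
    for i in range(n_words):
        votes = Counter(tags[i] for tags in tags_by_tagger.values())
        tag, count = votes.most_common(1)[0]
        if count <= n_taggers / 2:  # no strict majority: defer to the fallback
            tag = tags_by_tagger[fallback][i]
        combined.append(tag)
    return combined


# Hypothetical outputs for the identifier getUserName, split into
# the words "get", "user", "name".
outputs = {
    "swum": ["V", "NM", "N"],
    "posse": ["V", "N", "N"],
    "stanford": ["N", "NM", "N"],
}

print(majority_vote(outputs, fallback="swum"))
```

A learned combiner, as the abstract describes, can go beyond this by weighting taggers differently depending on context, which a fixed vote cannot do.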

Code & Models

Repositories

SCANL/ensemble_tagger
Official
