Software Entity Recognition with Noise-Robust Learning

Tai Nguyen; Yifeng Di; Joohan Lee; Muhao Chen; Tianyi Zhang

arXiv:2308.10564 · cs.SE · August 22, 2023

TL;DR

This paper introduces a noise-robust learning approach for software entity recognition that leverages a large Wikipedia-based lexicon and dataset, significantly improving recognition accuracy over existing methods.

Contribution

It develops a comprehensive software entity lexicon and dataset, and proposes self-regularization to enhance model robustness against noisy training data.

Findings

1. Models with self-regularization outperform vanilla models.

2. Significant improvements over state-of-the-art on benchmark datasets.

3. Public release of models, data, and code for future research.

Abstract

Recognizing software entities such as library names from free-form text is essential to enable many software engineering (SE) technologies, such as traceability link recovery, automated documentation, and API recommendation. While many approaches have been proposed to address this problem, they suffer from small entity vocabularies or noisy training data, hindering their ability to recognize software entities mentioned in sophisticated narratives. To address this challenge, we leverage the Wikipedia taxonomy to develop a comprehensive entity lexicon with 79K unique software entities in 12 fine-grained types, as well as a large labeled dataset of over 1.7M sentences. Then, we propose self-regularization, a noise-robust learning approach, to the training of our software entity recognition (SER) model by accounting for many dropouts. Results show that models trained with…
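The abstract describes self-regularization only as a noise-robust training approach that accounts for many dropouts. One plausible reading, sketched below as an assumption rather than the paper's confirmed formulation, is that the same sentence is passed through the model several times with different dropout masks, and the loss combines the usual cross-entropy with a penalty on disagreement between passes. The function names, the symmetric-KL penalty, and the `alpha` weight are all hypothetical choices made for illustration.

```python
import math

def cross_entropy(probs, gold):
    """Mean negative log-likelihood of the gold label at each token."""
    return -sum(math.log(p[g]) for p, g in zip(probs, gold)) / len(gold)

def kl(p, q):
    """KL(p || q) for two strictly positive distributions over the same labels."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def self_regularized_loss(passes, gold, alpha=1.0):
    """Cross-entropy plus a dropout-agreement penalty (illustrative assumption).

    passes: list of K forward passes; each pass is a list of per-token
            label distributions produced under a different dropout mask.
    gold:   list of gold label indices, one per token (possibly noisy).
    alpha:  hypothetical weight on the agreement term.
    """
    k = len(passes)
    # Supervised term: average cross-entropy over all dropout passes.
    ce = sum(cross_entropy(p, gold) for p in passes) / k

    # Agreement term: symmetric KL between every unordered pair of passes,
    # averaged over tokens and then over pairs.
    pairs = [(a, b) for a in range(k) for b in range(a + 1, k)]
    agree = 0.0
    for a, b in pairs:
        per_token = [0.5 * (kl(p, q) + kl(q, p))
                     for p, q in zip(passes[a], passes[b])]
        agree += sum(per_token) / len(per_token)
    agree = agree / len(pairs) if pairs else 0.0

    return ce + alpha * agree

# Two dropout passes over a two-token sentence with three labels.
pass_a = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
pass_b = [[0.6, 0.3, 0.1], [0.2, 0.6, 0.2]]
loss = self_regularized_loss([pass_a, pass_b], gold=[0, 1], alpha=1.0)
```

Under this reading, the agreement term does not depend on the gold labels, so it can pull the model toward self-consistent predictions even where a training label is wrong. When all passes produce identical distributions the penalty is zero and the loss reduces to plain cross-entropy.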

Code & Models

Repositories

taidnguyen/software_entity_recognition (PyTorch, official)

Datasets

taidng/WikiSER (dataset, 22 downloads)

Taxonomy

Topics: Software Engineering Research · Wikis in Education and Collaboration · Topic Modeling

Methods: Lib