Software Entity Recognition with Noise-Robust Learning
Tai Nguyen, Yifeng Di, Joohan Lee, Muhao Chen, Tianyi Zhang

TL;DR
This paper introduces a noise-robust learning approach for software entity recognition that leverages a large Wikipedia-based lexicon and dataset, significantly improving recognition accuracy over existing methods.
Contribution
The paper builds a comprehensive software entity lexicon and labeled dataset, and proposes self-regularization to make models more robust to noisy training data.
Findings
Models with self-regularization outperform vanilla models.
The approach improves significantly over state-of-the-art methods on benchmark datasets.
Models, data, and code are publicly released for future research.
Abstract
Recognizing software entities such as library names from free-form text is essential to enable many software engineering (SE) technologies, such as traceability link recovery, automated documentation, and API recommendation. While many approaches have been proposed to address this problem, they suffer from small entity vocabularies or noisy training data, hindering their ability to recognize software entities mentioned in sophisticated narratives. To address this challenge, we leverage the Wikipedia taxonomy to develop a comprehensive entity lexicon with 79K unique software entities in 12 fine-grained types, as well as a large labeled dataset of over 1.7M sentences. Then, we propose self-regularization, a noise-robust learning approach, to the training of our software entity recognition (SER) model by accounting for many dropouts. Results show that models trained with…
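The abstract says only that self-regularization accounts for many dropouts; it does not spell out the loss. The following is a minimal sketch under the assumption that the idea resembles dropout-consistency training: the same sentence is passed through the model twice with different dropout masks, and the loss combines the usual cross-entropy on the (possibly noisy) labels with a penalty on disagreement between the two predictions. The function names, the symmetric KL term, and the `alpha` weight are illustrative assumptions, not the authors' implementation.

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the given label under one distribution."""
    return -math.log(probs[label])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same label set.

    Assumes q is strictly positive wherever p is (true for softmax outputs).
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def self_regularized_loss(probs_a, probs_b, labels, alpha=1.0):
    """Hypothetical self-regularized loss for one sentence.

    probs_a, probs_b: per-token label distributions from two forward passes
        of the same model, each with a different dropout mask.
    labels: one (possibly noisy) label index per token.
    alpha: assumed weight on the agreement term.
    """
    n = len(labels)
    # Supervised term: average cross-entropy over both passes.
    ce = sum(
        cross_entropy(a, y) + cross_entropy(b, y)
        for a, b, y in zip(probs_a, probs_b, labels)
    ) / (2 * n)
    # Agreement term: symmetric KL between the two passes, per token.
    agree = sum(
        kl_divergence(a, b) + kl_divergence(b, a)
        for a, b in zip(probs_a, probs_b)
    ) / (2 * n)
    return ce + alpha * agree

# Two tokens, three labels (e.g. O, B-Library, I-Library).
pass_a = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
pass_b = [[0.4, 0.5, 0.1], [0.1, 0.8, 0.1]]
noisy_labels = [0, 1]
loss = self_regularized_loss(pass_a, pass_b, noisy_labels, alpha=1.0)
```

When the two passes agree exactly, the agreement term vanishes and the loss reduces to plain cross-entropy; the more the passes disagree on a token, the larger the penalty, which is one plausible way to discourage a model from fitting labels it can only fit unstably.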
Taxonomy
Topics: Software Engineering Research · Wikis in Education and Collaboration · Topic Modeling
Methods: Lib
