Less, but Stronger: On the Value of Strong Heuristics in Semi-supervised Learning for Software Analytics

Huy Tu; Tim Menzies

arXiv:2302.01997 · cs.SE · February 7, 2023

TL;DR

This paper demonstrates that using strong, domain-specific heuristics in semi-supervised learning significantly improves performance in software analytics tasks, often with minimal labeled data.

Contribution

It introduces the FRUGAL algorithm that leverages strong heuristics, outperforming standard SSL methods across multiple software engineering domains.

Findings

01 FRUGAL requires only 2.5% labeled data

02 Outperforms standard SSL algorithms in four domains

03 Strong heuristics enhance semi-supervised learning effectiveness

Abstract

In many domains, there are many examples but far fewer labels for those examples; e.g., we may have access to millions of lines of source code but only a handful of warnings about that code. In those domains, semi-supervised learners (SSL) can extrapolate labels from a small number of examples to the rest of the data. Standard SSL algorithms use "weak" knowledge (i.e., knowledge not based on specific SE knowledge), such as co-training two learners and using good labels from one to train the other. Another approach to SSL in software analytics is to use "strong" knowledge, i.e., knowledge specific to SE. For example, an often-used heuristic in SE is that unusually large artifacts contain undesired properties (e.g., more bugs). This paper argues that such "strong" algorithms perform better than the standard, weaker SSL algorithms. We show this by learning models from labels…
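To make the size heuristic from the abstract concrete, here is a minimal sketch of how unlabeled artifacts might be given provisional labels from size alone. It is an illustration only and is not the paper's FRUGAL algorithm: the median cutoff, the function name `pseudo_label_by_size`, and the sample line counts are all assumptions introduced for this example.

```python
from statistics import median


def pseudo_label_by_size(sizes):
    """Return 1 for artifacts larger than the median size, else 0.

    Assumption: "unusually large" is approximated as "above the median".
    """
    cutoff = median(sizes)
    return [1 if size > cutoff else 0 for size in sizes]


# Hypothetical lines-of-code counts for six source files.
loc = [120, 45, 980, 60, 310, 2200]
labels = pseudo_label_by_size(loc)
print(labels)
```

Labels produced this way need no human annotation, which is why a heuristic of this kind can stand in for most of the labeling effort; how well it works in practice is the question the paper evaluates.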

Code & Models

Repositories

huytu7/frugal (Official)

Taxonomy

Topics: Software Engineering Research · Machine Learning and Data Classification · Software Reliability and Analysis Research