TL;DR
HiGitClass is a flexible, keyword-driven hierarchical classification framework for GitHub repositories that effectively integrates structured and unstructured data to improve topic-based search and analysis.
Contribution
The paper introduces HiGitClass, a framework for automatic repository classification that needs only minimal supervision: a label hierarchy annotated with keywords.
Findings
Outperforms existing weakly-supervised methods
Effectively integrates multi-modal signals
Handles supervision scarcity and bias
Abstract
GitHub has become an important platform for code sharing and scientific exchange. With the massive number of repositories available, there is a pressing need for topic-based search. Even though the topic label functionality has been introduced, the majority of GitHub repositories do not have any labels, impeding the utility of search and topic-based analysis. This work targets the automatic repository classification problem as keyword-driven hierarchical classification. Specifically, users only need to provide a label hierarchy with keywords to supply as supervision. This setting is flexible, adaptive to the users' needs, accounts for the different granularity of topic labels and requires minimal human effort. We identify three key challenges of this problem, namely (1) the presence of multi-modal signals; (2) supervision scarcity and bias; (3) supervision format mismatch. In…
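To make the supervision format concrete, here is a minimal sketch of what a keyword-annotated label hierarchy could look like, paired with a naive keyword-counting baseline. This is not the HiGitClass method. The labels, keywords, and the `classify` helper are all hypothetical, chosen only to illustrate the input a user would supply.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical two-level label hierarchy: parent -> child -> seed keywords.
# These labels and keywords are illustrative, not taken from the paper.
HIERARCHY: Dict[str, Dict[str, List[str]]] = {
    "machine-learning": {
        "computer-vision": ["image", "segmentation"],
        "nlp": ["text", "translation"],
    },
    "bioinformatics": {
        "genomics": ["genome", "sequencing"],
    },
}


def classify(text: str) -> Optional[Tuple[str, str]]:
    """Return the (parent, child) path whose keywords occur most often.

    A naive baseline: count exact keyword matches among whitespace-separated,
    lowercased tokens. Returns None when no keyword matches at all.
    """
    tokens = text.lower().split()
    best: Optional[Tuple[str, str]] = None
    best_hits = 0
    for parent, children in HIERARCHY.items():
        for child, keywords in children.items():
            hits = sum(tokens.count(k) for k in keywords)
            if hits > best_hits:
                best, best_hits = (parent, child), hits
    return best


if __name__ == "__main__":
    print(classify("A toolkit for image segmentation of medical scans"))
```

A baseline like this relies on repository text alone and on exact matches against a handful of keywords, which is where the challenges listed in the abstract (multi-modal signals, supervision scarcity and bias) come into play.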
