LLM-based Content Classification Approach for GitHub Repositories by the README Files
Malik Uzair Mehmood, Shahid Hussain, Wen Li Wang, Muhammad Usama Malik

TL;DR
This paper develops an LLM-based method to automatically classify sections of GitHub README files, significantly improving accuracy and efficiency, and explores parameter-efficient fine-tuning techniques as economical alternatives.
Contribution
It introduces a fine-tuning approach for LLMs to classify README sections that outperforms existing methods, and it evaluates parameter-efficient fine-tuning (PEFT) techniques such as LoRA for cost-effective training.
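For readers unfamiliar with LoRA, the sketch below illustrates why it tends to be cheaper than full fine-tuning: the pretrained weight matrix W stays frozen and only two small factors, B and A, are trained, giving an effective weight of W + (alpha / r) * B A. The function names, layer sizes, and rank used here are illustrative assumptions, not values taken from the paper.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters under LoRA: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)


def matmul(x, y):
    """Plain nested-list matrix product, kept dependency-free for illustration."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_effective_weight(w, b, a, alpha: float):
    """Return W + (alpha / r) * B A, where r is the number of rows in A."""
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


if __name__ == "__main__":
    # Hypothetical 4096 x 4096 projection layer with rank 8 (assumed values).
    print(full_finetune_params(4096, 4096))  # 16777216
    print(lora_params(4096, 4096, 8))        # 65536
```

With these assumed dimensions, the low-rank factors hold 256 times fewer trainable parameters than the full matrix, which is the source of the cost savings that PEFT methods aim for.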
Findings
Achieved an F1 score of 0.98 on README section classification.
Demonstrated the effectiveness of PEFT techniques like LoRA.
Outperformed current state-of-the-art methods.
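As a reminder of what an F1 score measures in a multi-class setting such as section classification, here is a minimal sketch that computes per-class F1 and averages it across classes (macro averaging). The summary does not say which averaging scheme the paper reports, so macro averaging is an assumption, and the section labels in the usage example are hypothetical.

```python
def f1_for_label(y_true, y_pred, label) -> float:
    """F1 for one class: the harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred) -> float:
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, label) for label in labels) / len(labels)


if __name__ == "__main__":
    # Hypothetical section labels, not the paper's label set.
    truth = ["install", "usage", "usage", "license"]
    predicted = ["install", "usage", "license", "license"]
    print(round(macro_f1(truth, predicted), 4))
```

Macro averaging weights every class equally, so rare section types count as much as common ones; a micro or weighted average would behave differently on an imbalanced label set.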
Abstract
GitHub is the world's most popular platform for storing, sharing, and managing code. Every GitHub repository has a README file associated with it. The README files should contain project-related information as per the recommendations of GitHub to support the usage and improvement of repositories. However, GitHub repository owners sometimes neglect these recommendations. This prevents a GitHub repository from reaching its full potential. This research posits that the comprehensiveness of a GitHub repository's README file significantly influences its adoption and utilization, with a lack of detail potentially hindering its full potential for widespread engagement and impact within the research community. Large Language Models (LLMs) have shown great performance in many text-based tasks including text classification, text generation, text summarization and text translation. In this…
