Detecting Hard-Coded Credentials in Software Repositories via LLMs
Chidera Biringa, Gokhan Kul

TL;DR
This paper explores using Large Language Models (LLMs) such as GPT to detect hard-coded credentials in software repositories, reporting an F1 score 13% higher than previous methods.
Contribution
It introduces an approach that combines LLM embeddings with deep classifiers to identify hard-coded credentials, surpassing the existing state of the art.
Findings
Outperforms current methods by 13% in F1 score
Utilizes LLM embeddings for improved context understanding
Provides publicly available code and data for reproducibility
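For readers unfamiliar with the metric behind the headline result, the F1 score is the harmonic mean of precision and recall. The sketch below computes it from raw counts; the function name and the example counts are illustrative assumptions, not figures from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Compute F1 from true positives, false positives, and false negatives."""
    # Precision: share of flagged strings that really are credentials.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of real credentials that were flagged.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up counts for illustration only.
    print(round(f1_score(tp=8, fp=2, fn=2), 3))
```

Because false positives lower precision, a detector that flags many harmless strings is penalized in F1 even if it misses few real credentials.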
Abstract
Software developers frequently hard-code credentials such as passwords, generic secrets, private keys, and generic tokens in software repositories, even though the practice is strictly advised against because of the severe threat it poses to the security of the software. These credentials create attack surfaces that a potential adversary can exploit to conduct malicious attacks such as backdoor attacks. Recent detection efforts use embedding models to vectorize textual credentials before passing them to classifiers for predictions. However, these models struggle to discriminate between credentials with contextual and complex sequences, resulting in a high number of false positive predictions. Context-dependent Pre-trained Language Models (PLMs) or Large Language Models (LLMs) such as Generative Pre-trained Transformers (GPT) tackled this drawback by leveraging the transformer neural architecture capacity for…
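The embed-then-classify pipeline described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `embed` is a hypothetical stand-in that hashes character trigrams (the paper uses LLM embeddings), the classifier is a plain logistic regression rather than a deep classifier, and the sample strings and labels are invented.

```python
import hashlib
import math

DIM = 64  # size of the toy embedding vector (arbitrary choice)


def embed(text: str, dim: int = DIM) -> list[float]:
    """Hypothetical stand-in for an LLM embedding: hashed character trigrams."""
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        gram = text[i:i + 3]
        bucket = int(hashlib.md5(gram.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    # L2-normalize so string length does not dominate the vector.
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w: list[float], b: float, x: list[float]) -> float:
    """Probability that the embedded string is a credential."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train(xs: list[list[float]], ys: list[int], epochs: int = 200, lr: float = 0.5):
    """Fit logistic regression with plain stochastic gradient descent."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            grad = predict_proba(w, b, x) - y  # derivative of log loss w.r.t. the logit
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


# Invented examples: 1 = hard-coded credential, 0 = benign line.
SAMPLES = [
    ('password = "hunter2"', 1),
    ('api_key = "sk_live_abc123"', 1),
    ('password = input("Password: ")', 0),
    ('timeout = 30', 0),
]

if __name__ == "__main__":
    xs = [embed(text) for text, _ in SAMPLES]
    ys = [label for _, label in SAMPLES]
    w, b = train(xs, ys)
    for text, _ in SAMPLES:
        print(f"{predict_proba(w, b, embed(text)):.2f}  {text}")
```

The trigram stand-in has no notion of context, which is the weakness the abstract attributes to earlier embedding models; swapping `embed` for a context-dependent model is the substitution the paper argues for.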
Taxonomy
Topics: Scientific Computing and Data Management
