Detecting Hard-Coded Credentials in Software Repositories via LLMs

Chidera Biringa; Gokhan Kul

arXiv:2506.13090·cs.CR·June 17, 2025

TL;DR

This paper explores using large language models (LLMs) such as GPT to detect hard-coded credentials in software repositories, and reports an F1 score 13% higher than previous methods.

Contribution

The paper introduces an approach that combines LLM embeddings with deep classifiers to identify hard-coded credentials, surpassing the existing state of the art.
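The embed-then-classify idea can be illustrated with a minimal sketch. This is not the authors' implementation: a hashed character-trigram vector stands in for an LLM embedding, a small logistic regression stands in for the deep classifier, and the names `embed`, `train`, `score`, `DIM`, and `EXAMPLES`, along with the obviously fake sample strings, are all hypothetical.

```python
import math
import zlib

DIM = 64  # size of the stand-in embedding vector (hypothetical choice)

# Tiny illustrative dataset: (line of code, 1 = credential, 0 = benign).
# All "secrets" here are fake placeholders.
EXAMPLES = [
    ('password = "example-Passw0rd!"', 1),
    ('api_token = "tok_EXAMPLE_0123456789abcdef"', 1),
    ('secret_key = "EXAMPLEKEY/not-a-real-secret"', 1),
    ('timeout_seconds = 30', 0),
    ('log_level = "debug"', 0),
    ('username_label = "Enter your name"', 0),
]


def embed(text: str, dim: int = DIM) -> list[float]:
    """Stand-in for an LLM embedding: hashed character trigrams, L2-normalised."""
    vec = [0.0] * dim
    padded = f"  {text}  "
    for i in range(len(padded) - 2):
        gram = padded[i:i + 3].encode("utf-8")
        vec[zlib.crc32(gram) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train(vectors, labels, epochs: int = 500, lr: float = 0.5):
    """Fit a logistic-regression classifier with plain SGD (stand-in for a deep classifier)."""
    weights, bias = [0.0] * len(vectors[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            pred = _sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            grad = pred - y  # gradient of the log loss w.r.t. the logit
            weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
            bias -= lr * grad
    return weights, bias


def score(weights, bias, text: str) -> float:
    """Return the predicted probability that `text` contains a credential."""
    x = embed(text)
    return _sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


if __name__ == "__main__":
    w, b = train([embed(t) for t, _ in EXAMPLES], [y for _, y in EXAMPLES])
    for text, label in EXAMPLES:
        print(f"{label}  {score(w, b, text):.2f}  {text}")
```

In a real pipeline the `embed` function would be replaced by vectors taken from a pre-trained language model, which is where the context sensitivity the paper relies on would come from; the trigram hash above captures none of that and only shows the data flow.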

Findings

01. Outperforms current methods by 13% in F1 score

02. Utilizes LLM embeddings for improved context understanding

03. Provides publicly available code and data for reproducibility
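For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall, so a detector that raises many false positives is penalised even when it misses few credentials. The sketch below shows the standard calculation; the counts are hypothetical and are not results from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Compute F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical counts: same recall, but the second detector raises more false alarms.
    print(f1_score(tp=8, fp=2, fn=2))
    print(f1_score(tp=8, fp=8, fn=2))
```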

Abstract

Software developers frequently hard-code credentials such as passwords, generic secrets, private keys, and generic tokens in software repositories, even though this practice is strictly advised against because of the severe threat it poses to the security of the software. These credentials create attack surfaces that a potential adversary can exploit to conduct malicious attacks such as backdoor attacks. Recent detection efforts use embedding models to vectorize textual credentials before passing them to classifiers for prediction. However, these models struggle to discriminate between credentials with contextual and complex sequences, resulting in a high rate of false positive predictions. Context-dependent Pre-trained Language Models (PLMs) or Large Language Models (LLMs) such as Generative Pre-trained Transformers (GPT) tackle this drawback by leveraging the transformer neural architecture's capacity for…

Taxonomy

Topics: Scientific Computing and Data Management