CmdCaliper: A Semantic-Aware Command-Line Embedding Model and Dataset for Security Research
Sian-Yao Huang, Cheng-Lin Yang, Che-Yu Lin, Chun-Ying Huang

TL;DR
This paper introduces CmdCaliper, a semantic-aware command-line embedding model trained on a new dataset, CyPHER. The model improves security-related command-line analysis and outperforms larger embedding models.
Contribution
The paper presents the first dataset of similar command lines and a novel embedding model, CmdCaliper, optimized for cybersecurity tasks, with publicly released resources.
Findings
CmdCaliper outperforms larger SOTA models in command-line similarity tasks.
The CyPHER dataset enables unbiased evaluation of command-line embeddings.
Data generation using LLMs is feasible for cybersecurity applications.
Abstract
This research addresses command-line embedding in cybersecurity, a field obstructed by the lack of comprehensive datasets due to privacy and regulation concerns. We propose the first dataset of similar command lines, named CyPHER, for training and unbiased evaluation. The training set, generated using a set of large language models (LLMs), comprises 28,520 similar command-line pairs. Our testing dataset consists of 2,807 similar command-line pairs sourced from authentic command-line data. In addition, we propose a command-line embedding model named CmdCaliper, enabling the computation of semantic similarity between command lines. Performance evaluations demonstrate that the smallest version of CmdCaliper (30 million parameters) surpasses state-of-the-art (SOTA) sentence embedding models with ten times more parameters across various tasks (e.g., malicious command-line detection and…
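The abstract describes computing semantic similarity between command lines from their embeddings. The sketch below illustrates only that general pattern: embed each command line as a vector, then compare vectors with cosine similarity. It does not use CmdCaliper itself. The function `toy_embed` is a hypothetical stand-in (hashed character trigrams) that captures surface overlap rather than semantics, and `rank_by_similarity` is likewise an illustrative helper, not part of any released code.

```python
import hashlib
import math


def toy_embed(command, dim=256):
    """Hypothetical stand-in for a learned embedding model.

    Maps a command line to a fixed-size vector of hashed character-trigram
    counts. A real model such as CmdCaliper would replace this function.
    """
    vec = [0.0] * dim
    text = command.lower()
    for i in range(len(text) - 2):
        trigram = text[i:i + 3]
        digest = hashlib.md5(trigram.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        vec[index] += 1.0
    return vec


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return (candidate, score) pairs sorted from most to least similar to the query."""
    query_vec = toy_embed(query)
    scored = [(c, cosine_similarity(query_vec, toy_embed(c))) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Example command lines chosen for illustration only.
    query = "powershell -enc SQBFAFgA"
    candidates = [
        "ping 8.8.8.8",
        "powershell -EncodedCommand SQBFAFgA",
        "ipconfig /all",
    ]
    for command, score in rank_by_similarity(query, candidates):
        print(f"{score:.3f}  {command}")
```

With a learned embedding model in place of `toy_embed`, the same ranking step could in principle support the kinds of tasks the abstract mentions, such as retrieving command lines similar to a known malicious one.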
Taxonomy
Topics: Network Security and Intrusion Detection · Information and Cyber Security · Advanced Malware Detection Techniques
Methods: Sparse Evolutionary Training
