Shell Language Processing: Unix command parsing for Machine Learning

Dmitrijs Trizna

arXiv:2107.02438 · cs.LG · July 8, 2022

TL;DR

This paper introduces a Shell Language Preprocessing (SLP) library that tokenizes and encodes Unix and Linux shell commands. On a security classification task, it raises the F1 score from 0.392 to 0.874 compared with widely accepted ICT tokenization techniques.

Contribution

The paper presents a novel tokenization and encoding approach tailored for shell commands, addressing limitations of conventional NLP pipelines.

Findings

01. The F1 score improves from 0.392 to 0.874 on a security classification task.

02. The method outperforms widely accepted information and communications technology (ICT) tokenization techniques.

03. The results highlight the need for specialized preprocessing in shell command analysis.

Abstract

In this article, we present a Shell Language Preprocessing (SLP) library, which implements tokenization and encoding directed at parsing Unix and Linux shell commands. We describe the rationale behind the need for a new approach, with specific examples of when conventional Natural Language Processing (NLP) pipelines fail. Furthermore, we evaluate our methodology on a security classification task against widely accepted information and communications technology (ICT) tokenization techniques and achieve a significant improvement in F1 score, from 0.392 to 0.874.
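
To make the tokenization problem described in the abstract concrete, the sketch below contrasts plain whitespace splitting with a shell-aware split. It is only an illustration and does not use the SLP library: it relies on Python's standard `shlex` module as a stand-in, and the function names `naive_tokens`, `shell_tokens`, and `encode_counts`, as well as the sample command, are hypothetical.

```python
import shlex
from collections import Counter


def naive_tokens(command):
    """Whitespace splitting, as a generic text pipeline might apply."""
    return command.split()


def shell_tokens(command):
    """Shell-aware splitting using the standard library (not SLP itself)."""
    # punctuation_chars=True makes operators such as | ; && separate tokens;
    # posix=True strips quotes and keeps a quoted string as one token.
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    return list(lexer)


def encode_counts(tokens):
    """Hypothetical bag-of-tokens encoding: token -> occurrence count."""
    return Counter(tokens)


if __name__ == "__main__":
    # Sample command chosen for illustration only.
    cmd = "cat /etc/passwd|grep 'root user' && echo done"
    print(naive_tokens(cmd))
    print(shell_tokens(cmd))
    print(encode_counts(shell_tokens(cmd)))
```

With whitespace splitting, the pipe stays fused into `/etc/passwd|grep` and the quoted argument is broken in two. The shell-aware split isolates the operator and keeps the quoted argument whole, which is the kind of structure a downstream encoder can then count or vectorize.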

Code & Models

Repositories

dtrizna/slp · tf · Official

Taxonomy

Topics: Advanced Malware Detection Techniques · Software Engineering Research · Security and Verification in Computing