Classification of URL bitstreams using Bag of Bytes

Keiichi Shima; Daisuke Miyamoto; Hiroshi Abe; Tomohiro Ishihara; Kazuya Okada; Yuji Sekiya; Hirochika Asai; Yusuke Doi

arXiv:2111.06087·cs.NI·November 12, 2021

TL;DR

This paper introduces a mechanical feature extraction method for classifying URL bitstreams, demonstrating improved accuracy over existing deep learning approaches using real-world and phishing data.

Contribution

The paper presents a novel mechanical approach for generating URL features that outperforms existing deep learning methods in URL classification accuracy.
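As a rough illustration of what a mechanically generated "bag of bytes" feature could look like, the sketch below counts how often each byte value occurs in a URL and returns a fixed-length 256-element vector. This is an assumption made for illustration only: the summary above does not describe the authors' exact encoding, and the function name `bag_of_bytes` and the sample URL are hypothetical.

```python
from typing import List


def bag_of_bytes(url: str) -> List[int]:
    """Count occurrences of each byte value (0-255) in the UTF-8 encoding of a URL.

    Illustrative assumption: a plain 256-bin byte histogram. The paper's
    actual feature construction may differ.
    """
    counts = [0] * 256
    for byte in url.encode("utf-8"):
        counts[byte] += 1
    return counts


# Hypothetical sample URL, used only to exercise the function.
vec = bag_of_bytes("http://example.com/a")
```

The vector has the same length for every URL and needs no hand-designed features, so it can be passed directly to a classifier.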

Findings

01  Achieved 2-3% higher accuracy than existing DL-based methods.

02  Validated approach on real URL access data and phishing site data.

03  Demonstrated effectiveness of mechanical feature extraction in URL classification.

Abstract

Protecting users from accessing malicious web sites is one of the important management tasks for network operators. There are many open-source and commercial products that control which web sites users can access. The most traditional approach is blacklist-based filtering. This mechanism is simple but not scalable, though there are some enhanced approaches that use fuzzy matching technologies. Other approaches use machine learning (ML) techniques, extracting features from URL strings. These can cover a wider range of Internet web sites, but finding good features requires deep knowledge of trends in web site design. Recently, another approach using deep learning (DL) has appeared. The DL approach helps to extract features automatically by examining a large amount of existing sample data. Using this technique, we can build a flexible filtering decision module by continuing to teach the…
