A Deep Learning Approach to Fast, Format-Agnostic Detection of Malicious Web Content

Joshua Saxe; Richard Harang; Cody Wild; Hillary Sanders

arXiv:1804.05020·cs.CR·April 16, 2018·1 cites

TL;DR

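This paper presents a fast, format-agnostic deep learning method that detects malicious web pages directly from tokens extracted from static HTML. It reports a 97.5% detection rate at a 0.1% false positive rate and classifies over 100 pages per second on commodity hardware, which makes it suitable for real-time security applications.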

Contribution

It introduces a hierarchical neural network that operates on token streams extracted from static HTML. This avoids complex parsing and emulation, keeps classification fast, and yields higher accuracy than bag-of-words baselines.

Findings

01. 97.5% detection rate at a 0.1% false positive rate

02. Classifies over 100 web pages per second on commodity hardware

03. Operates effectively in high-frequency data environments such as firewalls
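The first finding pairs a detection rate with a fixed false positive rate. The sketch below shows one common way such a figure can be computed from classifier scores; it is an illustration under assumptions, not the authors' evaluation code. The function name `detection_rate_at_fpr` and the convention that label 1 means malicious and label 0 means benign are hypothetical choices made here.

```python
def detection_rate_at_fpr(scores, labels, max_fpr=0.001):
    """Return the detection rate (true positive rate) at a false positive budget.

    Hypothetical helper: scores are classifier outputs where higher means
    "more likely malicious"; labels use 1 for malicious and 0 for benign.
    """
    if not 0 <= max_fpr < 1:
        raise ValueError("max_fpr must be in [0, 1)")
    benign = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    malicious = [s for s, y in zip(scores, labels) if y == 1]
    if not benign or not malicious:
        raise ValueError("need at least one benign and one malicious sample")

    # Number of benign pages we may flag without exceeding the budget.
    allowed_fp = int(max_fpr * len(benign))

    # Flag a page only when its score is strictly above the score of the
    # first benign page that must NOT be flagged. Strict comparison keeps
    # the false positive count within budget even when scores tie.
    threshold = benign[allowed_fp]
    detected = sum(1 for s in malicious if s > threshold)
    return detected / len(malicious)
```

With `max_fpr=0.001`, a return value of 0.975 would correspond to the operating point quoted above. Estimating a rate this small reliably needs a benign test set with many thousands of pages, since `allowed_fp` rounds down to zero for small sets.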

Abstract

Malicious web content is a serious problem on the Internet today. In this paper we propose a deep learning approach to detecting malevolent web pages. While past work on web content detection has relied on syntactic parsing or on emulation of HTML and JavaScript to extract features, our approach operates directly on a language-agnostic stream of tokens extracted from static HTML files with a simple regular expression. This makes it fast enough to operate in high-frequency data contexts like firewalls and web proxies, and allows it to avoid the attack surface exposure of complex parsing and emulation code. Unlike well-known approaches such as bag-of-words models, which ignore spatial information, our neural network examines content at hierarchical spatial scales, allowing our model to capture locality and yielding superior accuracy compared to bag-of-words baselines. Our…
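The two ideas in the abstract, regex tokenization and examining content at several spatial scales, can be illustrated with a small feature-extraction sketch. This is not the authors' network: the regular expression, the CRC32 hashing trick, the bin count, and the scales `(1, 2, 4)` are all assumptions chosen for illustration, and the function names are hypothetical. The sketch only shows how a token stream can be summarized both as one whole-document bag and as bags over progressively smaller contiguous chunks, which is the locality information a plain bag-of-words model discards.

```python
import re
import zlib

# Assumed token pattern: runs of non-ASCII characters, or runs of word characters.
TOKEN_RE = re.compile(r"[^\x00-\x7F]+|\w+")


def tokenize(html):
    """Extract tokens from raw HTML with a single regex, with no parsing."""
    return TOKEN_RE.findall(html)


def hashed_bag(tokens, n_bins):
    """Count tokens into a fixed-length vector using a stable hash."""
    vec = [0] * n_bins
    for tok in tokens:
        vec[zlib.crc32(tok.encode("utf-8")) % n_bins] += 1
    return vec


def split_chunks(tokens, n_chunks):
    """Split the token list into n_chunks contiguous pieces (last may be short)."""
    size = max(1, -(-len(tokens) // n_chunks))  # ceiling division
    return [tokens[i * size:(i + 1) * size] for i in range(n_chunks)]


def hierarchical_features(html, n_bins=16, scales=(1, 2, 4)):
    """Return one hashed bag per chunk at every scale, coarsest scale first."""
    tokens = tokenize(html)
    features = []
    for n_chunks in scales:
        for chunk in split_chunks(tokens, n_chunks):
            features.append(hashed_bag(chunk, n_bins))
    return features
```

At scale 1 the output is an ordinary hashed bag-of-words vector for the whole page. The finer scales record where in the page each token occurred, so a downstream model could distinguish, for example, a suspicious script near the end of a file from the same tokens spread throughout it.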

Taxonomy

Topics: Spam and Phishing Detection · Advanced Malware Detection Techniques · Network Security and Intrusion Detection

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings