WebGuard++: Interpretable Malicious URL Detection via Bidirectional Fusion of HTML Subgraphs and Multi-Scale Convolutional BERT

Ye Tian; Zhang Yumin; Yifan Jia; Jianguo Sun; Yanbin Wang

arXiv:2506.19356·cs.CR·June 25, 2025

Open Access

TL;DR

WebGuard++ introduces a novel, interpretable framework for malicious URL detection by effectively fusing URL and HTML features through hierarchical encoding, subgraph analysis, and bidirectional interaction, significantly outperforming existing methods.

Contribution

The paper presents four innovative components that jointly enhance malicious URL detection by capturing comprehensive URL semantics, amplifying threat signals in HTML graphs, and enabling interpretable, bidirectional feature interaction.

Findings

01. Achieves 1.1x to 7.9x higher true positive rate (TPR) than baselines at fixed false positive rates (FPRs).

02. Effectively localizes malicious regions within HTML DOM structures.

03. Demonstrates robustness across multiple datasets.
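
The "TPR at a fixed FPR" metric in the first finding can be illustrated with a minimal sketch. The function name `tpr_at_fpr`, the toy labels and scores, and the FPR budgets below are all assumptions made for illustration; this listing does not state which operating points or evaluation code the paper uses.

```python
def tpr_at_fpr(labels, scores, max_fpr):
    """Return the highest TPR reachable while keeping FPR <= max_fpr.

    labels: 1 for malicious, 0 for benign (hypothetical toy convention).
    scores: higher means "more likely malicious".
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative sample")

    # Sweep the decision threshold from the highest score downward.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    best_tpr = 0.0
    i = 0
    while i < len(ranked):
        # Samples with identical scores cross the threshold together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        if fp / negatives > max_fpr:
            break  # lowering the threshold further exceeds the FPR budget
        best_tpr = tp / positives
        i = j
    return best_tpr


# Toy data, invented for illustration only.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 1]
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.6]
strict = tpr_at_fpr(toy_labels, toy_scores, max_fpr=0.0)
relaxed = tpr_at_fpr(toy_labels, toy_scores, max_fpr=0.25)
```

Comparing detectors this way fixes the false-alarm budget first and then asks how many malicious samples each one still catches, so a ratio such as "1.1x to 7.9x" compares TPR values taken at the same FPR.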

Abstract

URL+HTML feature fusion shows promise for robust malicious URL detection, since attacker artifacts persist in DOM structures. However, prior work suffers from four critical shortcomings: (1) incomplete URL modeling, which fails to jointly capture lexical patterns and semantic context; (2) HTML graph sparsity, where threat-indicative nodes (e.g., obfuscated scripts) are isolated amid benign content, causing signal dilution during graph aggregation; (3) unidirectional analysis, which ignores bidirectional interaction between URL and HTML features; and (4) opaque decisions, which lack attribution to malicious DOM components. To address these challenges, we present WebGuard++, a detection framework with four novel components: (1) Cross-scale URL Encoder: hierarchically learns local-to-global and coarse-to-fine URL features based on a Transformer network with dynamic convolution. (2) Subgraph-aware HTML Encoder: Decomposes…

Taxonomy

Topics: Spam and Phishing Detection · Web Data Mining and Analysis · Misinformation and Its Impacts

Methods: Layer Normalization · Dropout · Absolute Position Encodings · Dense Connections · Byte Pair Encoding · Softmax · Label Smoothing · Transformer · Umbrella Reinforcement Learning