WebGuard++: Interpretable Malicious URL Detection via Bidirectional Fusion of HTML Subgraphs and Multi-Scale Convolutional BERT
Ye Tian, Zhang Yumin, Yifan Jia, Jianguo Sun, Yanbin Wang

TL;DR
WebGuard++ is an interpretable framework for malicious URL detection that fuses URL and HTML features through hierarchical URL encoding, HTML subgraph analysis, and bidirectional interaction between the two, reporting 1.1x-7.9x higher TPR at fixed FPRs than baseline methods.
Contribution
The paper presents four innovative components that jointly enhance malicious URL detection by capturing comprehensive URL semantics, amplifying threat signals in HTML graphs, and enabling interpretable, bidirectional feature interaction.
Findings
Achieves 1.1x-7.9x higher TPR at fixed FPRs compared to baselines.
Effectively localizes malicious regions within HTML DOM structures.
Demonstrates robustness across multiple datasets.
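The first finding is stated as TPR at fixed FPRs. As a minimal sketch of how such a metric is commonly computed (not the authors' evaluation code; the function name `tpr_at_fpr` and the toy labels and scores are hypothetical), the detection threshold is swept from the highest score downward and the best true-positive rate is kept while the false-positive rate stays within the budget:

```python
from typing import Sequence


def tpr_at_fpr(labels: Sequence[int], scores: Sequence[float], max_fpr: float) -> float:
    """Return the highest TPR reachable by a score threshold whose FPR <= max_fpr.

    labels: 1 for malicious, 0 for benign (assumed convention).
    scores: higher means "more likely malicious".
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")

    # Sort by descending score; lowering the threshold admits samples in this order.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    tp = fp = 0
    best_tpr = 0.0
    i = 0
    while i < len(ranked):
        # Samples with tied scores fall on the same side of any threshold,
        # so they are admitted together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        # FPR only grows as the threshold drops, so stop once the budget is exceeded.
        if fp / negatives > max_fpr:
            break
        best_tpr = tp / positives
        i = j
    return best_tpr


# Toy data, invented purely for illustration.
toy_labels = [1, 1, 0, 1, 0, 0, 1, 0]
toy_scores = [0.95, 0.9, 0.85, 0.8, 0.6, 0.4, 0.3, 0.1]
print(tpr_at_fpr(toy_labels, toy_scores, max_fpr=0.25))
```

Comparing two detectors at the same `max_fpr` gives the kind of TPR ratio quoted above; in practice the FPR budget is usually set very low, since false alarms on benign pages are costly.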
Abstract
URL+HTML feature fusion shows promise for robust malicious URL detection, since attacker artifacts persist in DOM structures. However, prior work suffers from four critical shortcomings: (1) incomplete URL modeling, failing to jointly capture lexical patterns and semantic context; (2) HTML graph sparsity, where threat-indicative nodes (e.g., obfuscated scripts) are isolated amid benign content, causing signal dilution during graph aggregation; (3) unidirectional analysis, ignoring bidirectional interaction between URL and HTML features; and (4) opaque decisions, lacking attribution to malicious DOM components. To address these challenges, we present WebGuard++, a detection framework with four novel components: 1) Cross-scale URL Encoder: Hierarchically learns local-to-global and coarse-to-fine URL features based on a Transformer network with dynamic convolution. 2) Subgraph-aware HTML Encoder: Decomposes…
Taxonomy
Topics: Spam and Phishing Detection · Web Data Mining and Analysis · Misinformation and Its Impacts
Methods: Layer Normalization · Dropout · Absolute Position Encodings · Dense Connections · Byte Pair Encoding · Softmax · Label Smoothing · Transformer · Umbrella Reinforcement Learning
