Multi-View Learning for Web Spam Detection

Ali Hadian; Behrouz Minaei-Bidgoli

arXiv:1305.3814 · cs.IR · July 25, 2013 · 1 cites

TL;DR

This paper proposes a multi-view learning approach for web spam detection that combines multiple feature-based classifiers to improve accuracy and scalability, achieving a 22% increase in AUC.

Contribution

It introduces a multi-view classification system in which base classifiers, each trained on a distinct feature set (view), are combined by a learned fusion model to improve web spam detection performance and efficiency.

Findings

01. Multi-view learning improves spam classification AUC by 22%.

02. The system achieves linear speedup with parallel execution.

03. Each web page can be classified with satisfactory accuracy using only its own HTML content.

Abstract

Spam pages are designed to maliciously appear among the top search results through excessive use of popular terms. Therefore, spam pages should be removed using an effective and efficient spam detection system. Previous methods for web spam classification used several features from various information sources (page contents, web graph, access logs, etc.) to detect web spam. In this paper, we follow a page-level classification approach to build fast and scalable spam filters. We show that each web page can be classified with satisfactory accuracy using only its own HTML content. To design a multi-view classification system, we use state-of-the-art spam classification methods with distinct feature sets (views) as the base classifiers. A fusion model is then learned to combine the outputs of the base classifiers and make the final prediction. Results show that multi-view learning…
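The fusion step described in the abstract can be illustrated with a minimal stacking sketch. This is not the authors' implementation: the view names (`term_view`, `markup_view`), the synthetic data, and the use of logistic regression for both the base classifiers and the fusion model are assumptions made purely for illustration. Each base classifier sees only its own view, and the fusion model is trained on the base classifiers' outputs for pages they were not fitted on.

```python
import math
import random

VIEWS = ["term_view", "markup_view"]  # hypothetical view names, not from the paper


def predict_proba(model, x):
    """Spam probability under a logistic model given as (weights, bias)."""
    weights, bias = model
    z = bias + sum(w * v for w, v in zip(weights, x))
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, epochs=300, lr=0.5):
    """Fit logistic regression by batch gradient descent."""
    n_features = len(rows[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(rows, labels):
            err = predict_proba((weights, bias), x) - y
            for j in range(n_features):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def auc(scores, labels):
    """Area under the ROC curve via pairwise ranking (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def make_pages(n, seed=0):
    """Synthetic pages: two features per view, shifted by the label (1 = spam)."""
    rng = random.Random(seed)
    pages = []
    for i in range(n):
        y = i % 2
        page = {v: [rng.gauss(y, 1.0), rng.gauss(y, 1.0)] for v in VIEWS}
        pages.append((page, y))
    return pages


pages = make_pages(600)
base_part, fusion_part, test_part = pages[:200], pages[200:400], pages[400:]

# One base classifier per view, each trained on its own feature set only.
base_models = {
    v: train_logistic([p[v] for p, _ in base_part], [y for _, y in base_part])
    for v in VIEWS
}


def base_outputs(page):
    """Outputs of all base classifiers for one page, in VIEWS order."""
    return [predict_proba(base_models[v], page[v]) for v in VIEWS]


# The fusion model learns how to combine the base classifiers' outputs.
fusion_model = train_logistic(
    [base_outputs(p) for p, _ in fusion_part], [y for _, y in fusion_part]
)

test_labels = [y for _, y in test_part]
fused_scores = [predict_proba(fusion_model, base_outputs(p)) for p, _ in test_part]
fused_auc = auc(fused_scores, test_labels)

for v in VIEWS:
    view_scores = [predict_proba(base_models[v], p[v]) for p, _ in test_part]
    print(v, "AUC:", round(auc(view_scores, test_labels), 3))
print("fused AUC:", round(fused_auc, 3))
```

Because each base classifier depends only on its own view, the base models could in principle be trained and evaluated independently of one another; the numbers printed here come from synthetic data and say nothing about the results reported in the paper.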

Taxonomy

Topics: Spam and Phishing Detection · Web Data Mining and Analysis · Text and Document Classification Technologies