Low cost page quality factors to detect web spam

Ashish Chandra, Mohammad Suaib, and Dr. Rizwan Beg

arXiv:1410.2085·cs.IR·October 9, 2014·5 cites

Open Access

TL;DR

This paper introduces 32 low-cost, real-time web spam detection features across URL, content, and link categories, utilizing a neural network classifier for improved search engine result quality.

Contribution

The paper presents a novel set of 32 lightweight features and a neural network-based classifier for real-time web spam detection, enhancing search engine accuracy.

Findings

01

The proposed classifier obtained good accuracy.

02

Computing the features requires very few CPU resources, which allows real-time use.

03

Effective detection of spam versus legitimate pages.

Abstract

Web spam is a major challenge to the quality of search engine results, so it is important for search engines to detect it accurately. In this paper we present 32 low-cost quality factors for classifying spam and ham pages in real time. These features fall into three categories: (i) URL features, (ii) content features, and (iii) link features. We developed a classifier using the Resilient Back-propagation learning algorithm for neural networks and obtained good accuracy. This classifier can be applied to search engine results in real time because calculating these features requires very few CPU resources.
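
To make the idea of a "low-cost" URL feature concrete, here is a minimal Python sketch. The abstract does not enumerate the 32 features, so the function name `url_features` and every feature it computes are hypothetical illustrations rather than the authors' feature set. The point is only that such values can be derived from the URL string alone, with no page fetch and no link-graph traversal.

```python
from urllib.parse import urlsplit


def url_features(url: str) -> dict:
    """Compute a few cheap, illustrative URL features.

    ASSUMPTION: these features are hypothetical examples; the paper's
    actual 32 features are not listed in the abstract.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""   # lower-cased host name, "" if absent
    path = parts.path or ""
    return {
        # Total number of characters in the URL string.
        "url_length": len(url),
        # Number of characters in the host name.
        "host_length": len(host),
        # Dots in the host name (a rough proxy for subdomain depth).
        "num_dots_in_host": host.count("."),
        # Hyphens in the host name.
        "num_hyphens_in_host": host.count("-"),
        # Digits in the host name.
        "num_digits_in_host": sum(ch.isdigit() for ch in host),
        # Number of non-empty path segments.
        "path_depth": len([seg for seg in path.split("/") if seg]),
    }


if __name__ == "__main__":
    # example.com is a reserved documentation domain; the URL is made up.
    print(url_features("http://cheap-pills-4u.example.com/buy/now/index.html"))
```

Each value is a single pass over a short string, which is the kind of cost profile the abstract describes. A real system would feed a vector of such values, together with content and link features, into the trained classifier.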

Taxonomy

Topics: Spam and Phishing Detection · Text and Document Classification Technologies · Web Data Mining and Analysis