Topical Discovery of Web Content

Giancarlo Crocetti

arXiv:1507.02002·cs.IR·July 9, 2015

Open Access

TL;DR

This paper introduces the Web Topical Discovery System (WTDS), a software tool designed for automatic discovery and selection of web pages relevant to specific research topics, improving relevance and reducing false positives.

Contribution

It presents a new approach and implementation for automatically discovering web content relevant to specific topics, including techniques for filtering extraneous data and analyzing content richness.

Findings

01 Effective removal of extraneous data from web pages

02 Improved relevance in web page discovery

03 Enhanced filtering of false positives

Abstract

This work describes the theory and implementation of a new software tool, the "Web Topical Discovery System" (WTDS), which automatically discovers and selects new web pages relevant to specific analytical needs. The research context is specified with search keywords related to the area of interest. We also consider the important problem of removing extraneous data from a web page that contains an article, in order to minimize false positives, such as a keyword match that occurs only in the "latest news" box of the same page. Duplicate removal, the richness of the information contained in the article, and its lexical diversity are all taken into account to provide the best set of recommendations to the end user or system.
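
To make the selection steps named in the abstract concrete, here is a minimal Python sketch. It is not the WTDS implementation: the function names, the type-token ratio used as the lexical diversity measure, the hash-based exact-duplicate check, and the `min_diversity` threshold are all assumptions chosen for illustration. It also assumes extraneous page elements (navigation, "latest news" boxes) have already been stripped, so each input string is the article body only.

```python
import hashlib
import re


def tokenize(text):
    """Lowercase word tokens; a deliberately simple tokenizer (assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def lexical_diversity(text):
    """Type-token ratio: distinct words / total words (0.0 for empty text)."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def matches_keywords(article_text, keywords):
    """True if any single-word keyword occurs as a token in the article body."""
    tokens = set(tokenize(article_text))
    return any(k.lower() in tokens for k in keywords)


def drop_duplicates(articles):
    """Keep the first article for each distinct normalized token sequence."""
    seen, unique = set(), []
    for text in articles:
        normalized = " ".join(tokenize(text))
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def recommend(articles, keywords, min_diversity=0.3):
    """Deduplicate, keep keyword matches above a diversity threshold,
    and return them ordered from most to least lexically diverse."""
    candidates = [
        a for a in drop_duplicates(articles)
        if matches_keywords(a, keywords)
        and lexical_diversity(a) >= min_diversity
    ]
    return sorted(candidates, key=lexical_diversity, reverse=True)
```

For example, calling `recommend` with the keyword `"solar"` on a list holding a repetitive page, a short article about solar panels, and an unrelated article about wind farms would keep only the solar panel article. Matching keywords against the cleaned article body rather than the whole page is what prevents the sidebar false positives the abstract describes.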

Taxonomy

Topics: Web Data Mining and Analysis · Advanced Text Analysis Techniques · Semantic Web and Ontologies