
TL;DR
This paper introduces the Web Topical Discovery System (WTDS), a software tool designed for automatic discovery and selection of web pages relevant to specific research topics, improving relevance and reducing false positives.
Contribution
It presents a new approach and implementation for automatically discovering web content relevant to specific topics, including techniques for filtering extraneous data and analyzing content richness.
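As a rough illustration of what filtering extraneous data can mean in practice, the sketch below matches keywords only against text inside an article container and ignores sidebars and navigation boxes. It is not the WTDS implementation: the reliance on `<article>`, `<aside>`, and `<nav>` tags and the names `ArticleTextExtractor`, `article_text`, and `matches_topic` are assumptions made for this example, and real pages often need more robust content extraction.

```python
import re
from html.parser import HTMLParser


class ArticleTextExtractor(HTMLParser):
    """Collect text inside <article>, skipping side boxes (assumed markup)."""

    # Tags treated as extraneous for this sketch; a real system would need
    # site-specific or statistical rules instead of a fixed tag list.
    SKIP_TAGS = {"aside", "nav", "script", "style"}

    def __init__(self):
        super().__init__()
        self.article_depth = 0
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth:
            self.article_depth -= 1
        elif tag in self.SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when inside an article and outside any skipped box.
        if self.article_depth and not self.skip_depth:
            self.chunks.append(data)


def article_text(html):
    """Return the whitespace-normalized text of the article body."""
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def matches_topic(html, keywords):
    """True if any keyword occurs as a whole word in the article body."""
    text = article_text(html).lower()
    return any(
        re.search(r"\b" + re.escape(k.lower()) + r"\b", text) for k in keywords
    )
```

With this approach, a keyword that appears only in a "latest news" sidebar does not count as a match, while the same keyword in the article body does.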
Findings
Effective removal of extraneous data from web pages
Improved relevance in web page discovery
Enhanced filtering of false positives
Abstract
This work describes the theory and implementation of a new software tool, the "Web Topical Discovery System" (WTDS), which provides an approach to the automatic discovery and selection of new web pages relevant to specific analytical needs. We show how the research context can be specified with search keywords related to the area of interest. We also consider the important problem of removing extraneous data from a web page that contains an article, in order to minimize false positives such as a keyword match that appears only in the "latest news" box of the same page. Duplicate removal, the richness of the information contained in the article, and its lexical diversity are all taken into account in order to provide the best set of recommendations to the end user or system.
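The abstract does not say how duplicates, information richness, or lexical diversity are measured. The sketch below is one possible reading, under stated assumptions: lexical diversity is taken as the type-token ratio (unique words divided by total words), richness is approximated by a minimum word count, and duplicates are detected by hashing the normalized text. The function names and the thresholds `min_tokens` and `min_diversity` are hypothetical.

```python
import hashlib
import re


def tokens(text):
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


def lexical_diversity(text):
    """Type-token ratio: unique tokens / total tokens (0.0 for empty text)."""
    toks = tokens(text)
    return len(set(toks)) / len(toks) if toks else 0.0


def fingerprint(text):
    """Hash of the normalized token sequence, used to spot exact duplicates."""
    normalized = " ".join(tokens(text))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def recommend(articles, min_tokens=5, min_diversity=0.5):
    """Keep articles that are long enough, varied enough, and not yet seen.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    seen = set()
    kept = []
    for text in articles:
        if len(tokens(text)) < min_tokens:
            continue  # too little content to be informative
        if lexical_diversity(text) < min_diversity:
            continue  # highly repetitive text
        fp = fingerprint(text)
        if fp in seen:
            continue  # duplicate after case and punctuation normalization
        seen.add(fp)
        kept.append(text)
    return kept
```

Hashing catches only duplicates that are identical after normalization; near-duplicates, such as the same article with a different byline, would need a similarity measure instead.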
Taxonomy
Topics: Web Data Mining and Analysis · Advanced Text Analysis Techniques · Semantic Web and Ontologies
