Local and Global Topics in Text Modeling of Web Pages Nested in Web Sites
Jason Wang, Robert E. Weiss

TL;DR
This paper introduces hierarchical local and global topic models for nested web page collections, enabling explicit identification of site-specific topics and improving health topic coverage analysis.
Contribution
It proposes a hierarchical local topic model that explicitly labels local topics and identifies their owning web sites, enhancing analysis of nested web page collections.
Findings
Local topics are unique to individual web sites.
Hierarchical models improve identification of site-specific topics.
Application to US health web sites reveals local and global topic coverage.
Abstract
Topic models are popular models for analyzing a collection of text documents. The models assert that documents are distributions over latent topics and latent topics are distributions over words. A nested document collection is where documents are nested inside a higher order structure such as stories in a book, articles in a journal, or web pages in a web site. In a single collection of documents, topics are global, or shared across all documents. For web pages nested in web sites, topic frequencies likely vary between web sites. Within a web site, topic frequencies almost certainly vary between web pages. A hierarchical prior for topic frequencies models this hierarchical structure and specifies a global topic distribution. Web site topic distributions vary around the global topic distribution and web page topic distributions vary around the web site topic distribution. In a nested…
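The nested structure described in the abstract (a global topic distribution, web site distributions varying around it, and web page distributions varying around their site) can be illustrated with a small simulation. This is only a minimal sketch: the abstract does not state which distributions the hierarchical prior uses, so the Dirichlet draws, the concentration parameters `site_conc` and `page_conc`, and the function name are all assumptions made for illustration, not the authors' specification.

```python
import numpy as np

def sample_nested_topic_distributions(n_topics, pages_per_site,
                                      site_conc=50.0, page_conc=10.0, seed=0):
    """Draw global, per-site, and per-page topic distributions.

    Assumed (hypothetical) generative scheme:
      global ~ Dirichlet(1, ..., 1)
      site   ~ Dirichlet(site_conc * global)
      page   ~ Dirichlet(page_conc * site)
    Larger concentration values keep the child closer to its parent.
    """
    rng = np.random.default_rng(seed)
    eps = 1e-8  # floor so Dirichlet parameters stay strictly positive

    global_dist = rng.dirichlet(np.ones(n_topics))

    site_dists = []
    page_dists = []
    for n_pages in pages_per_site:
        # Site-level distribution is centered on the global distribution.
        site = rng.dirichlet(np.maximum(site_conc * global_dist, eps))
        # Page-level distributions are centered on their own site's distribution.
        pages = rng.dirichlet(np.maximum(page_conc * site, eps), size=n_pages)
        site_dists.append(site)
        page_dists.append(pages)

    return global_dist, site_dists, page_dists

# Example: 5 topics, three web sites with 4, 2, and 6 pages.
global_dist, site_dists, page_dists = sample_nested_topic_distributions(5, [4, 2, 6])
```

Each returned vector is a probability distribution over topics. Because each child's Dirichlet parameters are proportional to its parent, every site is centered on the global distribution and every page on its site, which is the "vary around" relationship in the abstract; the sketch does not model local topics.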
Taxonomy
Topics: Data Quality and Management · Computational and Text Analysis Methods
