Large scale link based latent Dirichlet allocation for web document classification
István Bíró, Jácint Szabó

TL;DR
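This paper introduces a link-aware LDA model for web document classification in which topics propagate along links, so that linked documents influence the words of the linking document. It also develops LDA-specific speedups of Gibbs sampling, and the link weights the model yields can be reused in algorithms that process the Web graph.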
Contribution
The paper presents a new influence model integrating link information into LDA and develops faster Gibbs sampling methods, enhancing web document classification performance.
Findings
Achieved 4% AUC improvement over plain LDA with BayesNet
Achieved 18% AUC improvement over tf.idf with SVM
Gibbs sampling speed increased by 5-10 times with minimal accuracy loss
Abstract
In this paper we demonstrate the applicability of latent Dirichlet allocation (LDA) for classifying large Web document collections. One of our main results is a novel influence model that gives a fully generative model of the document content taking linkage into account. In our setup, topics propagate along links in such a way that linked documents directly influence the words in the linking document. As another main contribution we develop LDA specific boosting of Gibbs samplers resulting in a significant speedup in our experiments. The inferred LDA model can be applied for classification as dimensionality reduction similarly to latent semantic indexing. In addition, the model yields link weights that can be applied in algorithms to process the Web graph; as an example we deploy LDA link weights in stacked graphical learning. By using Weka's BayesNet classifier, in terms of the AUC of…
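To make the sampling and dimensionality-reduction steps mentioned in the abstract concrete, here is a minimal sketch of a collapsed Gibbs sampler for plain LDA. It is not the authors' linked influence model and includes none of their speedups; the function name `lda_gibbs`, the hyperparameter defaults, and the toy corpus are assumptions made for illustration only.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for plain LDA (no link information).

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    Returns theta, the per-document topic distributions (each row sums to 1).
    """
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]                # document-topic counts
    n_kw = [[0] * vocab_size for _ in range(n_topics)]   # topic-word counts
    n_k = [0] * n_topics                                 # tokens assigned to each topic
    z = []                                               # topic assignment of every token

    # Random initialisation of the topic assignments.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the current token from all counts.
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Full conditional p(z = t | all other assignments), unnormalised.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                # Record the newly sampled topic.
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    # Smoothed point estimate of each document's topic mixture.
    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + n_topics * alpha
        theta.append([(n_dk[d][t] + alpha) / denom for t in range(n_topics)])
    return theta


# Toy corpus (hypothetical): word ids 0-1 and 2-3 tend to occur in different documents.
docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 3], [0, 0, 1, 1]]
theta = lda_gibbs(docs, n_topics=2, vocab_size=4, n_iter=50)
```

Each row of `theta` is a low-dimensional representation of one document. Under the abstract's "dimensionality reduction" use, rows like these would serve as feature vectors for a downstream classifier in place of the raw term vector.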
Taxonomy
Topics: Text and Document Classification Technologies · Advanced Clustering Algorithms Research · Complex Network Analysis Techniques
