Entry-Wise Eigenvector Analysis and Improved Rates for Topic Modeling on Short Documents

Zheng Tracy Ke; Jingming Wang

arXiv:2405.17806 · math.ST · May 29, 2024

Open Access

TL;DR

This paper establishes the optimal estimation rate for topic models on short documents, introduces entry-wise bounds for empirical singular vectors, and improves spectral algorithm performance to match the theoretical limit.

Contribution

It provides new entry-wise large-deviation bounds for the empirical singular vectors of a topic model and shows that the optimal rate in the short-document case is the same as in the long-document case, $p / (N n)$.

Findings

01

Optimal rate for short documents is $p / (N n)$

02

Improved spectral algorithm matches the minimax lower bound in short-document case

03

Entry-wise bounds enhance understanding of empirical singular vectors in topic models

Abstract

Topic modeling is a widely utilized tool in text analysis. We investigate the optimal rate for estimating a topic model. Specifically, we consider a scenario with $n$ documents, a vocabulary of size $p$, and document lengths at the order $N$. When $N \geq c \cdot p$, referred to as the long-document case, the optimal rate is established in the literature at $p / (N n)$. However, when $N = o(p)$, referred to as the short-document case, the optimal rate remains unknown. In this paper, we first provide new entry-wise large-deviation bounds for the empirical singular vectors of a topic model. We then apply these bounds to improve the error rate of a spectral algorithm, Topic-SCORE. Finally, by comparing the improved error rate with the minimax lower bound, we conclude that the optimal rate is still $p / (N n)$ in the short-document case.
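
To make the setting in the abstract concrete, the following is a minimal Python sketch, not the authors' code and not an implementation of Topic-SCORE. It simulates a corpus with $n$ documents, vocabulary size $p$, and document length $N$, computes the top-$K$ left singular vectors of the empirical word-frequency matrix, and measures their largest entry-wise deviation from the population singular vectors. The Dirichlet sampling of the topic and weight matrices, the number of topics `K`, the absence of any pre-SVD normalization, and all function names are assumptions made purely for illustration.

```python
import numpy as np


def simulate_corpus(p, n, N, K, seed=0):
    """Draw a toy corpus: n documents, vocabulary size p, N words each.

    The Dirichlet choices for A and W are illustrative assumptions only.
    """
    rng = np.random.default_rng(seed)
    A = rng.dirichlet(np.ones(p), size=K).T   # p x K topic matrix, columns sum to 1
    W = rng.dirichlet(np.ones(K), size=n).T   # K x n topic weights, columns sum to 1
    D0 = A @ W                                # p x n population word frequencies
    D0 = D0 / D0.sum(axis=0, keepdims=True)   # guard against floating-point drift
    counts = np.column_stack(
        [rng.multinomial(N, D0[:, i]) for i in range(n)]
    )
    D = counts / N                            # p x n empirical word frequencies
    return A, W, D0, D


def top_left_singular_vectors(M, K):
    """Return the K leading left singular vectors of M as a p x K matrix."""
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :K]


def entrywise_error(U_hat, U):
    """Largest absolute entry of U_hat @ O - U, where O is the orthogonal
    matrix that best aligns U_hat to U (orthogonal Procrustes)."""
    X, _, Yt = np.linalg.svd(U_hat.T @ U)
    O = X @ Yt
    return float(np.max(np.abs(U_hat @ O - U)))


def nominal_rate(p, N, n):
    """The rate p / (N n) quoted in the abstract, without constants or log factors."""
    return p / (N * n)
```

For example, `simulate_corpus(p=50, n=200, N=20, K=3)` gives a short-document configuration in the sense that $N < p$; passing the returned `D` and `D0` through `top_left_singular_vectors` and then `entrywise_error` gives one empirical look at how far individual entries of the singular vectors move under sampling noise. This is only a numerical illustration of the quantities involved, not a check of the paper's bounds.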

Taxonomy

Topics: Computational and Text Analysis Methods · Topic Modeling