Unveiling the semantic structure of text documents using paragraph-aware Topic Models
Simón Roca-Sotelo, Jerónimo Arenas-García

TL;DR
This paper introduces paragraph-aware topic models that leverage document structure to distinguish between general and specific topics, improving topic diversity and highlighting relevant paragraphs in structured texts.
Contribution
It proposes a novel approach that incorporates paragraph structure into topic modeling, enabling the identification of general and specific topics within documents.
Findings
Enhanced ability to highlight relevant paragraphs in structured documents
Learned more diverse and meaningful topics
Effective differentiation between general and specific concepts
Abstract
Classic Topic Models are built under the Bag Of Words assumption, in which word position is ignored for simplicity. In addition, most applications use symmetric priors. To make it easier to learn topics with different properties within the same corpus, we propose a new line of work that exploits the paragraph structure. Our proposal is based on the following assumption: in many text document corpora there are formal constraints shared across the whole collection, e.g. sections. When this assumption is satisfied, some paragraphs may be related to general concepts shared by all documents in the corpus, while others contain the genuine description of each document. Assuming each paragraph can be semantically more general, specific, or hybrid, we look for ways to measure this, transferring this distinction to topics and being able to learn what we call specific and…
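To make the general-versus-specific distinction concrete, the sketch below scores each paragraph by the mean document frequency of its words: paragraphs built from vocabulary shared by every document score near 1, while paragraphs with document-specific vocabulary score lower. This is only an illustrative heuristic and not the measure proposed in the paper; the function names `tokenize` and `generality_scores` and the toy corpus are assumptions introduced here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def generality_scores(corpus):
    """Score paragraphs by the mean document frequency of their words.

    corpus: list of documents, each a list of paragraph strings.
    Returns one list of scores per document, each score in [0, 1].
    NOTE: hypothetical heuristic for illustration, not the paper's method.
    """
    n_docs = len(corpus)

    # Count in how many documents each word appears at least once.
    doc_freq = Counter()
    for doc in corpus:
        vocab = set()
        for paragraph in doc:
            vocab.update(tokenize(paragraph))
        doc_freq.update(vocab)

    scores = []
    for doc in corpus:
        doc_scores = []
        for paragraph in doc:
            tokens = tokenize(paragraph)
            if not tokens:
                doc_scores.append(0.0)  # empty paragraph: no evidence
                continue
            mean_df = sum(doc_freq[t] / n_docs for t in tokens) / len(tokens)
            doc_scores.append(mean_df)
        scores.append(doc_scores)
    return scores


# Toy corpus (made up): the first paragraph is shared boilerplate,
# the second describes each document's own content.
corpus = [
    ["objectives and budget of the project",
     "graphene sensors for glucose monitoring"],
    ["objectives and budget of the project",
     "topic models for patent retrieval"],
]
scores = generality_scores(corpus)
for doc_scores in scores:
    print([round(s, 2) for s in doc_scores])
```

In this toy corpus the shared first paragraph receives the maximum score in both documents, and the content paragraphs score lower. A paragraph-aware topic model would go further by transferring this distinction from paragraphs to topics.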
Taxonomy
Topics: Topic Modeling · Advanced Text Analysis Techniques · Data Quality and Management
