fLSA: Learning Semantic Structures in Document Collections Using Foundation Models

Weijia Xu; Nebojsa Jojic; Nicolas Le Roux

arXiv:2410.05481 · cs.LG · August 27, 2025

TL;DR

fLSA is a novel method that uses foundation models to induce high-level semantic structures in documents through iterative clustering and tagging, improving text reconstruction and hierarchical sampling for reasoning tasks.

Contribution

The paper introduces fLSA, a new approach leveraging foundation models for semantic structure induction and hierarchical sampling in document collections.

Findings

01. fLSA tags are more informative for reconstructing the original texts than existing tagging methods.

02. Hierarchical sampling with fLSA tags increases the likelihood of reaching correct solutions.

03. fLSA models the latent structure of documents across story writing, math, and multi-step reasoning datasets.

Abstract

Humans can learn to solve new tasks by inducing high-level strategies from example solutions to similar problems and then adapting these strategies to solve unseen problems. Can we use large language models to induce such high-level structure from example documents or solutions? We introduce fLSA, a foundation-model-based Latent Semantic Analysis method that iteratively clusters and tags document segments based on document-level contexts. These tags can be used to model the latent structure of given documents and for hierarchical sampling of new texts. Our experiments on story writing, math, and multi-step reasoning datasets demonstrate that fLSA tags are more informative in reconstructing the original texts than existing tagging methods. Moreover, when used for hierarchical sampling, fLSA tags help expand the output space in the right directions that lead to correct solutions more…
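The iterative cluster-and-tag loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the two foundation-model calls are replaced by hypothetical word-overlap stand-ins (`describe_tag` and `assign_tag`), the document-level context is omitted, and the round-robin initialisation and convergence check are assumptions made only to keep the example self-contained.

```python
from collections import Counter


def describe_tag(segments):
    """Hypothetical stand-in for a foundation-model call that summarises
    a cluster of segments into a tag description.
    Here: the set of the three most frequent words in the cluster."""
    counts = Counter(w for s in segments for w in s.lower().split())
    return {w for w, _ in counts.most_common(3)}


def assign_tag(segment, tags):
    """Hypothetical stand-in for a foundation-model call that picks the
    best tag for a segment.
    Here: the index of the tag sharing the most words with the segment."""
    words = set(segment.lower().split())
    return max(range(len(tags)), key=lambda k: len(words & tags[k]))


def iterative_tagging(segments, num_tags, num_iters=5):
    """Alternate between re-describing tags and re-assigning segments."""
    # Assumed initialisation: spread segments over tags round-robin.
    assignments = [i % num_tags for i in range(len(segments))]
    tags = []
    for _ in range(num_iters):
        # Tagging step: rebuild each tag description from its members.
        tags = [
            describe_tag([s for s, a in zip(segments, assignments) if a == k])
            for k in range(num_tags)
        ]
        # Clustering step: move each segment to its best-matching tag.
        new_assignments = [assign_tag(s, tags) for s in segments]
        if new_assignments == assignments:
            break  # assignments stopped changing
        assignments = new_assignments
    return assignments, tags
```

Calling `iterative_tagging(["add the two numbers", "the hero leaves home"], num_tags=2)` returns one tag index per segment together with the current tag descriptions. In a real setting both stand-ins would be prompts to a foundation model, and the assignment prompt would also include the surrounding document.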

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Topic Modeling