Efficient Induction of Language Models Via Probabilistic Concept Formation
Christopher J. MacLellan, Peter Matsakis, Pat Langley

TL;DR
This paper introduces three extensions to the Cobweb system for incremental, probabilistic language model induction, enabling online processing of sequential language data and improving synonym grouping and homonym separation.
Contribution
The paper develops Word, Leaf, and Path variants of Cobweb that encode language context and update hierarchies incrementally, adapting a taxonomic approach to sequential language learning.
Findings
Synonyms are grouped together effectively
Homonyms are kept apart successfully
Training efficiency is improved
Abstract
This paper presents a novel approach to the acquisition of language models from corpora. The framework builds on Cobweb, an early system for constructing taxonomic hierarchies of probabilistic concepts that used a tabular, attribute-value encoding of training cases and concepts, making it unsuitable for sequential input like language. In response, we explore three new extensions to Cobweb -- the Word, Leaf, and Path variants. These systems encode each training case as an anchor word and surrounding context words, and they store probabilistic descriptions of concepts as distributions over anchor and context information. As in the original Cobweb, a performance element sorts a new instance downward through the hierarchy and uses the final node to predict missing features. Learning is interleaved with performance, updating concept probabilities and hierarchy structure as classification…
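The anchor-and-context encoding described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `window` parameter, the `make_instances` helper, and the `Concept` class are hypothetical names introduced here, and the sketch keeps only a single flat concept node. It leaves out the sorting of instances down a hierarchy and the restructuring of that hierarchy, which the abstract attributes to Cobweb and its variants.

```python
from collections import Counter


def make_instances(tokens, window=2):
    """Turn a token list into (anchor, context) training cases.

    `window` is an assumed parameter: the number of context words
    taken on each side of the anchor word.
    """
    instances = []
    for i, anchor in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        instances.append({"anchor": anchor, "context": Counter(left + right)})
    return instances


class Concept:
    """A probabilistic concept stored as counts over anchor and context words."""

    def __init__(self):
        self.count = 0
        self.anchor_counts = Counter()
        self.context_counts = Counter()

    def update(self, instance):
        # Incremental step: fold one training case into the stored counts.
        self.count += 1
        self.anchor_counts[instance["anchor"]] += 1
        self.context_counts.update(instance["context"])

    def p_anchor(self, word):
        # Relative frequency of `word` as the anchor among stored cases.
        return self.anchor_counts[word] / self.count if self.count else 0.0


tokens = "the cat sat on the mat".split()
instances = make_instances(tokens, window=1)

root = Concept()
for inst in instances:
    root.update(inst)  # one case at a time, as in online learning
```

Storing counts rather than normalized probabilities keeps each update cheap, since a new case only increments a few entries and probabilities are computed on demand.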
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Advanced Text Analysis Techniques
