Toric grammars: a new statistical approach to natural language modeling
Olivier Catoni, Thomas Mainguy

TL;DR
This paper introduces a statistical language model built on a Markov chain over finite sets of sentences. The chain, called a communication model, randomly recombines the sentences in its current state using grammar rules, an approach that differs from traditional Markov and context-free models.
Contribution
It proposes a new Markov chain-based approach for language modeling that leverages grammar rules and explores its mathematical properties and relationship with context-free grammars.
Findings
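The language model is defined as the invariant probability measures of the chain on each recurrent communicating class.
When the grammar rules are fixed and known in advance, every state is recurrent, so the state space is partitioned into finite recurrent communicating classes.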
The approach offers a new perspective on language transmission modeling.
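To make the findings above concrete, here is a minimal toy sketch in Python. It is not the paper's communication model: the recombination rule (two sentences swap tails at a word they share), the lazy stay-put move, and all function names are assumptions chosen only to illustrate a Markov chain on finite sets of sentences, its finite communicating class, and the invariant measure on that class.

```python
from itertools import combinations

def moves(state):
    """One entry per allowed move from `state` (a sorted tuple of sentences).

    Assumed toy rule: two sentences that share a word swap tails at that word.
    A stay-put move is added so the toy chain is aperiodic.
    """
    out = [state]
    for i, j in combinations(range(len(state)), 2):
        a, b = state[i], state[j]
        for p in range(len(a)):
            for q in range(len(b)):
                if a[p] == b[q]:
                    rest = [s for k, s in enumerate(state) if k not in (i, j)]
                    new = rest + [a[:p] + b[q:], b[:q] + a[p:]]
                    out.append(tuple(sorted(new)))
    return out

def communicating_class(start):
    """All states reachable from `start` (depth-first exploration)."""
    seen, stack = {start}, [start]
    while stack:
        s = stack.pop()
        for t in moves(s):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return sorted(seen)

def invariant_measure(states, start, iters=200):
    """Power iteration from a point mass at `start`; each move is equally likely."""
    index = {s: k for k, s in enumerate(states)}
    succ = [[index[t] for t in moves(s)] for s in states]
    pi = [0.0] * len(states)
    pi[index[start]] = 1.0
    for _ in range(iters):
        nxt = [0.0] * len(states)
        for k, mass in enumerate(pi):
            for t in succ[k]:
                nxt[t] += mass / len(succ[k])
        pi = nxt
    return pi

sentences = ["I like tea", "you like coffee", "we like milk"]
start = tuple(sorted(tuple(s.split()) for s in sentences))
states = communicating_class(start)
pi = invariant_measure(states, start)
```

In this example the only shared word is "like", so each move exchanges the objects of two sentences. The reachable states are the six ways of pairing the three subjects with the three objects, every state can return to the starting one, and the computed measure is uniform over the six states.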
Abstract
We propose a new statistical model for computational linguistics. Rather than trying to directly estimate the probability distribution of a random sentence of the language, we define a Markov chain on finite sets of sentences with many finite recurrent communicating classes and define our language model as the invariant probability measures of the chain on each recurrent communicating class. This Markov chain, which we call a communication model, randomly recombines at each step the set of sentences forming its current state, using some grammar rules. When the grammar rules are fixed and known in advance instead of being estimated on the fly, we can prove supplementary mathematical properties. In particular, we can prove in this case that all states are recurrent states, so that the chain defines a partition of its state space into finite recurrent communicating classes. We show that our…
Taxonomy
Topics: Algorithms and Data Compression · Natural Language Processing Techniques · Semigroups and Automata Theory
