Building Probabilistic Models for Natural Language
Stanley F. Chen (Harvard University)

TL;DR
This thesis improves probabilistic language models for three problems: n-gram smoothing (word level), statistical grammar induction (constituent level), and bilingual sentence alignment (sentence level). In doing so it addresses two central issues in probabilistic modeling: sparse data and the induction of hidden structure.
Contribution
It introduces improved techniques for modeling language at the word, constituent, and sentence levels, reports performance surpassing existing algorithms on each of the three problems, and relates the three frameworks used to the Bayesian paradigm.
Findings
Enhanced smoothing techniques for n-gram models
More accurate statistical grammar induction methods
Improved bilingual sentence alignment algorithms
Abstract
In this thesis, we investigate three problems involving the probabilistic modeling of language: smoothing n-gram models, statistical grammar induction, and bilingual sentence alignment. These three problems employ models at three different levels of language; they involve word-based, constituent-based, and sentence-based models, respectively. We describe techniques for improving the modeling of language at each of these levels, and surpass the performance of existing algorithms for each problem. We approach the three problems using three different frameworks. We relate each of these frameworks to the Bayesian paradigm, and show why each framework used was appropriate for the given problem. Finally, we show how our research addresses two central issues in probabilistic modeling: the sparse data problem and the problem of inducing hidden structure.
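As a rough illustration of the sparse data problem mentioned above, the sketch below contrasts an unsmoothed (maximum-likelihood) bigram estimate with simple additive smoothing. This is a textbook baseline chosen for brevity, not the smoothing technique developed in the thesis; the toy corpus, the function names, and the `delta` parameter are all assumptions made for this example.

```python
from collections import Counter


def train_bigram_counts(tokens):
    """Count histories (every token except the last) and adjacent word pairs."""
    history_counts = Counter(tokens[:-1])
    bigram_counts = Counter(zip(tokens[:-1], tokens[1:]))
    return history_counts, bigram_counts


def mle_prob(prev, word, history_counts, bigram_counts):
    """Unsmoothed estimate: any bigram unseen in training gets probability 0."""
    if history_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / history_counts[prev]


def additive_prob(prev, word, history_counts, bigram_counts, vocab_size, delta=1.0):
    """Additive smoothing: pretend every bigram was seen `delta` extra times."""
    numerator = bigram_counts[(prev, word)] + delta
    denominator = history_counts[prev] + delta * vocab_size
    return numerator / denominator


# Hypothetical toy corpus, used only to make the example concrete.
tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))
history_counts, bigram_counts = train_bigram_counts(tokens)

# "the sat" never occurs in the corpus, so the unsmoothed estimate is zero.
unseen_mle = mle_prob("the", "sat", history_counts, bigram_counts)

# The smoothed estimate is small but nonzero.
unseen_smoothed = additive_prob("the", "sat", history_counts, bigram_counts, len(vocab))

# The smoothed probabilities for a given history still sum to one over the vocabulary.
total = sum(
    additive_prob("the", w, history_counts, bigram_counts, len(vocab)) for w in vocab
)
```

With only six training tokens, most word pairs are unseen, so the unsmoothed model assigns them zero probability. Smoothing redistributes some probability mass to those unseen events while keeping the distribution normalized.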
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Algorithms and Data Compression
