Building Probabilistic Models for Natural Language

Stanley F. Chen (Harvard University)

arXiv:cmp-lg/9606014 · cmp-lg · February 3, 2008 · 93 cites

Open Access

TL;DR

This thesis improves probabilistic language models at three levels of language: smoothing for n-gram models (word level), statistical grammar induction (constituent level), and bilingual sentence alignment (sentence level). It addresses two central issues in probabilistic modeling: sparse data and the induction of hidden structure.

Contribution

It introduces improved techniques for each of the three problems, reports better performance than existing algorithms on each, and relates the three frameworks used to the Bayesian paradigm.

Findings

01. Enhanced smoothing techniques for n-gram models

02. More accurate statistical grammar induction methods

03. Improved bilingual sentence alignment algorithms

Abstract

In this thesis, we investigate three problems involving the probabilistic modeling of language: smoothing n-gram models, statistical grammar induction, and bilingual sentence alignment. These three problems employ models at three different levels of language; they involve word-based, constituent-based, and sentence-based models, respectively. We describe techniques for improving the modeling of language at each of these levels, and surpass the performance of existing algorithms for each problem. We approach the three problems using three different frameworks. We relate each of these frameworks to the Bayesian paradigm, and show why each framework used was appropriate for the given problem. Finally, we show how our research addresses two central issues in probabilistic modeling: the sparse data problem and the problem of inducing hidden structure.

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Algorithms and Data Compression