A Robust Text Processing Technique Applied to Lexical Error Recovery

Peter Ingels (Linköping University, Sweden)

arXiv:cmp-lg/9702003·cmp-lg·September 25, 2009·39 cites

TL;DR

This paper presents CTR, a unified framework combining language models and typing behavior to automatically correct lexical errors and improve tokenization in corrupt text inputs, demonstrated on dialogue and transcription data.

Contribution

It introduces a novel integrated approach using Hidden Markov Models and weak language models within a Token Passing framework for robust lexical error recovery.

Findings

01 High correction accuracy for segmentation errors

02 Effective correction of misspellings and real-word errors

03 Minimal introduction of noise during correction

Abstract

This thesis addresses automatic lexical error recovery and tokenization of corrupt text input. We propose a technique that automatically corrects misspellings, segmentation errors, and real-word errors in a unified framework. The framework uses both a model of language production and a model of typing behavior, and it makes tokenization part of the recovery process. The typing process is modeled as a noisy channel, with Hidden Markov Models capturing the channel characteristics. Weak statistical language models predict which sentences are likely to be transmitted through the channel. These components are held together in the Token Passing framework, which provides the desired tight coupling between orthographic pattern matching and linguistic expectation. The system, CTR (Connected Text Recognition), has been tested on two corpora derived from two different…
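The noisy-channel idea in the abstract can be illustrated with a toy sketch. This is not CTR: it swaps the paper's Hidden Markov Models for a plain edit-distance channel, the weak language model for a unigram table, and Token Passing for a simple dynamic program over character positions. The vocabulary, the probabilities, the `ERR` constant, and the function names are all assumptions made up for illustration. The sketch does show the two ingredients the abstract names: a language score and a channel score combined in one search, with tokenization decided during recovery rather than before it.

```python
import math

# Hypothetical toy unigram language model (assumed values, not from the paper).
LM = {"i": 0.2, "want": 0.2, "to": 0.2, "book": 0.15, "a": 0.15, "flight": 0.1}
ERR = 0.01  # assumed probability of a single typing edit in the toy channel


def edit_distance(a, b):
    """Plain Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def channel_logp(typed, word):
    """Toy channel model: each edit costs log(ERR). Stands in for an HMM."""
    return edit_distance(typed, word) * math.log(ERR)


def recover(text, max_len=8):
    """Jointly re-tokenize and correct `text`.

    Spaces in the input are treated as unreliable and dropped, so word
    boundaries are chosen by the search itself.
    """
    chars = text.replace(" ", "")
    n = len(chars)
    # best[k] holds (log score, words) for the best analysis of chars[:k].
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            prev_score, prev_words = best[start]
            if prev_score == -math.inf:
                continue
            chunk = chars[start:end]
            for word, p in LM.items():
                score = prev_score + math.log(p) + channel_logp(chunk, word)
                if score > best[end][0]:
                    best[end] = (score, prev_words + [word])
    return best[n][1]
```

With these assumed values, `recover("iwan tto boook a flihgt")` returns `["i", "want", "to", "book", "a", "flight"]`: the misplaced spaces are ignored, and the two misspellings are mapped to the vocabulary words that best balance language probability against edit cost. A unigram model cannot catch real-word errors, which need context from neighboring words.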

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis