A Robust Text Processing Technique Applied to Lexical Error Recovery
Peter Ingels (Linköping University, Sweden)

TL;DR
This paper presents CTR, a unified framework combining language models and typing behavior to automatically correct lexical errors and improve tokenization in corrupt text inputs, demonstrated on dialogue and transcription data.
Contribution
The work integrates Hidden Markov Models and weak statistical language models within the Token Passing framework for robust lexical error recovery.
Findings
High correction accuracy for segmentation errors
Effective correction of misspellings and real-word errors
Minimal introduction of noise during correction
Abstract
This thesis addresses automatic lexical error recovery and tokenization of corrupt text input. We propose a technique that can automatically correct misspellings, segmentation errors and real-word errors in a unified framework that uses both a model of language production and a model of the typing behavior, and which makes tokenization part of the recovery process. The typing process is modeled as a noisy channel where Hidden Markov Models are used to model the channel characteristics. Weak statistical language models are used to predict what sentences are likely to be transmitted through the channel. These components are held together in the Token Passing framework which provides the desired tight coupling between orthographic pattern matching and linguistic expectation. The system, CTR (Connected Text Recognition), has been tested on two corpora derived from two different…
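The noisy-channel idea in the abstract can be illustrated with a minimal sketch. This is not CTR: a fixed per-edit penalty stands in for the Hidden Markov Model channel, a toy unigram table stands in for the weak language model, and a plain dynamic-programming search stands in for the Token Passing framework. The lexicon, the probabilities, the penalty value, and the function names are all hypothetical. As a further simplification, the typed spaces are discarded, so the decoder recovers tokenization and spelling jointly.

```python
import math

# Hypothetical toy lexicon with unigram probabilities (stand-in for a weak language model).
UNIGRAM = {"the": 0.4, "cat": 0.2, "sat": 0.2, "on": 0.1, "mat": 0.1}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def channel_logprob(typed: str, word: str, edit_penalty: float = 4.0) -> float:
    """Crude channel score: each edit costs a fixed log penalty (assumed, not an HMM)."""
    return -edit_penalty * edit_distance(typed, word)


def decode(text: str, max_span: int = 6) -> list[str]:
    """Jointly segment and correct `text` by maximizing log P(word) + log P(typed | word)."""
    chars = text.replace(" ", "")  # simplification: ignore the typed word boundaries
    # best[i] holds (score, words) for the best explanation of chars[:i].
    best = [(-math.inf, [])] * (len(chars) + 1)
    best[0] = (0.0, [])
    for end in range(1, len(chars) + 1):
        for start in range(max(0, end - max_span), end):
            prev_score, prev_words = best[start]
            if prev_score == -math.inf:
                continue
            span = chars[start:end]
            for word, prob in UNIGRAM.items():
                score = prev_score + math.log(prob) + channel_logprob(span, word)
                if score > best[end][0]:
                    best[end] = (score, prev_words + [word])
    return best[-1][1]


print(decode("thecat sat onthe mta"))
```

The same search handles run-together words ("thecat", "onthe") and a misspelling ("mta") because every candidate span is scored against every lexicon entry, so segmentation and correction are decided together rather than in separate passes.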
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
