TL;DR
This paper introduces a novel approach for tokenization repair that leverages deep language models trained on error-prone text, improving the correction of tokenization and spelling errors across various real-world scenarios.
Contribution
It presents a new method using bidirectional deep language models trained on erroneous text, incorporating existing space information, and demonstrates superior performance on six diverse benchmarks.
Findings
Outperforms existing tokenization repair methods.
Enhances spell checkers by fixing more errors.
Effective across OCR, PDF extraction, and human errors.
Abstract
We consider the following tokenization repair problem: Given a natural language text with any combination of missing or spurious spaces, correct these. Spelling errors can be present, but it's not part of the problem to correct them. For example, given: "Tispa per isabout token izaionrep air", compute "Tis paper is about tokenizaion repair". We identify three key ingredients of high-quality tokenization repair, all missing from previous work: deep language models with a bidirectional component, training the models on text with spelling errors, and making use of the space information already present. Our methods also improve existing spell checkers by fixing not only more tokenization errors but also more spelling errors: once it is clear which characters form a word, it is much easier for them to figure out the correct word. We provide six benchmarks that cover three use cases (OCR…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
MethodsRepair
