Automated Resolution of Noisy Bibliographic References
Markus Demleitner, Michael Kurtz, Alberto Accomazzi, Günther Eichhorn, Carolyn S. Grant, Steven S. Murray

TL;DR
This paper presents a new method for accurately resolving noisy bibliographic references from OCR scans by integrating correction, parsing, and matching processes inspired by dependency grammars, improving recall over traditional approaches.
Contribution
It introduces a novel approach that merges correction, parsing, and matching steps for noisy references, enhancing accuracy in bibliographic data retrieval.
Findings
The proposed method improves recall in resolving noisy references.
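The authors report on the effectiveness of the various heuristics they employed to improve recall.
The conventional three-step procedure (correct the OCR output, parse the corrected string, match it against the database) gives unsatisfactory results on these references.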
Abstract
We describe a system used by the NASA Astrophysics Data System to identify bibliographic references obtained from scanned article pages by OCR methods with records in a bibliographic database. We analyze the process generating the noisy references and conclude that the three-step procedure of correcting the OCR results, parsing the corrected string and matching it against the database provides unsatisfactory results. Instead, we propose a method that allows a controlled merging of correction, parsing and matching, inspired by dependency grammars. We also report on the effectiveness of various heuristics that we have employed to improve recall.
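To make the idea of merging correction and matching concrete, here is a minimal Python sketch. It is not the system described in the paper and does not implement a dependency-grammar approach. It only illustrates the general point that a noisy token need not be corrected before matching: each token keeps a tentative numeric reading, and a reading counts only when a database record supports it. The confusion table, the toy records, the scoring weights, and the names `numeric_reading`, `score`, and `resolve` are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical OCR confusion table (an assumption, not taken from the paper):
# letters that OCR commonly produces in place of digits.
OCR_DIGITS = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)

# Toy stand-in for the bibliographic database; all records are invented.
RECORDS = [
    {"id": "rec-A", "author": "Smith", "year": "1995", "volume": "110", "page": "1234"},
    {"id": "rec-B", "author": "Smyth", "year": "1985", "volume": "110", "page": "1284"},
    {"id": "rec-C", "author": "Jones", "year": "1995", "volume": "210", "page": "34"},
]


def numeric_reading(token):
    """Return the token read as a number under the confusion table, or None."""
    candidate = token.strip(".,;()").translate(OCR_DIGITS)
    return candidate if candidate.isdigit() else None


def score(reference, record):
    """Score a raw reference string against one record without correcting it first."""
    tokens = reference.split()
    numbers = {n for n in map(numeric_reading, tokens) if n is not None}
    total = 0.0
    # A tentative numeric reading only counts if the record confirms it.
    for field in ("year", "volume", "page"):
        if record[field] in numbers:
            total += 1.0
    # Author names are compared fuzzily against every token in the reference.
    author = record["author"].lower()
    total += max(
        SequenceMatcher(None, t.strip(".,;()").lower(), author).ratio()
        for t in tokens
    )
    return total


def resolve(reference, records=RECORDS, threshold=3.0):
    """Return the id of the best-scoring record, or None if the evidence is weak."""
    best = max(records, key=lambda r: score(reference, r))
    return best["id"] if score(reference, best) >= threshold else None
```

With these toy records, `resolve("Srnith, J. l995, AJ, 11O, 1234")` returns `"rec-A"`: the garbled year and volume are never rewritten in the input string, yet the record that agrees with their tentative readings wins. A reference with no supporting record, such as `"Brown 2001, 5, 77"`, falls below the threshold and stays unresolved.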
Taxonomy
Topics: Algorithms and Data Compression · Web Data Mining and Analysis · Natural Language Processing Techniques
