TL;DR
This paper examines the challenges and opportunities of implementing nearly unsupervised information extraction workflows in digital libraries. It analyzes case studies in three domains (encyclopedias, represented by Wikipedia; pharmacy; and political sciences) to assess extraction quality and how libraries can handle the results in practice.
Contribution
It provides an analysis of unsupervised extraction workflows in digital libraries, highlighting opportunities, limitations, and best practices based on case studies.
Findings
Unsupervised extraction can be effective but produces non-canonicalized results.
Supervised extraction methods rely on domain-specific training data, which makes reliable workflows cost-intensive to design.
The paper reports opportunities and limitations from the case studies and discusses best practices for unsupervised extraction workflows.
Abstract
Information extraction can support novel and effective access paths for digital libraries. Nevertheless, designing reliable extraction workflows can be cost-intensive in practice. On the one hand, suitable extraction methods rely on domain-specific training data. On the other hand, unsupervised and open extraction methods usually produce non-canonicalized extraction results. This paper addresses the question of how digital libraries can handle such extractions and whether their quality is sufficient in practice. We focus on unsupervised extraction workflows by analyzing them in case studies in the domains of encyclopedias (Wikipedia), pharmacy, and political sciences. We report on opportunities and limitations. Finally, we discuss best practices for unsupervised extraction workflows.
