Text sampling strategies for predicting missing bibliographic links
F. V. Krasnova, I. S. Smaznevicha, E. N. Baskakova

TL;DR
This paper explores text sampling strategies for sentence classification aimed at detecting missing bibliographic links, and shows that adding sentence context and ensemble voting improves classification accuracy.
Contribution
It introduces a novel sampling approach that incorporates sentence context and ensemble voting to optimize classification of missing bibliographic links.
Findings
Including sentence context improves classification accuracy.
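Ensemble voting over the same data sampled in different ways automatically determines the optimal sampling strategy for a given text collection.
A context-aware sampling strategy with hard voting achieves an F1-score of 98% in link detection.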
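To make the hard-voting idea in the findings concrete, here is a minimal sketch, not the authors' implementation. It assumes each sampling strategy has already produced one binary label per sentence (1 for "a bibliographic link is missing", 0 otherwise). The function name `hard_vote` and the tie-breaking rule are illustrative assumptions.

```python
from collections import Counter
from typing import List, Sequence


def hard_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-strategy labels by majority (hard) voting.

    predictions[k][i] is the label that the classifier trained on
    sampling strategy k assigned to sentence i.
    """
    n_sentences = len(predictions[0])
    if any(len(p) != n_sentences for p in predictions):
        raise ValueError("every strategy must label the same sentences")

    voted = []
    for labels in zip(*predictions):  # labels for one sentence across strategies
        counts = Counter(labels)
        top = max(counts.values())
        # Assumption: ties are broken toward the smallest label (0 = no link missing).
        voted.append(min(label for label, c in counts.items() if c == top))
    return voted


# Three hypothetical strategies voting on three sentences.
votes = hard_vote([[1, 0, 1], [1, 1, 0], [0, 1, 1]])
```

With an odd number of strategies and binary labels, ties cannot occur, so the tie-breaking rule only matters for even-sized ensembles.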
Abstract
The paper proposes several strategies for sampling text data in automatic sentence classification aimed at detecting missing bibliographic links. We construct samples from sentences, treated as the semantic units of the text, and add their immediate context, which consists of several neighboring sentences. We examine a number of sampling strategies that differ in context size and position. The experiment is carried out on a collection of STEM scientific papers. Including sentence context in the samples improves classification results. We automatically determine the optimal sampling strategy for a given text collection by applying ensemble voting to classifications of the same data sampled in different ways. The sampling strategy that takes sentence context into account, combined with a hard voting procedure, achieves an F1-score of 98%. This…
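The abstract describes samples built from a target sentence plus a context window that varies in size and position. The sketch below illustrates one way such samples could be constructed. The strategy names, the window sizes, and the helpers `build_sample` and `sample_document` are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical strategies: name -> (sentences before, sentences after).
STRATEGIES: Dict[str, Tuple[int, int]] = {
    "sentence_only": (0, 0),
    "left_1": (1, 0),
    "right_1": (0, 1),
    "symmetric_1": (1, 1),
}


def build_sample(sentences: List[str], index: int, before: int, after: int) -> str:
    """Join the target sentence with its neighbors, clipped at document edges."""
    start = max(0, index - before)
    end = min(len(sentences), index + after + 1)
    return " ".join(sentences[start:end])


def sample_document(sentences: List[str], before: int, after: int) -> List[str]:
    """Produce one sample per sentence for a given context window."""
    return [build_sample(sentences, i, before, after) for i in range(len(sentences))]


doc = ["A.", "B.", "C."]
# One list of samples per strategy, all aligned to the same sentences.
samples = {name: sample_document(doc, b, a) for name, (b, a) in STRATEGIES.items()}
```

Because every strategy yields exactly one sample per sentence, classifiers trained on the different samplings produce aligned predictions that can later be compared or combined.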
Taxonomy
Topics: Scientific Research and Philosophical Inquiry · Information Systems and Technology Applications
