Automatically detecting scientific political science texts from a large general document index
Nina Smirnova

TL;DR
This paper presents a combined keyword and BERT-based filtering method to accurately identify political science articles within a large general document index, enhancing literature retrieval efficiency.
Contribution
It introduces a novel hybrid filtering approach using weighted keywords and multilingual BERT classification for domain-specific article detection.
Findings
Weighted keyword filter achieved 88% accuracy.
BERT-based classifier reached 98% accuracy.
Both filters were evaluated on an annotated dataset of articles from different scientific domains.
Abstract
This technical report outlines the filtering approach applied to the Bielefeld Academic Search Engine (BASE) data collection to extract articles from the political science domain. We combined hard and soft filters to handle entries with different available metadata, e.g., title, abstract, or keywords. The hard filter is a weighted keyword-based approach. The soft filter uses a multilingual BERT-based classification model trained to detect scientific articles from the political science domain. We evaluated both approaches on an annotated dataset of scientific articles from different scientific domains. The weighted keyword-based approach achieved a highest total accuracy of 0.88. The multilingual BERT-based classification model was fine-tuned on a dataset of 14,178 abstracts from scientific articles and reached a highest total accuracy of 0.98. The proposed…
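The hard and soft filters described in the abstract could be combined roughly as in the sketch below. It is an illustration under stated assumptions, not the report's implementation: the keyword list, the weights, the `THRESHOLD` value, and the routing rule (use the classifier when an abstract is available, otherwise fall back to keywords) are all hypothetical, and `soft_filter` stands in for any callable that wraps a fine-tuned multilingual BERT classifier.

```python
import re
from typing import Callable, Optional

# Hypothetical keyword weights; the report's actual keyword list is not reproduced here.
KEYWORD_WEIGHTS = {
    "election": 1.0,
    "parliament": 1.0,
    "democracy": 0.8,
    "policy": 0.5,
    "government": 0.5,
}
THRESHOLD = 1.0  # assumed cut-off, not a value from the report


def keyword_score(text: str) -> float:
    """Sum the weights of the distinct keywords that occur in the text."""
    tokens = set(re.findall(r"\w+", text.lower()))
    return sum(w for kw, w in KEYWORD_WEIGHTS.items() if kw in tokens)


def hard_filter(entry: dict) -> bool:
    """Weighted keyword filter over the title and keyword metadata."""
    parts = [entry.get("title", "")] + list(entry.get("keywords", []))
    return keyword_score(" ".join(parts)) >= THRESHOLD


def is_political_science(
    entry: dict,
    soft_filter: Optional[Callable[[str], bool]] = None,
) -> bool:
    """Assumed routing: classify the abstract if present, else use keywords."""
    abstract = entry.get("abstract")
    if abstract and soft_filter is not None:
        return soft_filter(abstract)
    return hard_filter(entry)
```

An entry with only a title, such as `{"title": "Election turnout and parliament composition"}`, is decided by the keyword score alone, while an entry with an abstract is passed to whatever classifier is supplied as `soft_filter`.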
Taxonomy
Topics: Topic Modeling
