TL;DR
This study shows that deep learning models trained on large-scale title data can come close to, or even exceed, the semantic subject indexing performance of models trained on full-texts, as evaluated in the medical (PubMed) and economics (EconBiz) domains.
Contribution
The paper introduces deep learning classifiers trained on large title datasets, showing they can rival or outperform full-text models in subject indexing tasks.
Findings
Title-based classifiers outperform full-text classifiers on EconBiz by 9.4%.
On PubMed, title-based methods are within 2.9% of full-text performance.
Large-scale title data can effectively substitute for full-text data in semantic indexing.
Abstract
For (semi-)automated subject indexing systems in digital libraries, it is often more practical to use metadata such as the title of a publication instead of the full-text or the abstract. Therefore, it is desirable to have good text mining and text classification algorithms that already perform well on the title of a publication. So far, the classification performance on titles is not competitive with the performance on the full-texts if the same number of training samples is used for training. However, it is much easier to obtain title data in large quantities and to use it for training than full-text data. In this paper, we investigate the question of how models obtained from training on increasing amounts of title training data compare to models from training on a constant number of full-texts. We evaluate this question on a large-scale dataset from the medical domain (PubMed) and from…
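The comparison described in the abstract, growing title training sets measured against a fixed full-text training set, can be sketched roughly as follows. This is only an illustration of the experimental design, not the authors' code: the names `nested_subsets`, `learning_curve`, and the caller-supplied `train_and_score` function are hypothetical, and the choice to make the title subsets nested (each smaller set contained in the larger ones) is an assumption.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

# One training example: (text, list of subject labels). Hypothetical format.
Example = Tuple[str, List[str]]


def nested_subsets(
    examples: Sequence[Example], sizes: Sequence[int], seed: int = 0
) -> Dict[int, List[Example]]:
    """Return subsets of increasing size, each a prefix of one fixed shuffle.

    Using prefixes of a single shuffled list guarantees that every smaller
    subset is contained in every larger one (an assumption of this sketch).
    Sizes larger than the number of available examples are skipped.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return {n: shuffled[:n] for n in sorted(sizes) if n <= len(shuffled)}


def learning_curve(
    title_examples: Sequence[Example],
    fulltext_examples: Sequence[Example],
    sizes: Sequence[int],
    train_and_score: Callable[[Sequence[Example]], float],
) -> Tuple[float, Dict[int, float]]:
    """Score one fixed full-text baseline and one title model per subset size.

    `train_and_score` is supplied by the caller: it should train a classifier
    on the given examples and return its score on a held-out test set.
    """
    baseline = train_and_score(fulltext_examples)
    curve = {
        n: train_and_score(subset)
        for n, subset in nested_subsets(title_examples, sizes).items()
    }
    return baseline, curve
```

In practice, `train_and_score` would wrap whatever classifier and evaluation metric are under study; the sketch fixes only the shape of the comparison, in which the full-text baseline is computed once and the title score is tracked as the training set grows.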
