Using Deep Learning for Title-Based Semantic Subject Indexing to Reach Competitive Performance to Full-Text

Florian Mai; Lukas Galke; Ansgar Scherp

arXiv:1801.06717 · cs.DL · May 30, 2018

TL;DR

This study shows that deep learning models trained on large amounts of title data can come close to, or even exceed, the semantic subject indexing performance of models trained on full-texts: they outperform full-text models in the economics domain (EconBiz) and nearly match them in the medical domain (PubMed).

Contribution

The paper trains deep learning classifiers on large title datasets and shows that they can rival or outperform full-text models in subject indexing tasks.

Findings

01 Title-based classifiers outperform full-text classifiers on EconBiz by 9.4%.

02 On PubMed, title-based methods come within 2.9% of full-text performance.

03 Large-scale title data can effectively substitute for full-text data in semantic indexing.

Abstract

For (semi-)automated subject indexing systems in digital libraries, it is often more practical to use metadata such as the title of a publication instead of the full-text or the abstract. Therefore, it is desirable to have good text mining and text classification algorithms that already perform well on the title of a publication alone. So far, the classification performance on titles is not competitive with the performance on full-texts when the same number of training samples is used. However, it is much easier to obtain title data in large quantities and to use it for training than full-text data. In this paper, we investigate the question of how models trained on increasing amounts of title data compare to models trained on a constant number of full-texts. We evaluate this question on a large-scale dataset from the medical domain (PubMed) and from…
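
The comparison described in the abstract (growing amounts of title data against a fixed full-text baseline) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the per-document F1 metric, the `keyword_baseline` stand-in classifier, the helper names, and the toy data are hypothetical and are neither the paper's deep learning models nor code from the Quadflor repository.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

Example = Tuple[str, Set[str]]  # (text, gold subject labels)


def sample_f1(gold: Sequence[Set[str]], pred: Sequence[Set[str]]) -> float:
    """Average per-document F1 over a multi-label test set (assumed metric)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    total = 0.0
    for g, p in zip(gold, pred):
        denom = len(g) + len(p)
        # Convention chosen here: a document with no gold and no predicted
        # labels contributes 0.0.
        total += 2 * len(g & p) / denom if denom else 0.0
    return total / len(gold)


def keyword_baseline(train: Sequence[Example], test_texts: Sequence[str]) -> List[Set[str]]:
    """Toy stand-in classifier: predict every label seen with a known word."""
    word_labels: Dict[str, Set[str]] = {}
    for text, labels in train:
        for word in text.lower().split():
            word_labels.setdefault(word, set()).update(labels)
    predictions = []
    for text in test_texts:
        predicted: Set[str] = set()
        for word in text.lower().split():
            predicted |= word_labels.get(word, set())
        predictions.append(predicted)
    return predictions


def title_learning_curve(
    titles: Sequence[Example],
    test: Sequence[Example],
    train_and_predict: Callable[[Sequence[Example], Sequence[str]], List[Set[str]]],
    sizes: Sequence[int],
) -> Dict[int, float]:
    """Train on the first n titles for each n in sizes; return {n: F1}."""
    test_texts = [text for text, _ in test]
    gold = [labels for _, labels in test]
    return {
        n: sample_f1(gold, train_and_predict(titles[:n], test_texts))
        for n in sizes
    }


# Hypothetical toy data, not drawn from PubMed or EconBiz.
titles = [
    ("inflation and monetary policy", {"economics"}),
    ("tumor growth in mice", {"medicine"}),
    ("labor market inflation", {"economics"}),
    ("clinical trial of tumor therapy", {"medicine"}),
]
test = [
    ("monetary policy shocks", {"economics"}),
    ("tumor therapy outcomes", {"medicine"}),
]

scores = title_learning_curve(titles, test, keyword_baseline, sizes=[1, 2, 4])
print(scores)
```

In a real experiment, the score of a full-text model trained on a fixed number of documents would serve as a constant reference line against which the title curve is read.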

Code & Models

Repositories

florianmai/Quadflor (tf, Official)
