Data Augmentation Techniques for Process Extraction from Scientific Publications
Yuni Susanti

TL;DR
This paper introduces data augmentation techniques tailored for process extraction in scientific publications, significantly enhancing model accuracy and robustness, especially in low-resource chemistry datasets.
Contribution
The paper proposes novel data augmentation methods that leverage process-specific information, role label similarity, and sentence similarity for improved process extraction.
Findings
Up to 12.3 points improvement in F-score on chemistry domain datasets.
Enhanced performance on small and low-resource datasets.
Potential reduction in overfitting.
Abstract
We present data augmentation techniques for process extraction tasks in scientific publications. We cast the process extraction task as a sequence labeling task where we identify all the entities in a sentence and label them according to their process-specific roles. The proposed method attempts to create meaningful augmented sentences by utilizing (1) process-specific information from the original sentence, (2) role label similarity, and (3) sentence similarity. We demonstrate that the proposed methods substantially improve the performance of the process extraction model trained on chemistry domain datasets, by up to 12.3 points in F-score. The proposed methods could potentially reduce overfitting as well, especially when training on small datasets or in a low-resource setting such as in chemistry and other scientific domains.
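To make the sequence labeling framing concrete, here is a minimal sketch of one simple augmentation in the same spirit: replacing an entity mention with another mention that carries the same role label elsewhere in the training data. This is a generic same-label mention replacement, not the paper's method; the abstract does not specify how role label similarity or sentence similarity are computed, so neither is modeled here. The BIO tagging scheme, the role names `MATERIAL` and `CONDITION`, the toy sentences, and all function names are assumptions made for illustration.

```python
import random
from collections import defaultdict

# A sentence is a list of (token, BIO tag) pairs. The role names used below
# (MATERIAL, CONDITION) are hypothetical and not taken from the paper.
Sentence = list[tuple[str, str]]

def extract_spans(sentence):
    """Return (start, end, role) for each BIO-tagged span; end is exclusive."""
    spans = []
    start, role = None, None
    for i, (_, tag) in enumerate(sentence):
        continues = tag.startswith("I-") and tag[2:] == role
        if not continues and start is not None:
            # The open span ends before this token.
            spans.append((start, i, role))
            start, role = None, None
        if tag.startswith("B-"):
            start, role = i, tag[2:]
    if start is not None:
        spans.append((start, len(sentence), role))
    return spans

def build_mention_pool(corpus):
    """Collect, for every role label, the distinct mentions seen in the corpus."""
    pool = defaultdict(set)
    for sent in corpus:
        for start, end, role in extract_spans(sent):
            pool[role].add(tuple(tok for tok, _ in sent[start:end]))
    # Sort so that sampling is reproducible for a fixed random seed.
    return {role: sorted(mentions) for role, mentions in pool.items()}

def augment(sentence, pool, rng, p=0.5):
    """Replace each entity, with probability p, by a different same-role mention."""
    out = []
    cursor = 0
    for start, end, role in extract_spans(sentence):
        out.extend(sentence[cursor:start])  # copy tokens before the entity
        original = tuple(tok for tok, _ in sentence[start:end])
        candidates = [m for m in pool.get(role, []) if m != original]
        if candidates and rng.random() < p:
            mention = rng.choice(candidates)
        else:
            mention = original
        # Re-tag the chosen mention so the BIO sequence stays valid.
        out.extend(
            (tok, ("B-" if j == 0 else "I-") + role)
            for j, tok in enumerate(mention)
        )
        cursor = end
    out.extend(sentence[cursor:])  # copy tokens after the last entity
    return out

# Toy, made-up training sentences for illustration only.
corpus = [
    [("Heat", "O"), ("the", "O"), ("sodium", "B-MATERIAL"),
     ("chloride", "I-MATERIAL"), ("at", "O"),
     ("80", "B-CONDITION"), ("C", "I-CONDITION")],
    [("Stir", "O"), ("ethanol", "B-MATERIAL"), ("for", "O"),
     ("2", "B-CONDITION"), ("hours", "I-CONDITION")],
]

if __name__ == "__main__":
    pool = build_mention_pool(corpus)
    augmented = augment(corpus[0], pool, random.Random(0), p=1.0)
    print(" ".join(f"{tok}/{tag}" for tok, tag in augmented))
```

With only two sentences and `p=1.0`, every entity in the first sentence is swapped for the sole same-role alternative, which also shows the weakness of the naive version: a duration can land where a temperature belongs. Filtering candidates by how similar the source sentences or role contexts are, as the abstract suggests, is one way to keep augmented sentences meaningful.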
Taxonomy
Topics: Semantic Web and Ontologies
