Overcoming the Generalization Limits of SLM Finetuning for Shape-Based Extraction of Datatype and Object Properties
Célian Ringwald, Fabien Gandon, Catherine Faron, Franck Michel, Hanna Abi Akl

TL;DR
This paper explores methods to improve small language models' ability to extract complete RDF graphs, especially addressing challenges with rare properties due to data imbalance, and offers practical training strategies.
Contribution
It identifies the long-tail distribution as a key challenge and proposes effective data balancing strategies, including dataset scaling and synthetic data augmentation, for shape-based relation extraction.
Findings
Balanced training sets improve extraction of rare properties
Synthetic data augmentation enhances model performance
Reproducible datasets and code are provided
Abstract
Small language models (SLMs) have shown promise for relation extraction (RE) when extracting RDF triples guided by SHACL shapes focused on common datatype properties. This paper investigates how SLMs handle both datatype and object properties for complete RDF graph extraction. We show that the key bottleneck is the long-tail distribution of rare properties. To solve this issue, we evaluate several strategies: stratified sampling, weighted loss, dataset scaling, and template-based synthetic data augmentation. We show that the best strategy to perform equally well over unbalanced target properties is to build a training set where the number of occurrences of each property exceeds a given threshold. To enable reproducibility, we publicly released our datasets, experimental results and code. Our findings offer practical guidance for training shape-aware SLMs and highlight…
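The threshold idea in the abstract can be illustrated with a small sketch. This is not the authors' pipeline: the function name `select_until_threshold`, the greedy rare-first ordering, the toy examples, and the property names (`label`, `birthDate`, `almaMater`) are all assumptions made for illustration. It only shows one simple way to pick training examples until every property occurs at least a minimum number of times.

```python
from collections import Counter

def select_until_threshold(examples, min_count):
    """Greedily select examples until each property occurs at least
    min_count times (as far as the pool allows).

    `examples` is a list of (text, properties) pairs. Hypothetical helper,
    not taken from the paper.
    """
    # How often each property appears in the whole pool (once per example).
    pool_freq = Counter(p for _, props in examples for p in set(props))

    # Visit examples that contain the rarest properties first (stable sort).
    def rarity(example):
        props = example[1]
        return min(pool_freq[p] for p in props) if props else float("inf")

    counts = Counter()
    selected = []
    for text, props in sorted(examples, key=rarity):
        # Keep the example only if it helps a property still under the threshold.
        if any(counts[p] < min_count for p in props):
            selected.append((text, props))
            counts.update(set(props))
    return selected, counts

# Toy pool with an imbalanced property distribution (made-up data).
examples = [
    ("t1", ["label", "birthDate"]),
    ("t2", ["label", "birthDate"]),
    ("t3", ["label"]),
    ("t4", ["label", "almaMater"]),
    ("t5", ["label", "birthDate", "almaMater"]),
    ("t6", ["label"]),
]

selected, counts = select_until_threshold(examples, min_count=2)
print([text for text, _ in selected])
print(dict(counts))
```

On this toy pool, the rare property `almaMater` reaches the threshold, while redundant examples that only carry the frequent `label` property are skipped. When the pool itself holds fewer than `min_count` occurrences of a property, selection alone cannot reach the threshold; that is the case where the abstract's dataset scaling or synthetic augmentation would be needed.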
