GraphSculptor: Sculpting Pre-training Coreset for Graph Self-supervised Learning
Chuang Liu, Zelin Yao, Xueqi Ma, Luzhi Wang, Mukun Chen, Pinghua Xu, Wenbin Hu

TL;DR
GraphSculptor introduces a label-free coreset construction method for graph self-supervised learning, significantly reducing data and computational requirements while maintaining high downstream performance.
Contribution
It proposes a novel, unsupervised coreset construction approach combining structural and semantic diversity, with theoretical guarantees and practical efficiency.
Findings
A 10% coreset retains 99.6% of full-data performance.
Pre-training time is reduced by nearly 90%.
The method outperforms existing approaches in data efficiency.
Abstract
Graph self-supervised learning typically relies on large-scale unlabeled datasets, heavily inflating computational costs. However, empirical evidence suggests that these datasets contain substantial redundancy: our analysis reveals that uniformly subsampling 50% of graphs retains over 96% of downstream performance. To exploit this redundancy, we introduce GraphSculptor for pre-training coreset construction. Unlike methods dependent on additional training-time signals or limited solely to topological statistics, GraphSculptor provides a label-free solution that constructs coresets via two complementary perspectives: intrinsic structure and contextual semantics. Concretely, structural diversity is quantified using intrinsic graph statistics, yielding a structural feature vector for each graph, while semantic diversity is captured by utilizing a pre-trained language model to encode…
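To make the structural side of the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not say which graph statistics are used or how diverse graphs are selected, so the five statistics in `structural_features` and the greedy k-center (farthest-point) selection in `k_center_greedy` are assumptions chosen only for illustration. `uniform_subsample` mirrors the uniform subsampling baseline mentioned above. All function names are hypothetical, and the semantic (language model) perspective is left out.

```python
import math
import random


def structural_features(num_nodes, edges):
    """Hypothetical structural vector for one undirected graph:
    [num nodes, num edges, density, mean degree, max degree]."""
    degree = [0] * num_nodes
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    possible = num_nodes * (num_nodes - 1) / 2
    density = len(edges) / possible if possible else 0.0
    mean_deg = sum(degree) / num_nodes if num_nodes else 0.0
    max_deg = max(degree) if degree else 0
    return [float(num_nodes), float(len(edges)), density, mean_deg, float(max_deg)]


def standardize(rows):
    """Z-score each column so no single statistic dominates the distances."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]


def uniform_subsample(n, fraction, seed=0):
    """Baseline: keep a uniformly random fraction of the n graph indices."""
    rng = random.Random(seed)
    k = max(1, round(n * fraction))
    return sorted(rng.sample(range(n), k))


def k_center_greedy(points, k, start=0):
    """Assumed selection rule (not from the paper): repeatedly add the point
    farthest from everything selected so far."""
    k = min(k, len(points))
    selected = [start]
    min_dist = [math.dist(p, points[start]) for p in points]
    min_dist[start] = -1.0  # mark as already selected
    while len(selected) < k:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
        min_dist[nxt] = -1.0
    return selected


# Toy dataset: (num_nodes, edge list) pairs, including one exact duplicate.
graphs = [
    (4, [(0, 1), (1, 2), (2, 3)]),                           # path
    (4, [(0, 1), (1, 2), (2, 3)]),                           # duplicate path
    (4, [(0, 1), (1, 2), (2, 3), (3, 0)]),                   # cycle
    (5, [(0, 1), (0, 2), (0, 3), (0, 4)]),                   # star
    (4, [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]),   # complete graph
    (6, [(0, 1), (2, 3), (4, 5)]),                           # three disjoint edges
]

features = standardize([structural_features(n, e) for n, e in graphs])
coreset = k_center_greedy(features, k=3)
baseline = uniform_subsample(len(graphs), fraction=0.5)
```

In this toy setting the greedy rule never picks the duplicate path while structurally different graphs remain, which is the intuition behind selecting for diversity rather than sampling uniformly.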
