Surprisingly High Redundancy in Electronic Structure Data Across Materials Explained by Low Intrinsic Dimensionality
Sazzad Hossain, Ponkrshnan Thiagarajan, Shashank Pathrudkar, Stephanie Taylor, Abhijeet S. Gangan, Amartya S. Banerjee, and Susanta Ghosh

TL;DR
This paper uncovers significant redundancy in electronic structure datasets across materials, attributing it to low intrinsic dimensionality, and demonstrates effective data pruning strategies that maintain accuracy while greatly reducing dataset size and training time.
Contribution
The paper shows that electronic structure data has a low-dimensional manifold structure and demonstrates pruning methods that drastically reduce dataset size without sacrificing predictive accuracy.
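To make "low intrinsic dimensionality" concrete, the sketch below estimates the intrinsic dimension of a point cloud with the TwoNN maximum-likelihood estimator, which uses only the ratio of each point's second to first nearest-neighbor distance. This is an illustrative choice, not necessarily the estimator used in the paper, and the synthetic data (a curved 2-D sheet embedded in 3-D) is a stand-in for real electronic structure fields.

```python
import numpy as np


def twonn_dimension(X):
    """TwoNN maximum-likelihood estimate of intrinsic dimension.

    X has shape (n_points, n_features). The estimator assumes the ratio
    mu = r2 / r1 of second to first nearest-neighbor distances follows
    a Pareto law with exponent equal to the intrinsic dimension.
    """
    diff = X[:, None, :] - X[None, :, :]
    d2 = np.einsum("ijk,ijk->ij", diff, diff)  # squared pairwise distances
    np.fill_diagonal(d2, np.inf)               # exclude self-distances
    two_smallest = np.partition(d2, 1, axis=1)[:, :2]
    # log(r2 / r1) = 0.5 * log(r2^2 / r1^2)
    log_mu = 0.5 * np.log(two_smallest[:, 1] / two_smallest[:, 0])
    return len(X) / log_mu.sum()


# Hypothetical stand-in data: a smooth 2-D surface living in 3-D space.
rng = np.random.default_rng(0)
u, v = rng.uniform(0.0, 1.0, size=(2, 1000))
X = np.column_stack([u, v, np.sin(2.0 * u) * np.cos(2.0 * v)])

estimated_dim = twonn_dimension(X)
print(f"ambient dimension: {X.shape[1]}, estimated intrinsic dimension: {estimated_dim:.2f}")
```

The estimate should land near 2 even though each point has three coordinates, which is the sense in which data on a manifold carries fewer degrees of freedom than its raw representation suggests.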
Findings
Random pruning reduces dataset size with minimal accuracy loss.
Coverage-based pruning preserves chemical accuracy using up to 100 times less data and reduces training time by a factor of three or more.
Electronic structure data lies on a low-dimensional, non-linear manifold.
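The two pruning styles named above can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes each training sample already has a scalar difficulty score (the `scores` array here is synthetic and hypothetical), and the coverage-based variant is a plain stratified sampler that splits the score range into equal-width bins and spreads the budget across them, so that easy and hard samples are both represented.

```python
import numpy as np


def random_prune(n_samples, keep, rng):
    """Keep `keep` sample indices chosen uniformly at random."""
    return rng.choice(n_samples, size=keep, replace=False)


def coverage_prune(scores, keep, n_bins, rng):
    """Keep `keep` indices spread across difficulty-score bins.

    Bins are visited from least to most populated; each gets an equal
    share of the remaining budget, and any share a small bin cannot
    fill is passed on to the larger bins that follow.
    """
    edges = np.linspace(scores.min(), scores.max(), n_bins + 1)
    bin_ids = np.digitize(scores, edges[1:-1])  # values in 0 .. n_bins - 1
    bins = [np.flatnonzero(bin_ids == b) for b in range(n_bins)]
    bins = sorted((b for b in bins if len(b) > 0), key=len)

    selected, budget = [], keep
    for i, members in enumerate(bins):
        share = budget // (len(bins) - i)
        take = min(share, len(members))
        selected.append(rng.choice(members, size=take, replace=False))
        budget -= take
    return np.concatenate(selected)


# Hypothetical difficulty scores: most samples easy, a few hard.
rng = np.random.default_rng(0)
scores = rng.exponential(size=10_000)

kept_random = random_prune(len(scores), keep=100, rng=rng)
kept_coverage = coverage_prune(scores, keep=100, n_bins=10, rng=rng)

print("hardest score kept (random):  ", scores[kept_random].max())
print("hardest score kept (coverage):", scores[kept_coverage].max())
```

Both functions return 100 indices out of 10,000, a 100-fold reduction. With a skewed score distribution, uniform random selection mostly returns easy samples, while the stratified version guarantees that every populated difficulty bin contributes at least one sample.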
Abstract
Machine learning (ML) models for electronic structure typically rely on large datasets generated by computationally expensive Kohn-Sham density functional theory calculations, as it is not known a priori which portions of the data are essential for accurate learning. Here, we reveal significant redundancies in electronic structure datasets across diverse material systems and attribute them to the low intrinsic dimensionality of the underlying data. We show that even random pruning can substantially reduce dataset size with minimal degradation in predictive accuracy. Moreover, a state-of-the-art coverage-based pruning strategy that samples data across all learning difficulties preserves chemical accuracy and model generalizability while using up to two orders of magnitude less data and reducing training time by a factor of three or more. We further demonstrate that the essential…
