Surprisingly High Redundancy in Electronic Structure Data Across Materials Explained by Low Intrinsic Dimensionality

Sazzad Hossain, Ponkrshnan Thiagarajan, Shashank Pathrudkar, Stephanie Taylor, Abhijeet S. Gangan, Amartya S. Banerjee, and Susanta Ghosh

arXiv:2507.09001·cond-mat.mtrl-sci·May 4, 2026·2 cites

TL;DR

This paper finds substantial redundancy in electronic structure datasets across materials, attributes it to the low intrinsic dimensionality of the data, and shows that pruning can preserve accuracy while using up to two orders of magnitude less data and cutting training time by a factor of three or more.

Contribution

The paper shows that electronic structure data lie on a low-dimensional manifold and applies pruning methods that shrink the dataset by up to two orders of magnitude without sacrificing predictive accuracy.

Findings

01 Random pruning reduces dataset size with minimal accuracy loss.

02 Coverage-based pruning preserves accuracy using up to 100 times less data.

03 Electronic structure data lies on a low-dimensional, non-linear manifold.
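
One way to make the third finding concrete is to estimate intrinsic dimension from nearest-neighbor distances. The sketch below uses a TwoNN-style maximum-likelihood estimate on a synthetic Swiss-roll surface (a 2D sheet curled up in 3D). The estimator choice, the function names `make_swiss_roll` and `twonn_dimension`, and the toy data are assumptions for illustration only; they are not taken from the paper, which may use a different method.

```python
import numpy as np


def make_swiss_roll(n_points, rng):
    """Toy data (hypothetical): a 2D sheet rolled up inside 3D space."""
    t = rng.uniform(1.5 * np.pi, 4.5 * np.pi, size=n_points)  # angle along the roll
    h = rng.uniform(0.0, 10.0, size=n_points)                 # height along the roll axis
    return np.column_stack([t * np.cos(t), h, t * np.sin(t)])


def twonn_dimension(points):
    """TwoNN-style intrinsic dimension estimate.

    For each point, take the ratio mu = r2 / r1 of the distances to its
    second and first nearest neighbors. Under the TwoNN model, mu follows
    a Pareto law whose shape parameter is the intrinsic dimension, so the
    maximum-likelihood estimate is N / sum(log(mu)).
    """
    points = np.asarray(points, dtype=float)
    # Dense pairwise distances; fine for a few thousand points.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)  # exclude each point from its own neighbors

    # After partitioning at index 1, columns 0 and 1 hold the two smallest distances.
    nearest = np.partition(dist, 1, axis=1)[:, :2]
    r1, r2 = nearest[:, 0], nearest[:, 1]

    valid = r1 > 0  # skip exact duplicates, which would give a zero denominator
    mu = r2[valid] / r1[valid]
    return mu.size / np.sum(np.log(mu))


rng = np.random.default_rng(0)
roll = make_swiss_roll(1000, rng)
print("ambient dimension:", roll.shape[1])
print("estimated intrinsic dimension:", round(float(twonn_dimension(roll)), 2))
```

The points live in three coordinates, yet the estimate should land near two, because only two parameters (angle and height) generate the data. A linear method such as PCA would still need all three axes to describe this surface, which is the sense in which the manifold is non-linear.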

Abstract

Machine learning (ML) models for electronic structure typically rely on large datasets generated by computationally expensive Kohn-Sham density functional theory calculations, as it is not known a priori which portions of the data are essential for accurate learning. Here, we reveal significant redundancies in electronic structure datasets across diverse material systems and attribute them to the low intrinsic dimensionality of the underlying data. We show that even random pruning can substantially reduce dataset size with minimal degradation in predictive accuracy. Moreover, a state-of-the-art coverage-based pruning strategy that samples data across all learning difficulties preserves chemical accuracy and model generalizability while using up to two orders of magnitude less data and reducing training time by a factor of three or more. We further demonstrate that the essential…
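
The two pruning strategies named in the abstract can be sketched as follows. This is a minimal illustration, assuming each training sample already has a scalar difficulty score (for example, an error from a preliminary model). The functions `random_prune` and `coverage_prune`, the equal-width binning, and the synthetic scores are hypothetical and are not the paper's implementation.

```python
import numpy as np


def random_prune(n_samples, keep_fraction, rng):
    """Keep a uniformly random subset of sample indices."""
    n_keep = max(1, int(round(keep_fraction * n_samples)))
    return np.sort(rng.choice(n_samples, size=n_keep, replace=False))


def coverage_prune(difficulty, keep_fraction, n_bins, rng):
    """Stratified sampling across difficulty bins (a coverage-style sketch).

    Samples are grouped into equal-width difficulty bins and the budget is
    split evenly across bins, so easy, medium, and hard samples are all
    represented. Bins are visited from smallest to largest; budget a small
    bin cannot use is passed on to the remaining bins.
    """
    difficulty = np.asarray(difficulty, dtype=float)
    n_keep = max(1, int(round(keep_fraction * difficulty.size)))

    edges = np.linspace(difficulty.min(), difficulty.max(), n_bins + 1)
    bin_ids = np.digitize(difficulty, edges[1:-1])  # values in 0 .. n_bins - 1
    bins = [np.flatnonzero(bin_ids == b) for b in range(n_bins)]
    bins = sorted((b for b in bins if b.size > 0), key=len)

    kept, budget = [], n_keep
    for i, members in enumerate(bins):
        share = budget // (len(bins) - i)  # even split of what is left
        take = min(share, members.size)
        kept.append(rng.choice(members, size=take, replace=False))
        budget -= take
    return np.sort(np.concatenate(kept))


rng = np.random.default_rng(0)
# Synthetic, skewed difficulty scores: most samples are easy, few are hard.
difficulty = np.linspace(0.0, 1.0, 1000) ** 3
kept_random = random_prune(difficulty.size, 0.01, rng)
kept_coverage = coverage_prune(difficulty, 0.01, n_bins=5, rng=rng)
print("random   difficulty range:", difficulty[kept_random].min(), difficulty[kept_random].max())
print("coverage difficulty range:", difficulty[kept_coverage].min(), difficulty[kept_coverage].max())
```

With skewed scores, a uniformly random 1% subset tends to be dominated by easy samples, while the stratified version is guaranteed to draw from every non-empty difficulty bin as long as the budget is at least the number of bins.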
