Scalable Data Balancing for Unlabeled Satellite Imagery
Deep Patel, Erin Gao, Anirudh Koul, Siddha Ganju, Meher Anand Kasam

TL;DR
This paper introduces an iterative method to balance unlabeled satellite imagery data by using image embeddings as proxies for labels, addressing the challenge of data imbalance without requiring manual annotations.
Contribution
The paper proposes an iterative approach that uses image embeddings to balance large-scale unlabeled satellite datasets, improving the accuracy of models trained on the balanced data without manual labeling.
Findings
Method effectively balances unlabeled satellite data
Increases overall accuracy of models trained on the balanced data
Applicable to large-scale satellite imagery datasets
Abstract
Data imbalance is a ubiquitous problem in machine learning. In large-scale collected and annotated datasets, data imbalance is either mitigated manually by undersampling frequent classes and oversampling rare classes, or planned for with imputation and augmentation techniques. In both cases, balancing data requires labels. In other words, only annotated data can be balanced. Collecting fully annotated datasets is challenging, especially for large-scale satellite systems such as NASA's unlabeled 35 PB Earth Imagery dataset. Although the NASA Earth Imagery dataset is unlabeled, there are implicit properties of the data source that we can rely on to hypothesize about its imbalance, such as the distribution of land and water in the case of the Earth's imagery. We present a new iterative method to balance unlabeled data. Our method utilizes image embeddings as a proxy for image labels that…
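The abstract is cut off before it describes the iterative procedure, so the following is only a minimal sketch of the general idea of using embeddings as a stand-in for labels, not the authors' method. It assumes that embeddings are already available as plain lists of floats, groups them with a basic k-means, and then caps how many items are kept from each group. The function names (`kmeans`, `balance_by_cluster`), the parameters `k` and `per_cluster`, and the toy "water" and "land" points are all hypothetical and chosen for illustration.

```python
import random
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (illustrative); returns one cluster id per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center.
        assign = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        # Move each center to the mean of its members (skip empty clusters).
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def balance_by_cluster(embeddings, k, per_cluster, seed=0):
    """Keep at most `per_cluster` randomly chosen items from each cluster.

    Cluster ids act as proxy labels (an assumption of this sketch), so the
    cap undersamples whatever the embedding space treats as frequent.
    """
    rng = random.Random(seed)
    assign = kmeans(embeddings, k, seed=seed)
    groups = defaultdict(list)
    for idx, c in enumerate(assign):
        groups[c].append(idx)
    keep = []
    for c in sorted(groups):
        members = groups[c]
        rng.shuffle(members)
        keep.extend(members[:per_cluster])
    return sorted(keep), assign


# Toy data: 90 "water-like" points near (0, 0), 10 "land-like" near (5, 5).
rng = random.Random(1)
water = [[rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(90)]
land = [[rng.gauss(5, 0.1), rng.gauss(5, 0.1)] for _ in range(10)]
kept, proxy_labels = balance_by_cluster(water + land, k=2, per_cluster=10)
print(len(kept), "items kept out of", len(water) + len(land))
```

A single cluster-and-cap pass like this is the simplest variant. An iterative scheme could repeat the clustering on the kept subset or adjust the cap between rounds, but the source does not state how its iterations work.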
Taxonomy
Topics: Machine Learning and Data Classification · Artificial Intelligence in Healthcare · Data Quality and Management
