Measures of Complexity for Large Scale Image Datasets

Ameet Annasaheb Rahane; Anbumani Subramanian

arXiv:2008.04431·cs.CV·August 12, 2020

TL;DR

This paper introduces computationally simple methods to quantify and visualize the complexity of large-scale image datasets, so that datasets can be compared by how hard they are for a deep-learning-based network to learn.

Contribution

The paper proposes entropy-based complexity metrics and an approach for visualizing high-dimensional data, and applies both to four autonomous driving image datasets: Cityscapes, IDD, BDD and Vistas.

Findings

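01. Entropy-based metrics give a rank order of the datasets by complexity

02. Visualizations of high-dimensional data assist visual comparison of datasets

03. The entropy-based rank order is compared with an established rank order of deep learning difficulty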
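
Comparing two rank orders can be made concrete with a rank correlation. The following is a minimal sketch, assuming Spearman's rho as the comparison measure; the page does not say which measure the authors use. The function name `spearman_rho` and the dataset labels `A` to `D` are hypothetical placeholders, not the paper's actual rankings.

```python
from typing import Sequence


def spearman_rho(order_a: Sequence[str], order_b: Sequence[str]) -> float:
    """Spearman's rank correlation between two orderings of the same items.

    Assumes no ties: each ordering lists every item exactly once.
    Returns 1.0 for identical orderings and -1.0 for fully reversed ones.
    """
    n = len(order_a)
    if n < 2:
        raise ValueError("need at least two items to compare rankings")
    if len(order_b) != n or len(set(order_a)) != n or set(order_a) != set(order_b):
        raise ValueError("orderings must contain the same distinct items")

    # Position of each item in the second ordering.
    position_b = {name: rank for rank, name in enumerate(order_b)}

    # Sum of squared rank differences across items.
    d_squared = sum((rank - position_b[name]) ** 2 for rank, name in enumerate(order_a))

    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Hypothetical orderings (placeholders, not results from the paper).
entropy_order = ["A", "B", "C", "D"]      # e.g. ranked by an entropy metric
reference_order = ["A", "C", "B", "D"]    # e.g. an established difficulty ranking
agreement = spearman_rho(entropy_order, reference_order)
```

With only four datasets, a single adjacent swap already moves rho noticeably, so a score like this is best read as a rough agreement indicator rather than a statistical test.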

Abstract

Large-scale image datasets are a growing trend in the field of machine learning. However, it is hard to quantitatively understand or specify how various datasets compare to each other, i.e., whether one dataset is more complex or harder to "learn" with respect to a deep-learning-based network. In this work, we build a series of relatively computationally simple methods to measure the complexity of a dataset. Furthermore, we present an approach for visualizing high-dimensional data, in order to assist with visual comparison of datasets. We present our analysis using four datasets from the autonomous driving research community: Cityscapes, IDD, BDD and Vistas. Using entropy-based metrics, we present a rank order of these datasets by complexity, which we compare with an established rank order with respect to deep learning.
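
To illustrate what an entropy-based complexity score might look like, here is a minimal sketch. It assumes the Shannon entropy of each image's intensity histogram, averaged over the dataset; the abstract does not specify the exact metrics, so this is one plausible stand-in rather than the authors' method. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Sequence


def shannon_entropy(pixels: Iterable[int]) -> float:
    """Shannon entropy, in bits, of the empirical intensity distribution."""
    counts = Counter(pixels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("image has no pixels")
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def dataset_entropy(images: Sequence[Sequence[int]]) -> float:
    """Mean per-image entropy, used here as an assumed complexity score."""
    if not images:
        raise ValueError("dataset is empty")
    return sum(shannon_entropy(image) for image in images) / len(images)


def rank_by_complexity(datasets: Dict[str, Sequence[Sequence[int]]]) -> List[str]:
    """Dataset names ordered from lowest to highest mean entropy."""
    return sorted(datasets, key=lambda name: dataset_entropy(datasets[name]))


# Toy data: flattened 8-bit grayscale "images" (hypothetical, not from the paper).
toy_datasets = {
    "flat": [[0, 0, 0, 0], [7, 7, 7, 7]],
    "two_tone": [[0, 0, 255, 255], [10, 10, 20, 20]],
    "varied": [[0, 85, 170, 255], [1, 2, 3, 4]],
}
ranking = rank_by_complexity(toy_datasets)
```

A histogram entropy ignores spatial structure, so two images with the same pixel values in different arrangements score identically. That limitation is one reason a real study would likely combine several metrics.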
