Visual Representations: Defining Properties and Deep Approximations
Stefano Soatto, Alessandro Chiuso

TL;DR
This paper formalizes visual representations as minimal sufficient and invariant statistics, linking them analytically to common computer vision features and neural network practices, clarifying underlying assumptions and approximations.
Contribution
It provides analytical expressions for invariant sufficient statistics and connects them to existing feature descriptors and CNN operations, revealing their theoretical foundations.
Findings
Derived explicit formulas for minimal sufficient invariant representations.
Linked feature descriptors and CNN practices to theoretical properties.
Explained empirical techniques like pooling and normalization through formal analysis.
Abstract
Visual representations are defined in terms of minimal sufficient statistics of visual data, for a class of tasks, that are also invariant to nuisance variability. Minimal sufficiency guarantees that we can store a representation in lieu of raw data with smallest complexity and no performance loss on the task at hand. Invariance guarantees that the statistic is constant with respect to uninformative transformations of the data. We derive analytical expressions for such representations and show they are related to feature descriptors commonly used in computer vision, as well as to convolutional neural networks. This link highlights the assumptions and approximations tacitly made by these methods and explains empirical practices such as clamping, pooling and joint normalization.
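The invariance property described in the abstract can be illustrated with a toy sketch. The example below is not taken from the paper: it assumes a 1-D signal, takes cyclic shifts as the nuisance group, and uses a hypothetical fixed `template`. It pools template responses with a max over the whole orbit of shifts, which makes the resulting statistic constant under any shift of the input. It demonstrates invariance only; a max over the orbit discards information, so no claim of sufficiency is made.

```python
import numpy as np

def shift_orbit(x):
    """Return all cyclic shifts of a 1-D signal (its orbit under the shift group)."""
    return np.stack([np.roll(x, k) for k in range(len(x))])

def pooled_statistic(x, template):
    """Max-pool the template response over the orbit of cyclic shifts.

    Illustrative assumption: the nuisance group is the set of cyclic shifts,
    and `template` is an arbitrary fixed filter (hypothetical, not from the paper).
    """
    responses = shift_orbit(x) @ template  # one response per shifted copy
    return responses.max()

rng = np.random.default_rng(0)
x = rng.normal(size=16)
template = rng.normal(size=16)

phi_x = pooled_statistic(x, template)
phi_shifted = pooled_statistic(np.roll(x, 5), template)

# Shifting the input only permutes the orbit, so the pooled value is unchanged.
is_invariant = bool(np.isclose(phi_x, phi_shifted))
```

Because a shifted input generates the same set of shifted copies, the maximum is taken over the same values in a different order, which is why `is_invariant` holds for any shift amount.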
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications · Advanced Vision and Imaging
