Fair Comparison: Quantifying Variance in Results for Fine-grained Visual Categorization
Matthew Gwilliam, Adam Teuscher, Connor Anderson, Ryan Farrell

TL;DR
This paper investigates the variability in fine-grained visual categorization results, emphasizing the importance of considering variance and per-class performance alongside average accuracy for more reliable model evaluation.
Contribution
It quantifies the extent of performance variance in FGVC models and highlights the need for comprehensive metrics beyond average accuracy.
Findings
Significant variation exists across models and class distributions.
Per-class performance varies notably even among similar models.
Certain techniques can reduce variance in FGVC results.
Abstract
For the task of image classification, researchers work arduously to develop the next state-of-the-art (SOTA) model, each benchmarking their own performance against that of their predecessors and of their peers. Unfortunately, the metric used most frequently to describe a model's performance, average categorization accuracy, is often used in isolation. As the number of classes increases, such as in fine-grained visual categorization (FGVC), the amount of information conveyed by average accuracy alone dwindles. While its most glaring weakness is its failure to describe the model's performance on a class-by-class basis, average accuracy also fails to describe how performance may vary from one trained model of the same architecture, on the same dataset, to another (both averaged across all categories and at the per-class level). We first demonstrate the magnitude of these variations across…
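As a rough illustration of the two quantities the abstract contrasts with average accuracy, the sketch below computes per-class accuracy and the run-to-run spread of both overall and per-class accuracy across several trained models. It is a minimal sketch and not the authors' evaluation code: the function names `per_class_accuracy` and `summarize_runs`, the use of sample standard deviation as the spread measure, and the toy labels are all assumptions made for illustration.

```python
import numpy as np


def per_class_accuracy(y_true, y_pred, num_classes):
    """Fraction of correctly classified samples per class (NaN for empty classes)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Count correct predictions and total samples for each ground-truth class.
    correct = np.bincount(y_true[y_true == y_pred], minlength=num_classes)
    total = np.bincount(y_true, minlength=num_classes)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(total > 0, correct / total, np.nan)


def summarize_runs(y_true, runs_pred, num_classes):
    """Summarize several runs (hypothetical helper).

    runs_pred holds one prediction array per independently trained model,
    all evaluated on the same test labels y_true.
    """
    y_true = np.asarray(y_true)
    overall = np.array([np.mean(np.asarray(p) == y_true) for p in runs_pred])
    per_class = np.stack(
        [per_class_accuracy(y_true, p, num_classes) for p in runs_pred]
    )
    return {
        "mean_accuracy": overall.mean(),
        # Sample standard deviation across runs (assumed spread measure).
        "std_accuracy": overall.std(ddof=1),
        "per_class_mean": per_class.mean(axis=0),
        "per_class_std": per_class.std(axis=0, ddof=1),
    }


# Toy, made-up labels: three classes, two runs of the same (hypothetical) model.
y_true = [0, 0, 1, 1, 2, 2]
runs_pred = [
    [0, 0, 1, 1, 2, 2],
    [0, 1, 1, 1, 2, 0],
]
summary = summarize_runs(y_true, runs_pred, num_classes=3)
print(summary["mean_accuracy"], summary["std_accuracy"])
print(summary["per_class_mean"], summary["per_class_std"])
```

Reporting the per-class arrays alongside the mean makes it visible when two runs with similar average accuracy disagree on which classes they handle well, which is the situation the abstract describes.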
