Visually Consistent Hierarchical Image Classification
Seulki Park, Youren Zhang, Stella X. Yu, Sara Beery, Jonathan Huang

TL;DR
This paper introduces a novel approach for hierarchical image classification that enforces internal visual consistency via intra-image segmentation, leading to improved accuracy and more reliable predictions without needing pixel-level annotations.
Contribution
It is the first method to align fine-to-coarse predictions through intra-image segmentation to ensure visual consistency in hierarchical classification.
Findings
Outperforms zero-shot CLIP and state-of-the-art baselines
Achieves higher accuracy in hierarchical classification
Improves internal image segmentation without pixel annotations
Abstract
Hierarchical classification predicts labels across multiple levels of a taxonomy, e.g., from coarse-level 'Bird' to mid-level 'Hummingbird' to fine-level 'Green hermit', allowing flexible recognition under varying visual conditions. It is commonly framed as multiple single-level tasks, but each level may rely on different visual cues: Distinguishing 'Bird' from 'Plant' relies on global features like feathers or leaves, while separating 'Anna's hummingbird' from 'Green hermit' requires local details such as head coloration. Prior methods improve accuracy using external semantic supervision, but such statistical learning criteria fail to ensure consistent visual grounding at test time, resulting in incorrect hierarchical classification. We propose, for the first time, to enforce internal visual consistency by aligning fine-to-coarse predictions through intra-image segmentation. Our method…
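As a minimal sketch of what label-level consistency across a taxonomy means, the snippet below checks whether per-level predictions form a valid coarse-to-fine path. The toy taxonomy reuses the class names from the abstract; the `PARENT` table and the helpers `path_to_root`, `is_consistent`, and `consistency_rate` are hypothetical names introduced for illustration. This is not the paper's method or its metrics, and it checks only label agreement, not the visual grounding the paper targets.

```python
from typing import Dict, List, Sequence

# Hypothetical toy taxonomy (child -> parent), using the abstract's example classes.
PARENT: Dict[str, str] = {
    "Green hermit": "Hummingbird",
    "Anna's hummingbird": "Hummingbird",
    "Hummingbird": "Bird",
}


def path_to_root(fine_label: str) -> List[str]:
    """Return the coarse-to-fine label path implied by a fine-level label."""
    path = [fine_label]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path[::-1]


def is_consistent(pred: Sequence[str]) -> bool:
    """Check that each predicted label (ordered coarse to fine) is a child of the previous one."""
    return all(PARENT.get(child) == parent for parent, child in zip(pred, pred[1:]))


def consistency_rate(preds: Sequence[Sequence[str]]) -> float:
    """Fraction of per-image predictions that form a valid taxonomy path."""
    return sum(is_consistent(p) for p in preds) / len(preds)


predictions = [
    ["Bird", "Hummingbird", "Green hermit"],   # valid path
    ["Plant", "Hummingbird", "Green hermit"],  # coarse level contradicts the finer levels
]
rate = consistency_rate(predictions)
```

Here the second prediction is flagged because 'Hummingbird' is not a child of 'Plant', even though its mid-level and fine-level labels agree with each other.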
Peer Reviews
Decision: ICLR 2025 Poster
(1) The model shows better results than prior works. (2) The examples given in the introduction help to explain and illustrate the reasoning behind the hierarchical focus. (3) The ablations confirm that the additional loss contributes to the performance.
(1) In L61, the difference in available labelling is presented as one of the motivations for hierarchical classification. However, the presented method assumes that all levels of the hierarchy are available. Can the technique work if the finest levels of supervision are not available? (2) Similarly, given the availability of the finest-level label, the other coarse levels in the tree are implied, so perhaps a more appropriate flat baseline would be a ViT that only predicts finest-level classes
1. The motivation for visual consistency is sound. 2. This work introduces new metrics for hierarchical classification, which can measure the coherence of hierarchical predictions. 3. The proposed method achieves SOTA on all datasets.
1. The key point of this paper, which is making classifiers at different levels attend to consistent visual cues, lacks support from quantitative results or theory. For example, are wrongly classified cases all associated with incorrect CAMs? What is the transfer rate after adopting the proposed model? Qualitative comparisons are subjective and lack statistical significance. In addition, Grad-CAM is an approximation and does not truly explain how the network operates. 2. This work relies heav
The paper is well written and organized. The authors have made a good effort to put the problem in the context of the state of the art. The problematic behavior of the state-of-the-art architecture is well illustrated in Figures 1 and 2. In general, the illustrations and explanations allow a non-expert to understand the specificity of hierarchical classification, the state-of-the-art choices, and the resulting problems. I also appreciated the care taken in explaining the metrics to understand wh
In part 4.4, I'm not sure how to interpret the result. We can assume that the result is a failure if the superpixel segmentation doesn't make sense. But to me, this just shows that superpixel segmentation is not necessarily adapted to a semantic problem, because it's done at the color level, and semantics in images are not necessarily associated with color. In Figures 1 and 4, I find that the space between class names (like "Chimpanzeeferret" in Figure 1) and their position/alignment in the tre
Taxonomy
Topics: Image Retrieval and Classification Techniques
Methods: Contrastive Language-Image Pre-training
