Visualizing and Describing Fine-grained Categories as Textures
Tsung-Yu Lin, Mikayla Timm, Chenyun Wu, Subhransu Maji

TL;DR
This paper explores how fine-grained visual categories can be characterized by their textures through visualization and automatic description, enhancing understanding of subtle differences in species classification.
Contribution
It introduces a method to visualize and describe categories in FGVC using texture-based deep networks and a new dataset for texture captioning.
Findings
Texture-based models highlight discriminative features.
Automatic texture descriptions provide language explanations.
Visualizations improve interpretability of fine-grained categories.
Abstract
We analyze how categories from recent FGVC challenges can be described by their textural content. The motivation is that subtle differences between species of birds or butterflies can often be described in terms of the texture associated with them and that several top-performing networks are inspired by texture-based representations. These representations are characterized by orderless pooling of second-order filter activations such as in bilinear CNNs and the winner of the iNaturalist 2018 challenge. Concretely, for each category we (i) visualize the "maximal images" by obtaining inputs x that maximize the probability of the particular class according to a texture-based deep network, and (ii) automatically describe the maximal images using a set of texture attributes. The models for texture captioning were trained on our ongoing efforts on collecting a dataset of describable textures…
University of Massachusetts, Amherst
{tsungyulin,mtimm,chenyun,smaji}@cs.umass.edu
We analyze how categories from recent FGVC challenges [4, 5] can be described by their textural content. The motivation is that subtle differences between species of birds or butterflies can often be described in terms of the texture associated with them and that several top-performing networks are inspired by texture-based representations. These representations are characterized by orderless pooling of second-order filter activations such as in bilinear CNNs [10] and the winner of the iNaturalist 2018 challenge [8].
Concretely, for each category we (i) visualize the “maximal images” by obtaining inputs x that maximize the probability of the particular class according to a texture-based deep network, and (ii) automatically describe the maximal images using a set of texture attributes. As the network, we use a multi-layer bilinear CNN as described in our prior work on visualizing deep texture representations [9]. The models for texture captioning were trained on a dataset of describable textures that we are currently collecting, building on the DTD dataset [6]. As seen in Figure 1, these visualizations indicate which aspects of the texture are most discriminative for each category, while the descriptions provide a language-based explanation of the same.
Visualizing categories as maximal textures.
We visualize the categories from the Caltech-UCSD birds [14], Oxford flowers [12], FGVC flowers [2], FGVC fungi [3], and FGVC butterflies and moths [1] datasets. Following the approach of [10], we extract the covariance matrix, followed by signed square-root and normalization, from the relu{2_2, 3_3, 4_3, 5_3} layers of the VGG-16 network [13] and train a softmax layer to predict class labels. We train the model on the standard training split for birds and Oxford flowers and randomly select 100 images from the 200 categories with the most images for FGVC fungi, flowers, and butterflies.
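The pooling step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `second_order_pool`, the `(H, W, C)` layout, and the `eps` constant are assumptions, and the sketch uses the uncentered average of outer products common in bilinear pooling (subtracting the per-channel mean first would give a centered covariance).

```python
import numpy as np

def second_order_pool(features, eps=1e-12):
    """Orderless second-order pooling of one convolutional feature map.

    features: array of shape (H, W, C), e.g. activations of one ReLU layer.
    Returns a descriptor of length C * C.
    """
    h, w, c = features.shape
    x = features.reshape(h * w, c)        # discard the spatial layout (orderless)
    g = x.T @ x / (h * w)                 # C x C average of outer products
    v = g.reshape(-1)
    v = np.sign(v) * np.sqrt(np.abs(v))   # signed square root
    return v / (np.linalg.norm(v) + eps)  # L2 normalization

# Hypothetical usage with random activations standing in for a VGG-16 layer.
rng = np.random.default_rng(0)
descriptor = second_order_pool(rng.standard_normal((4, 5, 3)))
```

Because the spatial positions are summed out, shuffling the locations of the feature map leaves the descriptor unchanged, which is what makes the representation texture-like.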
Let $C_i(x)$ be the predicted probability from layer $i$. Then the maximal inverse image for a target class $\hat{c}$ is obtained as:
$$\min_x \sum_i L\big(C_i(x), \hat{c}\big) + \gamma\, \Gamma(x).$$
Here $L$ is the softmax loss and $\Gamma(x)$ is the TV norm, weighted by $\gamma$, that acts as a smoothness prior. This technique was also used to visualize inverse images in [11]. Figure 1 shows the maximal images for three categories along with their texture attributes. Additional visualizations, selected arbitrarily across datasets, are shown in Figures 2 and 3. The maximal images indicate which discriminative texture properties are learned from the training images for classifying instances that often appear in clutter, with wide variations in pose and lighting, and under occlusion.
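To make the objective concrete, the sketch below evaluates a softmax loss summed over layers plus a total-variation penalty for a candidate image. It is an illustration under stated assumptions: `layer_logit_fns` is a hypothetical list of callables standing in for the per-layer classifiers, `tv_weight` is an arbitrary value, and the anisotropic (absolute-difference) TV variant is chosen for simplicity and may differ from the one used in the paper.

```python
import numpy as np

def tv_norm(img):
    """Anisotropic total variation: sum of absolute neighbour differences."""
    dh = np.abs(img[1:] - img[:-1]).sum()        # vertical neighbours
    dw = np.abs(img[:, 1:] - img[:, :-1]).sum()  # horizontal neighbours
    return dh + dw

def softmax_loss(logits, target):
    """Negative log-probability of the target class under a softmax."""
    z = logits - logits.max()                    # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

def objective(img, layer_logit_fns, target, tv_weight=1e-3):
    """Softmax loss summed over layers plus a weighted TV smoothness prior."""
    data_term = sum(softmax_loss(f(img), target) for f in layer_logit_fns)
    return data_term + tv_weight * tv_norm(img)
```

In practice this quantity would be minimized over the image with gradient-based optimization in an automatic-differentiation framework; the sketch only shows what is being minimized.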
Describing maximal textures.
In addition, we present preliminary experiments on describing these textures with attribute phrases, which provide a language-based explanation of the discriminative texture properties.
We collected a new dataset with natural language descriptions of texture details based on the Describable Textures Dataset (DTD) [6]. For each image from DTD, we asked five human annotators to provide several attribute phrases (e.g., “black and white dots” or “colorful patterns”). We trained linear classifiers on ResNet-101 [7] activations to predict the probability of each attribute phrase on our collected dataset. For each maximal texture image, the “phrase cloud” shows the top 20 attribute phrases, with the font size proportional to the predicted probability.
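The phrase-cloud construction can be sketched as follows. The names `phrase_probabilities` and `phrase_cloud`, the use of a sigmoid to turn linear scores into probabilities, and the `max_font` scaling are all assumptions made for illustration; the text only states that linear classifiers predict a probability per phrase and that font size is proportional to it.

```python
import math

def _sigmoid(score):
    """Numerically stable logistic function."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)

def phrase_probabilities(features, weights, biases):
    """Score each phrase with its own linear classifier on the image features."""
    probs = {}
    for phrase, w in weights.items():
        score = sum(wi * xi for wi, xi in zip(w, features)) + biases[phrase]
        probs[phrase] = _sigmoid(score)
    return probs

def phrase_cloud(probs, top_k=20, max_font=48.0):
    """Keep the top_k phrases; font size is proportional to probability."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    if not top:
        return []
    best = top[0][1]  # the most probable phrase is drawn at max_font
    return [(phrase, max_font * p / best) for phrase, p in top]
```

Here `features` would be the ResNet-101 activations of a maximal image, and `weights` and `biases` would come from the classifiers trained on the collected phrase annotations.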
References
- [1] FGVC Butterflies and Moths Dataset. https://sites.google.com/view/fgvc6/competitions/butterflies-moths-2019
- [2] FGVC Flowers Dataset. https://sites.google.com/view/fgvc5/competitions/fgvcx/flowers
- [3] FGVC Fungi Dataset. https://sites.google.com/view/fgvc5/competitions/fgvcx/fungi
- [4] The Fifth Fine-Grained Visual Categorization (FGVC) Workshop. https://sites.google.com/view/fgvc5
- [5] The Sixth Fine-Grained Visual Categorization (FGVC) Workshop. https://sites.google.com/view/fgvc6
- [6] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
- [7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
- [8] Peihua Li, Jiangtao Xie, Qilong Wang, and Zilin Gao. Towards faster training of global covariance pooling networks by iterative matrix square root normalization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
