Visualizing and Describing Fine-grained Categories as Textures

Tsung-Yu Lin; Mikayla Timm; Chenyun Wu; Subhransu Maji

arXiv:1907.05288 · cs.CV · July 12, 2019

TL;DR

This paper explores how fine-grained visual categories can be characterized by their textures through visualization and automatic description, enhancing understanding of subtle differences in species classification.

Contribution

It introduces a method to visualize and describe categories in FGVC using texture-based deep networks and a new dataset for texture captioning.

Findings

1. Texture-based models highlight discriminative features.
2. Automatic texture descriptions provide language-based explanations.
3. Visualizations improve the interpretability of fine-grained categories.

Full text

University of Massachusetts, Amherst

{tsungyulin,mtimm,chenyun,smaji}@cs.umass.edu

We analyze how categories from recent FGVC challenges [4, 5] can be described by their textural content. The motivation is twofold: subtle differences between species of birds or butterflies can often be described in terms of their textures, and several top-performing networks are inspired by texture-based representations. These representations are characterized by orderless pooling of second-order filter activations, as in bilinear CNNs [10] and the winner of the iNaturalist 2018 challenge [8].

Concretely, for each category we (i) visualize the “maximal images” by finding inputs $\mathbf{x}$ that maximize the probability of that class according to a texture-based deep network $C_{\theta}(\mathbf{x})$, and (ii) automatically describe the maximal images using a set of texture attributes. We take $C_{\theta}$ to be a multi-layer bilinear CNN, as described in our prior work on visualizing deep texture representations [9]. The texture-captioning models were trained on a dataset of describable textures that we are collecting as an extension of the DTD dataset [6]. As seen in Figure 1, these visualizations indicate which aspects of the texture are most discriminative for each category, while the descriptions provide a language-based explanation of the same.

Visualizing categories as maximal textures.

We visualize the categories from the Caltech-UCSD Birds [14], Oxford Flowers [12], FGVC Flowers [2], FGVC Fungi [3], and FGVC Butterflies and Moths [1] datasets. Following the approach of [10], we extract the covariance matrix of the relu2_2, relu3_3, relu4_3, and relu5_3 activations of the VGG-16 network [13], apply signed square-root and $\ell_{2}$ normalization, and train a softmax layer to predict class labels. We train the model on the standard training split for birds and Oxford flowers; for FGVC fungi, flowers, and butterflies, we randomly select 100 images from each of the 200 categories with the most images.
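
A minimal NumPy sketch of the multi-layer bilinear feature described above, assuming each layer's activations arrive as an (H, W, C) array. It pools the average outer product of the per-location feature vectors (uncentered, as in the original bilinear CNN; subtracting the mean would give a true covariance), applies the signed square root and $\ell_{2}$ normalization, concatenates the layers, and feeds the result to a linear softmax classifier. Normalizing each layer separately before concatenation, the random toy activations, and the random weights are illustrative assumptions, not the authors' settings; in practice the activations would come from relu2_2 through relu5_3 of VGG-16.

```python
import numpy as np


def bilinear_pool(feat):
    """Orderless second-order pooling of one layer's activations.

    feat: (H, W, C) activation map. Returns a C*C vector: the average outer
    product of the C-dim features over all spatial locations, followed by
    signed square root and l2 normalization.
    """
    h, w, c = feat.shape
    f = feat.reshape(h * w, c)              # one row per spatial location
    gram = f.T @ f / (h * w)                # (C, C) uncentered second moment
    v = gram.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))     # signed square root
    return v / (np.linalg.norm(v) + 1e-12)  # l2 normalization


def multilayer_bilinear_features(layer_feats):
    """Concatenate the normalized bilinear features of several layers."""
    return np.concatenate([bilinear_pool(f) for f in layer_feats])


def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


# Toy usage: random non-negative "relu activations" stand in for VGG-16 layers.
rng = np.random.default_rng(0)
layers = [np.maximum(rng.standard_normal((14, 14, 8)), 0),
          np.maximum(rng.standard_normal((7, 7, 16)), 0)]
phi = multilayer_bilinear_features(layers)         # 8*8 + 16*16 = 320 dims
W = rng.standard_normal((5, phi.size)) * 0.01      # hypothetical 5-class head
probs = softmax(W @ phi)
```

Because the pooling averages over spatial locations, shuffling those locations leaves the feature unchanged, which is what makes the representation orderless. The feature dimension grows quadratically in the number of channels C of each layer.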

Let $C_{i}$ be the class probabilities predicted from the features of layer $i$. The maximal inverse image for a target class $\hat{C}$ is then obtained as $\min_{\mathbf{x}}\sum_{i=1}^{m}L\left(C_{i},\hat{C}\right)+\gamma\Gamma(\mathbf{x})$, where the sum runs over the $m$ layers, $L$ is the softmax loss, $\Gamma(\mathbf{x})$ is the TV norm, which acts as a smoothness prior, and $\gamma$ weights the prior against the classification losses. This technique was also used to visualize inverse images in [11]. Figure 1 shows the maximal images for three categories along with their texture attributes, and Figures 2 and 3 show additional visualizations selected arbitrarily across datasets. The maximal images indicate which discriminative texture properties the model learns from training images in order to classify instances that often appear in clutter, under wide variations in pose and lighting, and under occlusion.
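
The objective above can be illustrated with a small, self-contained NumPy sketch. To keep every gradient derivable by hand, each head $C_{i}$ is a hypothetical linear softmax classifier on the raw pixels of a single-channel image, not a classifier on bilinear features of VGG-16 layers. $\Gamma$ is assumed to be the squared-difference form of total variation, which is differentiable everywhere; the paper does not state which TV variant it uses. Plain gradient descent then minimizes the summed softmax losses plus $\gamma\Gamma(\mathbf{x})$. With a real network, the gradient with respect to $\mathbf{x}$ would come from automatic differentiation instead.

```python
import numpy as np


def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def tv_squared(x):
    """Squared-difference total variation of a 2-D image, and its gradient."""
    dv = x[1:, :] - x[:-1, :]  # vertical neighbour differences
    dh = x[:, 1:] - x[:, :-1]  # horizontal neighbour differences
    value = (dv ** 2).sum() + (dh ** 2).sum()
    grad = np.zeros_like(x)
    grad[1:, :] += 2 * dv
    grad[:-1, :] -= 2 * dv
    grad[:, 1:] += 2 * dh
    grad[:, :-1] -= 2 * dh
    return value, grad


def softmax_loss(W, x, target):
    """Softmax (cross-entropy) loss of a linear head and its gradient w.r.t. x."""
    p = softmax(W @ x.ravel())
    onehot = np.zeros_like(p)
    onehot[target] = 1.0
    loss = -np.log(p[target])
    grad = (W.T @ (p - onehot)).reshape(x.shape)
    return loss, grad


def maximal_image(heads, target, shape, gamma=0.1, lr=0.1, steps=200, seed=0):
    """Gradient descent on sum_i L(C_i(x), target) + gamma * TV(x)."""
    x = np.random.default_rng(seed).standard_normal(shape) * 0.01
    history = []
    for _ in range(steps):
        total, grad = 0.0, np.zeros_like(x)
        for W in heads:  # one loss term per head, mirroring the sum over layers
            loss, g = softmax_loss(W, x, target)
            total += loss
            grad += g
        tv, tv_grad = tv_squared(x)
        total += gamma * tv
        grad += gamma * tv_grad
        history.append(total)  # objective value before this update
        x = x - lr * grad
    return x, history


# Toy usage: two hypothetical heads over an 8x8 image with three classes.
rng = np.random.default_rng(0)
heads = [rng.standard_normal((3, 64)) * 0.1 for _ in range(2)]
x_star, history = maximal_image(heads, target=1, shape=(8, 8))
```

Because this toy objective is convex and smooth, a small enough step size makes the recorded objective decrease at every step. With a deep network the objective is non-convex, so the result also depends on the initialization and on $\gamma$.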

Describing maximal textures.

We also report preliminary experiments on describing these textures with attribute phrases, which provide a language-based explanation of the discriminative texture properties.

We collected a new dataset of natural-language descriptions of texture details based on the Describable Textures Dataset (DTD) [6]. For each DTD image, we asked five human annotators to provide several attribute phrases (e.g., “black and white dots” or “colorful patterns”). We trained linear classifiers on ResNet-101 [7] activations to predict the probability of each attribute phrase on the collected dataset. For each maximal texture image, the “phrase cloud” shows the top 20 attribute phrases, with font size proportional to the predicted probability.
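
The phrase cloud can be sketched as follows, assuming that the linear classifiers are independent one-vs-rest logistic regressions (one per attribute phrase) trained by gradient descent on precomputed image features. The phrase vocabulary, the synthetic features and labels, and the 48-point maximum font size are hypothetical placeholders; in the paper, the features are ResNet-101 activations and the labels come from the annotated DTD phrases. Font sizes are scaled to be proportional to the predicted probability, with the most likely phrase at the maximum size.

```python
import numpy as np


def train_one_vs_rest(X, Y, lr=0.5, epochs=300):
    """Fit one logistic-regression classifier per attribute phrase.

    X: (n, d) image features. Y: (n, k) binary matrix with Y[i, j] = 1 if
    phrase j was used for image i. Returns weights (d, k) and biases (k,).
    """
    n, d = X.shape
    W = np.zeros((d, Y.shape[1]))
    b = np.zeros(Y.shape[1])
    for _ in range(epochs):
        P = 1.0 / (1.0 + np.exp(-(X @ W + b)))  # independent sigmoid per phrase
        G = (P - Y) / n                          # gradient of mean logistic loss
        W -= lr * X.T @ G
        b -= lr * G.sum(axis=0)
    return W, b


def phrase_cloud(x, W, b, phrases, top_k=20, max_font=48.0):
    """Return (phrase, probability, font size) for the top_k phrases of x.

    Font size is proportional to the predicted probability, scaled so that
    the most likely phrase is drawn at max_font points.
    """
    probs = 1.0 / (1.0 + np.exp(-(x @ W + b)))
    order = np.argsort(-probs)[:top_k]          # most probable phrases first
    scale = max_font / probs[order[0]]
    return [(phrases[j], float(probs[j]), float(scale * probs[j])) for j in order]


# Toy usage with synthetic features and labels (hypothetical vocabulary).
rng = np.random.default_rng(0)
phrases = [f"phrase_{j}" for j in range(30)]
X = rng.standard_normal((200, 16))
Y = (X @ rng.standard_normal((16, 30)) > 0).astype(float)
W, b = train_one_vs_rest(X, Y)
cloud = phrase_cloud(X[0], W, b, phrases)
```

Scoring each phrase with its own sigmoid lets several phrases be likely at once, which suits descriptions where one texture can be, for example, both "dotted" and "colorful".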

Bibliography

[1] FGVC Butterflies and Moths Dataset. https://sites.google.com/view/fgvc6/competitions/butterflies-moths-2019
[2] FGVC Flowers Dataset. https://sites.google.com/view/fgvc5/competitions/fgvcx/flowers
[3] FGVC Fungi Dataset. https://sites.google.com/view/fgvc5/competitions/fgvcx/fungi
[4] The Fifth Fine-Grained Visual Categorization (FGVC) Workshop. https://sites.google.com/view/fgvc5
[5] The Sixth Fine-Grained Visual Categorization (FGVC) Workshop. https://sites.google.com/view/fgvc6
[6] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[8] Peihua Li, Jiangtao Xie, Qilong Wang, and Zilin Gao. Towards faster training of global covariance pooling networks by iterative matrix square root normalization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.