Information-Bottleneck Approach to Salient Region Discovery

Andrey Zhmoginov; Ian Fischer; Mark Sandler

arXiv:1907.09578·cs.CV·February 18, 2020

TL;DR

This paper introduces a semi-supervised method for learning Boolean attention masks in images using the Information Bottleneck principle, effectively highlighting class-defining features in synthetic and real datasets.

Contribution

It presents a novel Boolean mask-based attention model guided by the Information Bottleneck, differing from continuous mask approaches.

Findings

01 Successfully attends to class-defining features

02 Works on synthetic and real datasets

03 Produces Boolean masks that conceal irrelevant information

Abstract

We propose a new method for learning image attention masks in a semi-supervised setting based on the Information Bottleneck principle. Provided with a set of labeled images, the mask generation model is minimizing mutual information between the input and the masked image while maximizing the mutual information between the same masked image and the image label. In contrast with other approaches, our attention model produces a Boolean rather than a continuous mask, entirely concealing the information in masked-out pixels. Using a set of synthetic datasets based on MNIST and CIFAR10 and the SVHN datasets, we demonstrate that our method can successfully attend to features known to define the image class.

Equations

$$\min_{\zeta}Q_{\beta}\equiv\min_{\zeta}\left[\beta\,\mathbb{I}(I\odot M;I)-\mathbb{I}(I\odot M;C)\right],$$

$$\min_{\zeta}Q_{\beta}\equiv\min_{\zeta}\left[\beta\,\mathbb{I}(I\odot M;I)-\mathbb{I}(I\odot M^{\prime};C)\right].$$

$$Q_{\beta}=\beta\,\mathbb{H}(I\odot M)-\beta\,\mathbb{H}(I\odot M|I)-\mathbb{H}(C)+\mathbb{H}(C|I\odot M^{\prime}).$$

$$\min_{\zeta,\theta,\psi}\left[\mathbb{E}_{p(i,c)p_{\zeta}(m|i)}\left(-\beta\log g_{\theta}(i\odot m)-\log h_{\psi}(c|i\odot m^{\prime})\right)-\beta\,\mathbb{H}(I\odot M|I)\right],$$

$$(i\odot m)_{x,y}\equiv\begin{cases}(i_{x,y},1)&\text{if }m_{x,y}=1,\\(0,0)&\text{if }m_{x,y}=0.\end{cases}$$

$$-\sum_{x,y=1}^{n}\left[\rho_{x,y}\log\rho_{x,y}+(1-\rho_{x,y})\log(1-\rho_{x,y})\right].$$

$$-\mathbb{E}_{z\sim q_{\phi}(z|i\odot m)}\left[\log g_{\theta}(i\odot m|z)\right]+D_{KL}\left[q_{\phi}(z|i\odot m)\,\|\,p(z)\right],$$

$$-\log g_{\theta}(i\odot m|z)=\sum_{x,y=1}^{n}\left\{-(1-m_{x,y})\log(1-\hat{\rho}_{x,y})-m_{x,y}\left[\log\hat{\rho}_{x,y}-\ell_{2}(i_{x,y},\hat{i}_{x,y})\right]\right\}+C,$$

$$-\sum_{x,y=1}^{n}\left[m_{x,y}\log\rho_{x,y}+(1-m_{x,y})\log(1-\rho_{x,y})\right]$$

$$\operatorname*{arg\,min}_{\zeta}\left[\beta\,\mathbb{I}(I\odot M;I|M)-\mathbb{I}(I\odot M;C|M)\right],$$

$$\operatorname*{arg\,min}_{\zeta}\left[\beta\,\mathbb{I}(I\odot M;I|M)-\mathbb{I}(I\odot M;C^{\prime}|M)\right].$$

$$\operatorname*{arg\,min}_{\zeta}\left[\beta\,\mathbb{H}(I\odot M|M)+\mathbb{H}(M)-\mathbb{H}(M|I)+\mathbb{H}(I|M,C^{\prime})+\mathbb{H}(C^{\prime}|I\odot M)\right].$$

$$\min_{\zeta}Q^{\prime}_{\beta^{\prime}}\equiv\min_{\zeta}\left[\beta^{\prime}\,\mathbb{I}(I\odot M;I|C)-\mathbb{I}(I\odot M^{\prime};C)\right].$$

$$Q^{\prime}_{\beta^{\prime}}=\beta^{\prime}\left[\mathbb{H}(I\odot M|C)-\mathbb{H}(I\odot M|I)\right]-\mathbb{H}(C)+\mathbb{H}(C|I\odot M^{\prime}).$$

$$Q^{\prime}_{\beta^{\prime}}=\beta^{\prime}\left[\mathbb{H}(I\odot M)-\mathbb{H}(I\odot M|I)\right]+\mathbb{H}(C|I\odot M^{\prime})+\beta^{\prime}\,\mathbb{H}(C|I\odot M)+\nu.$$

$$Q^{\prime}_{\beta^{\prime}}=(1+\beta^{\prime})\,Q_{\beta^{\prime}/(1+\beta^{\prime})}+\nu^{\prime}.$$

$$\min_{\zeta,\theta,\psi}\left[\mathbb{E}_{p(i,c)p_{\zeta}(m|i)}\left(-\beta^{\prime}\log g_{\theta}(i\odot m|c)-\log h_{\psi}(c|i\odot m^{\prime})\right)-\beta^{\prime}\,\mathbb{H}(I\odot M|I)\right].$$

$$\mathbb{I}(I\odot M;I|M)=\mathbb{H}(I\odot M|M)-\mathbb{H}(I\odot M|I,M)=\mathbb{H}(I\odot M|M).$$

$$\mathbb{H}(M|I)+\mathbb{H}(I)=\mathbb{H}(I|M,C^{\prime})+\mathbb{H}(M,C^{\prime})$$

$$\mathbb{I}(C^{\prime};I\odot M|M)=\mathbb{H}(C^{\prime}|M)-\mathbb{H}(C^{\prime}|I\odot M,M)=\mathbb{H}(C^{\prime},M)-\mathbb{H}(M)-\mathbb{H}(C^{\prime}|I\odot M)$$

$$\mathbb{H}(M|I)+\mathbb{H}(I)-\mathbb{H}(I|M,C^{\prime})-\mathbb{H}(M)-\mathbb{H}(C^{\prime}|I\odot M).$$

$$\beta\,\mathbb{I}(I\odot M;I|M)-\mathbb{I}(C^{\prime};I\odot M|M)=\beta\,\mathbb{H}(I\odot M|M)-\mathbb{H}(M|I)-\mathbb{H}(I)+\mathbb{H}(I|M,C^{\prime})+\mathbb{H}(M)+\mathbb{H}(C^{\prime}|I\odot M),$$


Full text

Machine Learning, ICML

1 Introduction

Information processing in deep neural networks is carried out in multiple stages and the data-processing inequality implies that the information content of the input signal decays as it undergoes consecutive transformations. Even though this applies to both information that is relevant and irrelevant for the task at hand, in a well-trained model, most of the useful information in the signal will be preserved up to the network output. However, standard objectives, such as the cross-entropy loss, do not constrain the irrelevant information that is retained in the output.

The Information Bottleneck (IB) framework (Tishby et al., 2000; Tishby & Zaslavsky, 2015) constrains the information content retained at the output by trading off between prediction and compression: $IB\equiv\min\beta\mathbb{I}(X;Z)-\mathbb{I}(Y;Z)$ , where $X$ is the input, $Y$ is the target output, and $Z$ is the learned representation. This framework has been applied to numerous deep learning tasks including a search of compressed input representations (Alemi et al., 2017; Hjelm et al., 2018; Moyer et al., 2018), image segmentation (Bardera et al., 2009), data clustering (Strouse & Schwab, 2019; Still et al., 2003), generalized dropout (Achille & Soatto, 2018), Generative Adversarial Networks (Peng et al., 2018) and others.
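To make the trade-off in the IB objective concrete, the following minimal sketch evaluates $\beta\mathbb{I}(X;Z)-\mathbb{I}(Y;Z)$ for small discrete joint probability tables. The function names and the toy distributions are illustrative assumptions and are not taken from the paper.

```python
import math


def mutual_information(joint):
    """Mutual information I(A;B) in nats for a joint table joint[a][b]."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for a, row in enumerate(joint):
        for b, p in enumerate(row):
            if p > 0.0:  # zero-probability cells contribute nothing
                mi += p * math.log(p / (p_a[a] * p_b[b]))
    return mi


def ib_objective(joint_xz, joint_yz, beta):
    """Value of beta * I(X;Z) - I(Y;Z) for the given joint tables."""
    return beta * mutual_information(joint_xz) - mutual_information(joint_yz)


# Toy setting (hypothetical): X is uniform over 4 values and Y = X mod 2.
# Representation 1 keeps everything: Z = X.
xz_full = [[0.25 if x == z else 0.0 for z in range(4)] for x in range(4)]
yz_full = [[0.25 if z % 2 == y else 0.0 for z in range(4)] for y in range(2)]
# Representation 2 keeps only the label-relevant bit: Z = X mod 2.
xz_small = [[0.25 if x % 2 == z else 0.0 for z in range(2)] for x in range(4)]
yz_small = [[0.5 if y == z else 0.0 for z in range(2)] for y in range(2)]
```

In this toy setting both representations carry the same information about $Y$, but the second carries less information about $X$, so it attains a lower objective value for any $\beta>0$.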

In this paper, we use the IB approach to generate self-attention maps for image classification models, directing model attention away from distracting features and towards features that define the image label. The method is based on the observation that the information content of the image region that we want to “attend to” should ideally be minimized while still being descriptive of the image class.

The proposed technique can be thought of as a form of semi-supervised attention learning. The entire model, consisting of the mask generator and the classifier operating on the masked regions, can also be viewed as a step towards “explainable models”, which not only make predictions, but also assign importance to particular input components. This technique could potentially be useful for datasets that cannot be easily annotated by experts, such as medical image datasets where labels are known, but annotations of the particular cause of the label in the input are difficult to collect.

The paper is structured as follows. Section 2 describes prior work and relates our approach to other existing methods. In Section 3 we outline theoretical foundations of our method and the experimental results are summarized in Section 4. Section 5 discusses an alternative IB-based approach and finally, Section 6 summarizes our conclusions.

2 Prior Work

Semi-supervised image segmentation is the task of learning to identify object boundaries without access to groundtruth boundary information. Object detection and reconstruction of object shape in the context of this task is frequently achieved based on the knowledge of image labels alone (Hou et al., 2018; Wei et al., 2017; Zhang et al., 2018; Li et al., 2018; Kolesnikov & Lampert, 2016). A successfully trained model would thus effectively “know” which parts of the input image carry information defining the image class and which parts are irrelevant. Most of these methods use, in one way or another, a signal supplied by the classification model with a partially occluded input. By changing the attention mask and probing classifier performance, it is possible to identify “salient” regions, as well as those regions that are not predictive of the object present in the image.

Most semi-supervised semantic segmentation approaches, including those mentioned above, tend to rely on hand-designed optimization objectives and supplementary techniques that are carefully tuned to work well in specialized domains. In contrast, more general frameworks, like those based on information theory, could provide a more elegant and universal alternative. In recent years, the Information Bottleneck method has been applied to generating instance-based input attention maps. Most notably, an information-theoretic generalization of dropout called Information Dropout (Achille & Soatto, 2018), based on element-wise tensor masking, was shown to successfully generate representations insensitive to nuisance factors present in the model input. Another approach called InfoMask, recently proposed by Taghanaki et al. (2019) independently of our work, applies an IB-inspired approach to generating continuous attention masks for the image classification task. The authors demonstrated superior performance of InfoMask on the Chest Disease localization task compared to multiple other existing methods.

In this work, we propose an alternative approach using the Information Bottleneck optimization objective. In contrast to the two approaches described above, we target the information content of the masked image and we do not multiply image pixels by a floating-point continuous mask, but instead use Boolean masks, thus completely preventing masked-out pixels from propagating any information to the model output.

3 Model

Consider a conventional image classification task trained on samples drawn from a joint distribution $p(I,C)$ with the random variable $I$ corresponding to images and $C$ being image classes. Let us tackle a complementary task of learning a self-attention model that, given an image $i$, produces a Boolean mask $m_{\zeta}(i)$ such that the masked image $i\odot m_{\zeta}(i)$ satisfies the following two properties: (a) it captures as little information from the original image as possible, but (b) it contains enough information about the contents of the image for the model to predict the image class.

Using the language of information theory these two conditions can be satisfied by writing a single optimization objective:

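$$\min_{\zeta}Q_{\beta}\equiv\min_{\zeta}\left[\beta\,\mathbb{I}(I\odot M;I)-\mathbb{I}(I\odot M;C)\right], \tag{1}$$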

where $\beta$ is a constant and $M$ is a random mask variable governed by some learnable conditional distribution $p_{\zeta}(m|i)$. Written in the form of Eq. (1), our task can be seen as a reformulation of the Information Bottleneck principle (Tishby et al., 2000). Alternative optimization objectives based on, for example, the Deterministic Information Bottleneck (Strouse & Schwab, 2017) could also be of interest, but fall outside the scope of this paper. Another optimization objective based on the Conditional Entropy Bottleneck (Fischer, 2018) is discussed in Appendix A.

Notice that Equation (1) has one significant limitation: it allows the masking model to deduce the class from the image and encode this class in the mask itself (assuming the masking model is sufficiently complex and has a receptive field covering the entire image). Consider a binary classification task. If the image belongs to the first class, the generated mask can be empty. On the other hand, for images belonging to the second class, the mask can be chosen to be just a single pixel or a few pixels taken from the “low entropy” part of the image, thus both minimizing $\mathbb{I}(I\odot M;I)$ and maximizing $\mathbb{I}(I\odot M;C)$. For this choice of mask, the classifier $f_{\psi}$ can predict the label just from the mask itself.

This unwanted behavior can be avoided in practice by choosing mask models with a finite receptive field that is comparable to the size of the feature distinguishing one class from another. A more general approach has to rely on special properties of the mask $m$. One such defining property is that $\mathbb{I}(I\odot M;C)\leq\mathbb{I}(I\odot M^{\prime};C)$ for Boolean masks $M^{\prime}$ “larger” than $M$ in the sense that $m^{\prime}_{x,y}=0$ implies that $m_{x,y}=0$. We can define $M^{\prime}$ by, for example, specifying $p(m^{\prime}_{x,y}|m_{x,y},x,y)$ and restricting it via $p(m^{\prime}_{x,y}=0|m_{x,y}=1)=0$. Defined like this, our optimization objective (1) can be rewritten as:

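$$\min_{\zeta}Q_{\beta}\equiv\min_{\zeta}\left[\beta\,\mathbb{I}(I\odot M;I)-\mathbb{I}(I\odot M^{\prime};C)\right]. \tag{2}$$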

A preliminary exploration of the effect that the mask randomization technique has on attention regions is presented in Section 4.3.
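One simple way to obtain a “larger” mask $M^{\prime}$ is to add a randomly placed transparent rectangle to $M$, which is the randomization used in the experiments of Section 4. The sketch below is a minimal illustration under that assumption; the function name `randomize_mask` and the `max_size` parameter are hypothetical and not taken from the paper. Masks are nested lists with 1 marking a visible pixel.

```python
import random


def randomize_mask(mask, rng, max_size=4):
    """Return m' equal to m plus one randomly placed transparent rectangle.

    Every pixel visible in `mask` stays visible, so m'[y][x] == 0 implies
    mask[y][x] == 0, which is the "larger mask" condition.
    """
    n_rows, n_cols = len(mask), len(mask[0])
    height = rng.randint(1, min(max_size, n_rows))
    width = rng.randint(1, min(max_size, n_cols))
    top = rng.randint(0, n_rows - height)
    left = rng.randint(0, n_cols - width)
    out = [row[:] for row in mask]  # copy so the input mask is not modified
    for y in range(top, top + height):
        for x in range(left, left + width):
            out[y][x] = 1  # make the rectangle fully visible
    return out


rng = random.Random(0)
m = [[0] * 8 for _ in range(8)]
m[2][3] = 1
m_prime = randomize_mask(m, rng)
```

Because the rectangle only reveals pixels, the classifier can no longer rely on the exact shape of the opaque region as a code for the label.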

3.1 Variational Upper Bound

Expanding the expressions for the mutual information in Eq. (2), we obtain:

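$$Q_{\beta}=\beta\,\mathbb{H}(I\odot M)-\beta\,\mathbb{H}(I\odot M|I)-\mathbb{H}(C)+\mathbb{H}(C|I\odot M^{\prime}). \tag{3}$$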

Entropies of the form $\mathbb{H}(A)$ permit variational upper bounds of the form $-\mathbb{E}_{a}\log p_{\phi}(a)$ with $p_{\phi}(a)$ taken from an arbitrary family of distribution functions, and similarly for conditional entropies $\mathbb{H}(A|B)$ . This allows us to formulate the variational optimization objective as (Alemi et al., 2017):

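$$\min_{\zeta,\theta,\psi}\left[\mathbb{E}_{p(i,c)p_{\zeta}(m|i)}\left(-\beta\log g_{\theta}(i\odot m)-\log h_{\psi}(c|i\odot m^{\prime})\right)-\beta\,\mathbb{H}(I\odot M|I)\right], \tag{4}$$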

where $g_{\theta}$ and $h_{\psi}$ are variational approximations of $p(i\odot m)$ and $p(c|i\odot m^{\prime})$ correspondingly. Below, we compute $\mathbb{H}(I\odot M|I)$ explicitly for our choice of mask model.
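As a brief reminder of why such bounds hold (a standard identity, stated here for convenience rather than taken from the paper), the cross-entropy under any model density $p_{\phi}$ exceeds the entropy by a KL divergence, which is non-negative:

$$-\mathbb{E}_{p(a)}\log p_{\phi}(a)=\mathbb{H}(A)+D_{KL}\left[p(a)\,\|\,p_{\phi}(a)\right]\geq\mathbb{H}(A),$$

with equality when $p_{\phi}=p$. Applying the same argument to a conditional model gives $-\mathbb{E}_{p(a,b)}\log p_{\phi}(a|b)\geq\mathbb{H}(A|B)$, which is the form used for the classifier term $h_{\psi}$.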

3.2 Mask and Masked Image

Let $\rho_{\zeta}:X\to\mathbb{R}^{n\times n}$ be the “masking probability” model parameterized by $\zeta$ (we will frequently omit $\zeta$ for brevity). Each $\rho_{x,y}(i)$ for $1\leq x,y\leq n$ is assumed to satisfy $0\leq\rho_{x,y}(i)\leq 1$. We introduce a discrete mask $m\sim\textrm{Bernoulli}(\rho)$ sampled according to $\rho$ independently for each pixel. The masked image $i\odot m$ can then be defined as follows:

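$$(i\odot m)_{x,y}\equiv\begin{cases}(i_{x,y},1)&\text{if }m_{x,y}=1,\\(0,0)&\text{if }m_{x,y}=0.\end{cases} \tag{5}$$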

Given this definition, the entropy $\mathbb{H}(I\odot M|I)$ can be expressed as:

$$-\sum_{x,y=1}^{n}\left[\rho_{x,y}\log\rho_{x,y}+(1-\rho_{x,y})\log(1-\rho_{x,y})\right]. \tag{6}$$

It is worth noticing here that the mask $\rho_{\zeta}(i)$ can be interpreted as an adaptive “continuous” downsampling of the image. Low values of $\rho$ cause most, but not all image pixels to be removed; the remaining pixels and the mere fact that the mask chose to partially remove them can still provide enough information to the image classification model.
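The definitions in this subsection can be sketched in a few lines of plain Python: sampling a Boolean mask from per-pixel probabilities, forming the masked image as (value, mask bit) pairs, and evaluating the per-image Bernoulli entropy $-\sum_{x,y}[\rho_{x,y}\log\rho_{x,y}+(1-\rho_{x,y})\log(1-\rho_{x,y})]$. The function names are illustrative assumptions, and single-channel images stored as nested lists are assumed for simplicity.

```python
import math
import random


def sample_mask(rho, rng):
    """Sample m ~ Bernoulli(rho) independently for each pixel."""
    return [[1 if rng.random() < p else 0 for p in row] for row in rho]


def apply_mask(image, mask):
    """Masked image: (pixel, 1) where m = 1 and (0, 0) where m = 0."""
    return [
        [(px, 1) if m == 1 else (0, 0) for px, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


def mask_entropy(rho):
    """Entropy (in nats) of the mask given the image, summed over pixels."""
    def plogp(p):
        return p * math.log(p) if p > 0.0 else 0.0  # limit of p log p at 0
    return -sum(plogp(p) + plogp(1.0 - p) for row in rho for p in row)


rho = [[0.5, 0.5], [0.0, 1.0]]
mask = sample_mask(rho, random.Random(0))
masked = apply_mask([[0.3, 0.7], [0.2, 0.9]], mask)
```

Only the two pixels with $\rho=0.5$ contribute to the entropy here; pixels with $\rho$ equal to 0 or 1 are deterministic and contribute nothing.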

3.3 Loss Function

Having the expression for the last term in Eq. (4), we will now provide specific models for the first two.

Let us start with $-\log h_{\psi}(c|i\odot m^{\prime})$. Consider a family of deep neural network models $f_{\psi}$ mapping masked images $i\odot m^{\prime}$ to $\mathbb{R}^{|c|}$. We can define $h_{\psi}(i\odot m^{\prime})$ to be $\textrm{softmax}\,(f_{\psi}(i\odot m^{\prime}))$, allowing us to rewrite $-\log h_{\psi}(c|i\odot m^{\prime})$ as a cross-entropy loss with respect to $\textrm{softmax}\,(f_{\psi}(i\odot m^{\prime}))$. Since the mask $m$ is sampled from $\textrm{Bernoulli}(\rho_{\zeta})$, we cannot simply back-propagate gradients all the way down to the parameters of the model $\rho_{\zeta}(i)$. We alleviate this problem by using the Gumbel-softmax reparametrization approach (Jang et al., 2016; Maddison et al., 2016), thus approximating $m(i)$ with a differentiable function. The Gumbel temperature should be chosen with care: very low temperatures lead to high-variance gradient estimators, while high temperatures introduce bias.

Now let us consider the first term in Eq. (4). Since the space of masked images $i\odot m$ is generally very high-dimensional, we adopt the variational autoencoder approach (Kingma & Welling, 2014), considering a space of marginal distribution functions $g_{\theta}(i\odot m)=\int g_{\theta}(i\odot m|z)\,p(z)\,dz$ with $p(z)$ being a tractable prior distribution for the latent variable space $Z$. Following Kingma & Welling (2014), $-\log g_{\theta}(i\odot m)$ can be upper bounded by:
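For a single Bernoulli pixel, the Gumbel-softmax relaxation reduces to the binary Concrete form: logistic noise is added to the logit of $\rho$ and the result is passed through a sigmoid scaled by the temperature. The sketch below illustrates this in plain Python without automatic differentiation; the function names and the clipping constant `eps` are assumptions made for numerical safety and are not taken from the paper.

```python
import math
import random


def stable_sigmoid(x):
    """Sigmoid that avoids overflow for large |x|."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def relaxed_bernoulli(rho, temperature, rng, eps=1e-6):
    """Differentiable surrogate for m ~ Bernoulli(rho), valued in [0, 1]."""
    rho = min(max(rho, eps), 1.0 - eps)
    u = min(max(rng.random(), eps), 1.0 - eps)
    logistic_noise = math.log(u) - math.log(1.0 - u)
    logit = math.log(rho) - math.log(1.0 - rho)
    return stable_sigmoid((logit + logistic_noise) / temperature)


rng = random.Random(0)
samples = [relaxed_bernoulli(0.3, 0.5, rng) for _ in range(20000)]
# Thresholding the relaxed sample at 0.5 recovers a Bernoulli(rho) draw.
frac_on = sum(s > 0.5 for s in samples) / len(samples)
```

Lower temperatures push the relaxed samples closer to 0 or 1, at the cost of noisier gradients in an actual training setup.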

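$$-\mathbb{E}_{z\sim q_{\phi}(z|i\odot m)}\left[\log g_{\theta}(i\odot m|z)\right]+D_{KL}\left[q_{\phi}(z|i\odot m)\,\|\,p(z)\right],$$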

where $q_{\phi}$ is a variational approximation of $g_{\theta}(z|i\odot m)$. The encoder $q_{\phi}$ in our model receives both the input pixels $i_{x,y}$ (or $0$ if $m_{x,y}=0$) and the mask $m_{x,y}$ as its inputs and produces a conventional embedding $z\in\mathbb{R}^{d}$. The decoder $g_{\theta}$, in turn, maps $z$ back to $\hat{\rho}$ and $\hat{i}$. In our model, we define $g_{\theta}(i\odot m|z)$ as the probability for the mask to be sampled from a Bernoulli process with probability $\hat{\rho}$ and for the visible image pixels to be sampled from a Gaussian random variable with mean $\hat{i}$ and a constant covariance matrix. This allows us to rewrite $-\log g_{\theta}(i\odot m|z)$ as:

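$$-\log g_{\theta}(i\odot m|z)=\sum_{x,y=1}^{n}\left\{-(1-m_{x,y})\log(1-\hat{\rho}_{x,y})-m_{x,y}\left[\log\hat{\rho}_{x,y}-\ell_{2}(i_{x,y},\hat{i}_{x,y})\right]\right\}+C, \tag{7}$$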

where $\ell_{2}(i,\hat{i})={(i-\hat{i})^{2}}/{2\sigma^{2}}$ and $\sigma$ , $C$ are constants. Given this choice, $\beta$ becomes an overall multiplier of the VAE objective in the full loss and $\sigma$ defines a weight of the image pixel reconstruction relative to the mask reconstruction. The entire model is illustrated in Figure 1.
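The reconstruction term described above combines a Bernoulli log-likelihood for the mask with a squared-error penalty on visible pixels only. The following minimal sketch evaluates it for nested-list inputs, dropping the additive constant $C$; the function name and the single-channel image layout are assumptions for illustration.

```python
import math


def masked_recon_nll(image, mask, rho_hat, image_hat, sigma):
    """Negative log-likelihood of a masked image, up to an additive constant.

    Visible pixels (m = 1) pay -log(rho_hat) plus (i - i_hat)^2 / (2 sigma^2);
    hidden pixels (m = 0) pay -log(1 - rho_hat) and no pixel penalty.
    """
    total = 0.0
    for img_row, m_row, r_row, hat_row in zip(image, mask, rho_hat, image_hat):
        for px, m, r, px_hat in zip(img_row, m_row, r_row, hat_row):
            if m == 1:
                total += -math.log(r) + (px - px_hat) ** 2 / (2.0 * sigma ** 2)
            else:
                total += -math.log(1.0 - r)
    return total


nll_visible = masked_recon_nll([[1.0]], [[1]], [[0.5]], [[0.0]], sigma=1.0)
nll_hidden = masked_recon_nll([[1.0]], [[0]], [[0.5]], [[0.0]], sigma=1.0)
```

A smaller `sigma` increases the weight of pixel reconstruction relative to mask reconstruction, matching the role of $\sigma$ described in the text.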

It is worth noticing that, when adopting the Gumbel-softmax trick, we find that a discrete approximation of Eq. (6) reading

$$-\sum_{x,y=1}^{n}\left[m_{x,y}\log\rho_{x,y}+(1-m_{x,y})\log(1-\rho_{x,y})\right] \tag{8}$$

leads to better convergence in our experiments. We hypothesize that the better empirical performance of models using Eq. (8) rather than Eq. (6) can potentially be explained by the fact that the Gumbel-softmax reparametrization introduces bias and therefore the expression in Eq. (6) will not cancel on average with the corresponding term (7) in the VAE, even for a perfect mask reconstruction, i.e., $\hat{\rho}=\rho$.

4 Experimental Results

All our experiments were conducted for the original optimization objective (2) by optimizing the loss function derived in Section 3 using Eq. (8) instead of Eq. (6). We observed that the behaviour of the model was very sensitive to the constant $\beta$ . If $\beta$ was too small, the mask $\rho_{\zeta}$ would monotonically approach $\rho_{\zeta}=1$ . Conversely, for sufficiently large $\beta$ , $\rho_{\zeta}$ would vanish. We used two different techniques to improve behaviour of our model: (i) stop masking model gradients in variational autoencoders once $-\log g_{\theta}$ falls below a certain threshold, or (ii) change $\beta$ adaptively in such a way that $-\log g_{\theta}$ stays within a pre-defined range. Both of these approaches were able to guarantee in practice that the variational autoencoder loss reached a certain predefined value. For additional details of our model, see Appendix C.
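Technique (ii) can be pictured as a simple feedback rule on $\beta$. The sketch below is only one possible realization, written under the assumption of a multiplicative update; the function name, the update factor, and the clipping bounds are hypothetical and are not the schedule used in the paper (see Appendix C for the actual details).

```python
def update_beta(beta, vae_loss, low, high, factor=1.05,
                beta_min=1e-6, beta_max=1e3):
    """Nudge beta so that the VAE loss drifts back into [low, high].

    A VAE loss above the range means the masked image still carries too
    much information, so the compression weight beta is increased; a loss
    below the range means the mask is too opaque, so beta is decreased.
    """
    if vae_loss > high:
        beta *= factor
    elif vae_loss < low:
        beta /= factor
    return min(max(beta, beta_min), beta_max)


beta = 1.0
for observed_loss in [10.0, 9.0, 6.0, 4.0, 3.0]:  # hypothetical loss trace
    beta = update_beta(beta, observed_loss, low=2.0, high=5.0)
```

The direction of the update follows the behaviour reported above: small $\beta$ drives the mask towards $\rho_{\zeta}=1$, and large $\beta$ drives it towards zero.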

In all experiments discussed in this section, the groundtruth “features” that define image class are known in advance allowing us to interpret experimental results with ease. In a more general case, the quality of the model prediction can be judged based on the following three criteria: (a) accuracy of the trained classifier operating on masked images $I\odot M$ should be sufficiently close to the accuracy of a separate classifier trained on original images $I$ ; (b) VAE loss should fall into a predefined range; (c) the accuracy of the classifier prediction on $I\odot M^{\prime}$ should be sufficiently close to the prediction on $I\odot M$ for any fixed $I$ and all sampled realizations of $M^{\prime}$ .

In the following subsections, we first discuss our results on synthetic datasets with “anomalies” and “distractors”. These experiments were conducted without mask randomization, but we verified that experiments with mask randomization produced nearly identical results. We then discuss our experiments on a synthetic dataset designed to explore the effect that mask randomization has on the produced masks. Finally, we show results on the realistic SVHN dataset with a priori known localized features defining the image class (number of digits in the image). For this dataset, mask randomization appears to play an important role.

4.1 Experiments with “Anomalies”

For the first series of experiments, we used images from CIFAR10 (Krizhevsky, 2009) and MNIST datasets augmented by adding randomly-placed rectangular “anomalies” (thus designed to be low-entropy). The anomaly was added with a probability of $1/2$ and the classification task was to distinguish original images from the altered ones.

For these datasets, our models learned to produce opaque masks for most images without anomalies. For images with anomalies, generated masks were opaque everywhere except for the regions around rectangles added into the image (see Figure 2 and Figure 3). As a result, the image classifiers reached almost perfect accuracy in both of these examples: approximately $98\%$ test and train accuracy for MNIST and approximately $99\%$ test and train accuracy for CIFAR10 dataset.

In both models, $\ell_{1}$ norm of the mask was a strong predictor of whether the “anomaly” was in the image (see Figures 10 and 11). However, interestingly, the separation was much more visible for CIFAR10, while the masks predicted for the MNIST dataset were much better aligned with the actual anomalies. The latter fact can also be seen to be reflected in the mask averages inside and outside of the actual added rectangles (see Figures 10 and 11).

We hypothesize that these properties of the trained models can be attributed to receptive fields of the masking models used in both examples. For the MNIST dataset, the masking model has a receptive field of about 40% of the image size, while for the CIFAR10 dataset, the receptive field covered nearly the entire input image.
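For readers who want to relate a masking architecture to its receptive field, the standard recurrence for a stack of undilated convolution or pooling layers can be sketched as follows. The layer configurations shown are hypothetical examples and are not the architectures used in the paper.

```python
def receptive_field(layers):
    """Receptive field (in input pixels, along one axis) of stacked layers.

    `layers` is a list of (kernel_size, stride) pairs, applied in order.
    Each layer widens the field by (kernel_size - 1) times the product of
    the strides of all preceding layers.
    """
    field = 1
    jump = 1  # distance in input pixels between adjacent outputs so far
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
    return field


small_stack = receptive_field([(3, 1), (3, 1), (3, 1)])
strided_stack = receptive_field([(3, 1), (3, 2), (3, 1)])
```

Comparing the returned value with the image side length gives a rough estimate of the fraction of the image that a single mask pixel can depend on.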

4.2 Experiments with “Distractors”

In another set of experiments, we used two synthetic datasets based on MNIST, in which we combined: (a) two digits and (b) four digits in a single $56\times 56$ image. In both datasets, one of the digits was always smaller and it defined the class of the entire image. The larger digits are thus “distractors”.

For the vast majority of masks generated by the trained model, everything outside of the region around the small digit was masked out (see Figures 4, 5 and 12). In some rare cases, however, generated masks also let some pixels of the larger digits pass through. In most of our experiments, the classifier training and test accuracy reached $95\%$ and $90\%$ for the two- and four-digit datasets, respectively. However, there were some runs in which the test accuracy was lower than the training accuracy by $10\%$ or $20\%$. We believe this is due to the greater capacity of the combined masking and classifier models to overfit to the training data.

4.3 Mask Randomization Experiments

We identified a simple MNIST-based synthetic example, in which it can be clearly seen that without mask randomization, generated masks can encode class information while using virtually no pixels from the actual digits. In our example, we use 5 MNIST digits (0 through 4) and add 4 solid rectangles (“anchors”) into the image, thus allowing the mask to use them for encoding the image label. A model trained without any mask randomization, i.e., $M^{\prime}=M$, can be seen to produce attention regions selecting anchors, but frequently avoiding actual digit pixels altogether (see Figure 6(a)). The trained classifier has almost perfect accuracy ($\sim 99\%$) on original masked images. However, once we start evaluating the same classifier on images with randomized masks (adding random transparent rectangular patches), the accuracy drops to $\sim 33\%$ for some of the digits. After the classifier is fine-tuned on images with randomized masks, the lowest accuracy for a digit goes up to $70.3\%$ (for digit $2$, which ends up being most frequently confused with $3$).

We then conduct experiments with the same dataset and enable mask randomization during training (by selecting $M^{\prime}$ to be equal to $M$ with a randomly placed transparent rectangle). Newly trained models now mainly concentrate on the digit pixels and seem to select discriminative parts of the image (see Figure 6(b)). Evaluating the accuracy of this classifier with mask randomization, we observe that the average accuracy now stays above $93\%$ for all digits.

4.4 SVHN Experiments

We chose the original SVHN dataset (Netzer et al., 2011) for our experiments with realistic images. The task given to a classifier was to predict the number of digits in the street/house number shown in the image. With this task, the generated mask was expected to concentrate on areas of the image containing numbers.

We started our experiments without mask randomization, i.e., $m^{\prime}=m$ . We picked $\sigma=(1/8)^{1/2}$ and the target VAE loss objective was chosen in such a way that the mask was neither transparent, nor almost completely opaque. For intermediate values of the VAE objective, most of the observed solutions produced noticeable peaks of transparency around the digits. Results obtained for one of the models trained with sufficiently low VAE target are shown in Figure 7.

For lower VAE loss targets, we frequently observed masks that used interleaved transparent and opaque lines (either vertical or horizontal) as means of minimizing VAE loss while still allowing the classifier to achieve high accuracy in predicting the number of digits in the image.

For even lower VAE thresholds, generated masks were no longer transparent around the digits, but instead were mostly opaque in these areas. This behavior can be understood by noticing that the digits, which contain many complex sharp edges, may carry more information than relatively featureless surrounding areas (they are also poorly approximated by VAEs, which tend to favor smooth reconstructions). In essence, the “negative space” outside of the number bounding box may be lower-entropy, but its shape may still be enough to determine the number of digits in the image. In this case, the mask itself became a feature strongly correlated with the image label. Plotting histograms for the $\ell_{1}$ mask norm, we observed that masks generated for $1$-digit images were almost entirely transparent while the masks generated for $4$-digit images were mostly opaque. The histogram of the $\ell_{1}$ mask norm for different image labels is shown in Figure 8. We verified that using the $\ell_{1}$ mask norm alone, we could reach a $59\%$ accuracy on the image classification task, just $3\%$ lower than the actual trained classifier receiving the masked image.

If all digits had the same aspect ratio, attending to the “negative space” of the number could actually be a reasonable solution satisfying all conditions outlined in Section 3. In a more general case, observed masks that simply encode image class information do not seem to satisfy the condition $\mathbb{I}(I\odot M;C)\leq\mathbb{I}(I\odot M^{\prime};C)$ . Implementing mask randomization by adding randomly-placed transparent rectangles to $m$ , we verified that newly trained masking models were now nearly always concentrating on digits rather than the “negative space”.

5 Alternative Approach based on Conditional Mutual Information

In previous sections, we showed that the Information Bottleneck optimization objective (1) allows for the class information to be encoded in the mask itself. Previously, we used mask randomization to address this issue. Another approach to preventing the generated mask from encoding class information is based on modifying the Information Bottleneck objective by replacing $\mathbb{I}(I\odot M;C)$ with $\mathbb{I}(I\odot M;C|M)$, thus leading to the optimization objective:

$$\operatorname*{arg\,min}_{\zeta}\left[\beta\,\mathbb{I}(I\odot M;I|M)-\mathbb{I}(I\odot M;C|M)\right], \tag{9}$$

where we also chose to minimize $\mathbb{I}(I\odot M;I|M)$ instead of $\mathbb{I}(I\odot M;I)$ for consistency. Conditioning on the mask implies that for any realization of the mask, masked pixels should contain the entirety of the information about the image class. If, for example, the image class could be inferred just from the mask, the conditional mutual information $\mathbb{I}(I\odot M;C|M)$ would vanish.
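The last statement can be checked with a short calculation (a standard information-theoretic argument, spelled out here for convenience). Expanding the conditional mutual information for a discrete label $C$ gives

$$0\leq\mathbb{I}(I\odot M;C|M)=\mathbb{H}(C|M)-\mathbb{H}(C|I\odot M,M)\leq\mathbb{H}(C|M).$$

If the class can be inferred from the mask alone, then $\mathbb{H}(C|M)=0$, and the inequalities force $\mathbb{I}(I\odot M;C|M)=0$. A mask that encodes the label therefore earns no reward from the second term of the objective.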

In order to optimize this objective, we have to modify Eq. (9) by introducing a function $c^{\prime}(i)$ that is chosen to approximate the groundtruth label $C$ :

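$$\operatorname*{arg\,min}_{\zeta}\left[\beta\,\mathbb{I}(I\odot M;I|M)-\mathbb{I}(I\odot M;C^{\prime}|M)\right]. \tag{10}$$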

The exact form of $c^{\prime}(i)$ will prove to be unimportant and in practice we frequently chose actual labels for our experiments assuming that the perfect groundtruth model $I\to C$ exists.

As shown in Appendix B, the optimization problem (10) is equivalent to:

$$\operatorname*{arg\,min}_{\zeta}\left[\beta\,\mathbb{H}(I\odot M|M)+\mathbb{H}(M)-\mathbb{H}(M|I)+\mathbb{H}(I|M,C^{\prime})+\mathbb{H}(C^{\prime}|I\odot M)\right]. \tag{11}$$

Following our earlier discussion, we can then explicitly calculate $\mathbb{H}(M|I)$ and use variational upper bounds for all remaining entropies and conditional entropies. The complete model will therefore include: (a) a VAE on the masked portion of the image $I\odot M$ conditioned on the image mask $M$; (b) a VAE for the mask $M$ itself; (c) a VAE auto-encoding the image $I$ and conditioned on the mask $M$ and the class approximation $C^{\prime}$; and (d) an image classifier with $I\odot M$ as its input. Notice that it is the conditional entropy $\mathbb{H}(I|M,C^{\prime})$ that is responsible for disentangling $M$ and $C^{\prime}$. Indeed, by trying to minimize the entropy of images conditioned on the mask and the image class, we effectively reward the mask for containing information from $I$ that is not encoded in $C^{\prime}$.

In our first preliminary experiments, we trained the upper-bound model for (11) on the MNIST-based synthetic datasets. For the dataset with “anchors”, the model was able to generate masks that were: (a) covering the digits, (b) allowing the classification model to achieve $94\%$ accuracy and (c) nearly independent of the image label (see Figure 9), which is exactly what objective (11) was designed to achieve. Similarly, for the dataset with distractors, the generated masks were almost indistinguishable from those shown in Figure 5.

For the dataset with anomalies, the experiments based on Eq. (11) failed to identify a proper mask and instead produced a mask transparent at the boundary and almost entirely opaque at the image center. Average masking probability $\langle\rho\rangle$ in the center encoded information about the image class so that the classifier could (with accuracy close to $100\%$ ) predict the presence of anomaly by just averaging values of visible pixels. This failure is not surprising if you notice that a mask transparent near an anomaly (see for example Figure 2), but opaque for an image without one, does not optimize objective (11). Indeed, given such a mask, one would be able to predict image class by just looking at the mask itself. The optimal mask would have to have shape and location independent of the image class and concentrate on anomaly if it is present in the image. Instead of finding this complex solution, our model identified a simpler one by producing a mask that is almost independent of the image class, but still conveys enough information about the presence of anomaly in the image.

Overall, while conceptually sound, objective Eq. (11) is much more complex than Eq. (3), making it potentially less effective in practice. The disentanglement of $M$ and $C^{\prime}$ critically relies on the upper bound for $\mathbb{H}(I|M,C^{\prime})$ being sufficiently tight, and we suspect that this may be difficult to achieve in practice for complex datasets containing realistic images. More complex density estimation models could, however, alleviate this problem.

6 Conclusions

In this work, we propose a novel universal semi-supervised attention learning approach based on the Information Bottleneck method. Supplied with a set of labeled images, the model is trained to generate discrete attention masks that occlude irrelevant portions of the image, but leave enough information for the classifier to correctly predict the image class. Using synthetic and real datasets based on MNIST, CIFAR10, and SVHN, we demonstrate that this technique can be used to identify image regions carrying information that defines the image class. In some special cases when the feature itself is high-entropy (for example, digits in SVHN images), but its shape is sufficient to determine the image class (number of digits in our SVHN example), we show that the generated mask may occlude the feature and use its “negative space” instead. Additionally, we identify a potential failure of this approach, in which the generated mask acts not as an attention map, but rather as an encoding of the image class itself. We then propose two techniques based on finite receptive fields and mask randomization that mitigate this problem. We believe this technique is a promising method to train explainable models in a semi-supervised manner.

Appendix A Relation to Conditional Entropy Bottleneck

There is an alternative information-theoretic optimization objective that looks similar to Eq. (2), but is based on Conditional Entropy Bottleneck (CEB) (Fischer, 2018) instead:

[TABLE]

Here the mutual information between $I$ and $I\odot M$ is conditioned on the class label variable $C$ . Just like $Q$ , the new objective $Q^{\prime}$ can be rewritten as:

[TABLE]

Rewriting $\mathbb{H}(I\odot M|C)$ as $\mathbb{H}(C|I\odot M)+\mathbb{H}(I\odot M)-\mathbb{H}(C)$ and recalling that $\mathbb{H}(C)$ is a constant, $Q^{\prime}$ can be expressed as:

[TABLE]

where $\nu$ is a constant. Notice that without mask randomization when $M^{\prime}=M$ , this expression can be further rewritten as:

[TABLE]

with $\nu^{\prime}$ being a new constant. This suggests that for $M^{\prime}=M$ the original optimization objective (2) with $0\leq\beta<1$ is equivalent to a CEB-based optimization objective (12) with $\beta^{\prime}=\beta/(1-\beta)$ .
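The algebra behind this equivalence can be sketched as follows. This is a reconstruction under two assumptions rather than a copy of the displayed equations: first, that objective (12) has the form $\beta^{\prime}I(I\odot M;I|C)-I(I\odot M^{\prime};C)$ ; second, that the mask is generated from the image alone, so that $\mathbb{H}(I\odot M|I,C)=\mathbb{H}(I\odot M|I)$ .

```latex
\begin{align*}
Q^{\prime}_{\beta^{\prime}}
 &= \beta^{\prime}\, I(I\odot M; I \mid C) - I(I\odot M^{\prime}; C) \\
 &= \beta^{\prime}\,\mathbb{H}(I\odot M \mid C)
    - \beta^{\prime}\,\mathbb{H}(I\odot M \mid I, C)
    - \mathbb{H}(C) + \mathbb{H}(C \mid I\odot M^{\prime}) \\
 &= \beta^{\prime}\,\bigl[\mathbb{H}(I\odot M) - \mathbb{H}(I\odot M \mid I)\bigr]
    + \beta^{\prime}\,\mathbb{H}(C \mid I\odot M)
    + \mathbb{H}(C \mid I\odot M^{\prime})
    - (1+\beta^{\prime})\,\mathbb{H}(C) \\
 &\overset{M^{\prime}=M}{=}
    (1+\beta^{\prime})\,\Bigl[\tfrac{\beta^{\prime}}{1+\beta^{\prime}}\, I(I\odot M; I)
    - I(I\odot M; C)\Bigr] .
\end{align*}
```

Comparing the bracket in the last line with objective (2) gives $\beta=\beta^{\prime}/(1+\beta^{\prime})$ , which inverts to $\beta^{\prime}=\beta/(1-\beta)$ and is finite only for $0\leq\beta<1$ . The positive prefactor $(1+\beta^{\prime})$ does not change the minimizer.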

By analogy with Eq. (4), the upper bound for Eq. (12) can be written as:

[TABLE]

The only difference from Eq. (4) is in the fact that the probabilistic model $g_{\theta}^{(c)}(i\odot m)=g_{\theta}(i\odot m|c)$ now depends on the sample class $c$ .

Even though we used the original optimization objective (2) in our experiments, the CEB-based objective (12) and the corresponding upper bound (13) may have a practical advantage when the class-dependent variational approximations $g_{\theta}^{(c)}(i\odot m)$ provide a tighter bound than the class-agnostic model $g_{\theta}(i\odot m)$ .

Appendix B Derivation of Equation (11)

First, following the definition of the conditional mutual information:

[TABLE]

Next, noticing that $\mathbb{H}(M,I)=\mathbb{H}(M,I,C^{\prime})$ , we obtain:

[TABLE]

and therefore

[TABLE]

can be rewritten as:

[TABLE]

Combining all terms together we obtain:

[TABLE]

where $\mathbb{H}(I)$ is a constant.

Appendix C Model Details

In our experiments, all of the model components including the classifier, mask generator and VAE encoder/decoder were based on convolutional neural networks.

C.1 Notation

In the following, we use a simplified notation for writing down simple convolutional network architectures. A convolution is denoted as $\mathrm{C}(k,s,d)$ , where $k$ is the kernel size, $s$ is the stride and $d$ is the number of output channels; the default padding is valid, and a subscript, as in $\mathrm{C}_{s}$ , indicates same padding. Image resizing is denoted by $\mathrm{Resize}(s)$ with $s$ being the new size and $\mathrm{Shape}(s)$ is tensor reshaping. Similarly, $\mathrm{T}(k,s,d)$ is the transpose convolution, $\mathrm{Pad}(x)$ is image padding, $\mathrm{Avg}$ is the average pooling operator and $\mathrm{FC}(d)$ is the fully-connected layer mapping its input to a vector of size $d$ . The architecture is then represented as a sequence of operations separated by $\rightarrow$ , i.e., $a\rightarrow b=b\circ a$ .
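As an illustration of how this notation can be read, the following sketch traces tensor shapes through a chain of $\mathrm{C}(k,s,d)$ operations. The helper names `conv_out`, `C` and `chain` are hypothetical, and same padding is assumed to follow the common convention of producing $\lceil \mathrm{size}/s\rceil$ outputs; the paper does not state which framework convention was used.

```python
from functools import reduce

def conv_out(size, k, s, same=False):
    """Spatial size after C(k, s, d); same=True corresponds to C_s."""
    if same:
        return -(-size // s)          # ceil(size / s), assumed 'same' convention
    return (size - k) // s + 1        # valid padding

def C(k, s, d, same=False):
    """Return an op mapping (height, width, channels) to the output shape."""
    return lambda shape: (conv_out(shape[0], k, s, same),
                          conv_out(shape[1], k, s, same), d)

def chain(*ops):
    """a -> b -> ... applies the leftmost op first, i.e. a -> b = b ∘ a."""
    return lambda shape: reduce(lambda acc, op: op(acc), ops, shape)

# Convolutional part of the classifier listed in Appendix C.2,
# applied to a 28x28 single-channel image.
mnist_classifier_convs = chain(
    C(1, 1, 4), C(3, 2, 4), C(1, 1, 8), C(3, 2, 8), C(1, 1, 16), C(3, 2, 16))

print(mnist_classifier_convs((28, 28, 1)))  # (2, 2, 16)
```

With valid padding, the three stride-2 convolutions reduce the spatial size as $28\rightarrow 13\rightarrow 6\rightarrow 2$ , after which $\mathrm{Avg}$ and $\mathrm{FC}(2)$ would produce the two class logits.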

C.2 MNIST

For this dataset, we used original MNIST images with randomly placed rectangles whose size varied from 3 to 10 pixels. The color was chosen randomly from the range $[100,255]$ . The VAE loss target was set at $12$ .

• Classifier architecture: $\mathrm{C}(1,1,4)\rightarrow\mathrm{C}(3,2,4)\rightarrow\mathrm{C}(1,1,8)\rightarrow\mathrm{C}(3,2,8)\rightarrow\mathrm{C}(1,1,16)\rightarrow\mathrm{C}(3,2,16)\rightarrow\mathrm{Avg}\rightarrow\mathrm{FC}(2)$ with ReLU6 nonlinearities.

• Mask architecture: $[\mathrm{C}(1,1,4)\rightarrow\mathrm{C}(3,2,4)\rightarrow\mathrm{C}(1,1,8)\rightarrow\mathrm{C}(3,2,8)]\rightarrow[\mathrm{Resize}(12)\rightarrow\mathrm{C}(1,1,16)\rightarrow\mathrm{Pad}(1)\rightarrow\mathrm{Resize}(28)\rightarrow\mathrm{C}(1,1,16)\rightarrow\mathrm{C}(1,1,1)]$ with the subnetwork in the first half using ReLU6 and the network in the second half using Leaky ReLU.

• Encoder architecture: $\mathrm{C}(3,2,16)\rightarrow\mathrm{C}(3,2,16)\rightarrow\mathrm{C}(3,1,16)$ with Leaky ReLU nonlinearities.

• Decoder architecture: $\mathrm{FC}(24)\rightarrow\mathrm{FC}(49)\rightarrow\mathrm{Shape}(7\times 7)\rightarrow\mathrm{T}_{s}(3,2,16)\rightarrow\mathrm{T}_{s}(3,1,16)\rightarrow\mathrm{T}_{s}(3,2,16)\rightarrow\mathrm{C}(1,1,2)$ with Leaky ReLU nonlinearities.
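The rectangle placement described for this dataset could be sketched as follows. Several details are assumptions that the text does not specify: the rectangle is filled, its height and width are sampled independently, exactly one rectangle is drawn, and it lies fully inside the image. The function name `add_random_rectangle` is hypothetical, and a blank array stands in for a real MNIST digit.

```python
import numpy as np

def add_random_rectangle(image, rng, min_size=3, max_size=10,
                         min_color=100, max_color=255):
    """Return a copy of a 2-D uint8 image with one filled rectangle drawn on it."""
    h, w = image.shape
    rect_h = int(rng.integers(min_size, max_size + 1))   # upper bound is exclusive
    rect_w = int(rng.integers(min_size, max_size + 1))
    top = int(rng.integers(0, h - rect_h + 1))            # keep the rectangle inside
    left = int(rng.integers(0, w - rect_w + 1))
    color = int(rng.integers(min_color, max_color + 1))
    out = image.copy()
    out[top:top + rect_h, left:left + rect_w] = color
    return out, (top, left, rect_h, rect_w)

rng = np.random.default_rng(0)
digit = np.zeros((28, 28), dtype=np.uint8)   # stand-in for a real MNIST digit
augmented, box = add_random_rectangle(digit, rng)
```

Returning the rectangle coordinates alongside the image makes it possible to compare a generated mask against the known location of the rectangle.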

C.3 CIFAR10

For this dataset, we used original CIFAR10 images with randomly placed rectangles whose size varied from 3 to 10 pixels. The color was chosen at random (with RGB components ranging from [math] to $255$ ). The VAE loss target was set at $25$ .

• Classifier architecture: $\mathrm{C}(1,1,8)\rightarrow\mathrm{C}(3,2,16)\rightarrow\mathrm{C}(1,1,16)\rightarrow\mathrm{C}(3,2,32)\rightarrow\mathrm{C}(1,1,32)\rightarrow\mathrm{C}(3,2,48)\rightarrow\mathrm{C}(1,1,48)\rightarrow\mathrm{Avg}\rightarrow\mathrm{FC}(2)$ with ReLU6 nonlinearities.

• Mask architecture: $[\mathrm{C}(1,1,8)\rightarrow\mathrm{C}(3,2,16)\rightarrow\mathrm{C}(1,1,16)\rightarrow\mathrm{C}(3,2,32)]\rightarrow[\mathrm{C}_{s}(3,1,16)\rightarrow\mathrm{Resize}(10)\rightarrow\mathrm{C}_{s}(3,1,16)\rightarrow\mathrm{Resize}(16)\rightarrow\mathrm{C}_{s}(3,1,8)\rightarrow\mathrm{Resize}(32)\rightarrow\mathrm{C}_{s}(3,1,8)\rightarrow\mathrm{C}(1,1,1)]$ with the subnetwork in the first half using ReLU6 and the network in the second half using Leaky ReLU.

• Encoder architecture: $\mathrm{C}(3,2,8)\rightarrow\mathrm{C}(3,2,8)\rightarrow\mathrm{C}(3,2,16)\rightarrow\mathrm{C}(3,1,16)$ with Leaky ReLU nonlinearities.

• Decoder architecture: $\mathrm{FC}(64)\rightarrow\mathrm{FC}(128)\rightarrow\mathrm{Shape}(8\times 8)\rightarrow\mathrm{T}_{s}(3,2,16)\rightarrow\mathrm{T}_{s}(3,1,16)\rightarrow\mathrm{T}_{s}(3,2,16)\rightarrow\mathrm{T}_{s}(3,1,4)$ with Leaky ReLU nonlinearities.

C.4 Multiple MNIST Digits

All synthetic multi-digit images were generated by placing 2 or 4 digits into the quadrants and randomly shifting them by at most 4 pixels. For the two- and four-digit datasets, the small digit was downsampled to $18\times 18$ and $14\times 14$ , respectively. The target VAE loss was set at $50$ .

• Classifier architecture: $\mathrm{C}(1,1,4)\rightarrow\mathrm{C}_{s}(3,2,4)\rightarrow\mathrm{C}(1,1,8)\rightarrow\mathrm{C}(3,2,8)\rightarrow\mathrm{C}(1,1,16)\rightarrow\mathrm{C}(3,2,16)\rightarrow\mathrm{C}(1,1,16)\rightarrow\mathrm{C}(3,2,16)\rightarrow\mathrm{Avg}\rightarrow\mathrm{FC}(10)$ with ReLU6 nonlinearities.

• Mask architecture: $[\mathrm{C}(1,1,4)\rightarrow\mathrm{C}_{s}(3,2,4)\rightarrow\mathrm{C}(1,1,8)\rightarrow\mathrm{C}(3,2,8)]\rightarrow[\mathrm{Resize}(12)\rightarrow\mathrm{C}_{s}(3,1,16)\rightarrow\mathrm{Pad}(1)\rightarrow\mathrm{Resize}(28)\rightarrow\mathrm{C}_{s}(3,1,16)\rightarrow\mathrm{Resize}(56)\rightarrow\mathrm{C}(1,1,16)\rightarrow\mathrm{C}(1,1,1)]$ with the subnetwork in the first half using ReLU6 and the network in the second half using Leaky ReLU.

• Encoder architecture: $\mathrm{C}_{s}(3,2,16)\rightarrow\mathrm{C}(3,2,16)\rightarrow\mathrm{C}(3,2,16)\rightarrow\mathrm{C}(3,1,8)$ with Leaky ReLU nonlinearities.

• Decoder architecture: $\mathrm{FC}(24)\rightarrow\mathrm{FC}(49)\rightarrow\mathrm{Shape}(7\times 7)\rightarrow\mathrm{T}_{s}(3,2,16)\rightarrow\mathrm{T}_{s}(3,1,16)\rightarrow\mathrm{T}_{s}(3,2,8)\rightarrow\mathrm{T}_{s}(3,2,4)\rightarrow\mathrm{C}(1,1,2)$ with Leaky ReLU nonlinearities.
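The composition of the multi-digit images could look roughly like the sketch below. It rests on assumptions not stated in the text: the canvas is $56\times 56$ (inferred from the $\mathrm{Resize}(56)$ step of the mask network), each digit is centered in its quadrant before the random shift, and in the two-digit case the two quadrants are chosen at random. The function name `compose_digits` is hypothetical, and the inputs are taken to be already downsampled.

```python
import numpy as np

def compose_digits(digits, rng, canvas=56, max_shift=4):
    """Place up to four small square digit images into distinct canvas quadrants."""
    half = canvas // 2
    out = np.zeros((canvas, canvas), dtype=np.uint8)
    corners = [(0, 0), (0, half), (half, 0), (half, half)]    # quadrant origins
    # Assumed: quadrants are drawn at random without replacement.
    quadrants = rng.choice(4, size=len(digits), replace=False)
    for digit, q in zip(digits, quadrants):
        qy, qx = corners[q]
        size = digit.shape[0]
        centre = (half - size) // 2                           # offset centering the digit
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        y, x = qy + centre + dy, qx + centre + dx
        out[y:y + size, x:x + size] = digit
    return out

rng = np.random.default_rng(0)
# Constant patches stand in for downsampled MNIST digits.
two = compose_digits([np.full((18, 18), 255, np.uint8)] * 2, rng)
four = compose_digits([np.full((14, 14), 255, np.uint8)] * 4, rng)
```

With these sizes, a shift of at most 4 pixels keeps every digit inside its own $28\times 28$ quadrant, so digits never overlap.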

C.5 SVHN

All SVHN images were cropped and down- or up-sampled to $128\times 128$ . We used the Inception-based image augmentation technique, leaving at least $95\%$ of the entire number bounding box within the frame. The image transformation was not permitted to generate a crop containing less than $40\%$ of the original image. The VAE loss target was chosen to be $2000$ and $\sigma=(1/8)^{1/2}$ .

• Classifier architecture: $\mathrm{C}_{s}(3,1,4)\rightarrow\mathrm{C}_{s}(3,2,4)\rightarrow\mathrm{C}_{s}(3,1,4)\rightarrow\mathrm{C}_{s}(3,2,4)\rightarrow\mathrm{C}_{s}(3,1,4)\rightarrow\mathrm{C}_{s}(3,2,8)\rightarrow\mathrm{C}_{s}(3,1,8)\rightarrow\mathrm{C}_{s}(3,2,8)\rightarrow\mathrm{C}_{s}(3,1,8)\rightarrow\mathrm{C}_{s}(3,2,8)$ with ReLU6 nonlinearities.

• Mask architecture: $[\mathrm{C}_{s}(3,1,4)\rightarrow\mathrm{C}_{s}(3,2,4)\rightarrow\mathrm{C}(1,1,8)\rightarrow\mathrm{C}(3,2,8)]\rightarrow[\mathrm{C}_{s}(3,1,8)\rightarrow\mathrm{C}_{s}(5,1,8)\rightarrow\mathrm{Resize}(16)\rightarrow\mathrm{C}_{s}(5,1,8)\rightarrow\mathrm{C}_{s}(5,1,8)\rightarrow\mathrm{Resize}(32)\rightarrow\mathrm{C}_{s}(3,1,4)\rightarrow\mathrm{C}(1,1,1)\rightarrow\mathrm{Resize}(128)]$ with the subnetwork in the first half using ReLU6 and the network in the second half using Leaky ReLU.

• Encoder architecture: $\mathrm{C}_{s}(3,2,8)\rightarrow\mathrm{C}_{s}(3,1,8)\rightarrow\mathrm{C}_{s}(3,2,16)\rightarrow\mathrm{C}_{s}(3,1,16)\rightarrow\mathrm{C}_{s}(3,2,16)\rightarrow\mathrm{C}_{s}(3,1,16)\rightarrow\mathrm{C}_{s}(3,2,16)\rightarrow\mathrm{C}_{s}(3,2,32)\rightarrow\mathrm{C}_{s}(3,1,32)$ with Leaky ReLU nonlinearities.

• Decoder architecture: $\mathrm{FC}(64)\rightarrow\mathrm{FC}(128)\rightarrow\mathrm{Shape}(8\times 8)\rightarrow\mathrm{T}_{s}(3,2,16)\rightarrow\mathrm{T}_{s}(3,1,16)\rightarrow\mathrm{T}_{s}(3,2,16)\rightarrow\mathrm{T}_{s}(3,1,16)\rightarrow\mathrm{T}_{s}(3,2,8)\rightarrow\mathrm{T}_{s}(3,1,8)\rightarrow\mathrm{T}_{s}(3,2,4)$ with Leaky ReLU nonlinearities.
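The two crop constraints stated for the SVHN augmentation (at least $40\%$ of the original image area and at least $95\%$ of the number bounding box inside the frame) can be illustrated with a simple rejection sampler. This is a stand-in sketch and not the augmentation pipeline used in the experiments: the aspect-ratio range, the fallback to the full image, the example bounding box and the function names `bbox_coverage` and `sample_crop` are all assumptions.

```python
import random

def bbox_coverage(crop, bbox):
    """Fraction of the bounding-box area that lies inside the crop."""
    cx0, cy0, cx1, cy1 = crop
    bx0, by0, bx1, by1 = bbox
    iw = max(0.0, min(cx1, bx1) - max(cx0, bx0))
    ih = max(0.0, min(cy1, by1) - max(cy0, by0))
    return iw * ih / ((bx1 - bx0) * (by1 - by0))

def sample_crop(img_w, img_h, bbox, rng, min_area=0.40, min_cover=0.95,
                aspect_range=(0.75, 1.33), max_attempts=1000):
    """Rejection-sample a crop (x0, y0, x1, y1) satisfying both constraints."""
    for _ in range(max_attempts):
        area = rng.uniform(min_area, 1.0) * img_w * img_h
        aspect = rng.uniform(*aspect_range)           # assumed width/height range
        w, h = (area * aspect) ** 0.5, (area / aspect) ** 0.5
        if w > img_w or h > img_h:
            continue                                  # crop would not fit, resample
        x0 = rng.uniform(0.0, img_w - w)
        y0 = rng.uniform(0.0, img_h - h)
        crop = (x0, y0, x0 + w, y0 + h)
        if bbox_coverage(crop, bbox) >= min_cover:
            return crop
    # Assumed fallback: the full image trivially satisfies both constraints.
    return (0.0, 0.0, float(img_w), float(img_h))

rng = random.Random(0)
bbox = (40.0, 30.0, 90.0, 100.0)      # hypothetical number bounding box
crop = sample_crop(128, 128, bbox, rng)
```

The accepted crop would then be resized to $128\times 128$ before being passed to the model.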

Appendix D Supplementary Figures

Bibliography

Achille, A. and Soatto, S. Information dropout: Learning optimal representations through noisy computation. IEEE Trans. Pattern Anal. Mach. Intell., 40(12):2897–2905, 2018. doi: 10.1109/TPAMI.2017.2784440.
Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. Deep variational information bottleneck. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.
Bardera, A., Rigau, J., Boada, I., Feixas, M., and Sbert, M. Image segmentation using information bottleneck method. IEEE Trans. Image Processing, 18(7):1601–1612, 2009. doi: 10.1109/TIP.2009.2017823.
Bengio, S., Wallach, H. M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.). Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, 2018.
Fischer, I. The Conditional Entropy Bottleneck, 2018. URL openreview.net/forum?id=rkVOXhAqY7.
Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Trischler, A., and Bengio, Y. Learning deep representations by mutual information estimation and maximization. CoRR, abs/1808.06670, 2018.
Hou, Q., Jiang, P., Wei, Y., and Cheng, M. Self-erasing network for integral object attention. In Bengio et al. (2018), pp. 547–557.
Jang, E., Gu, S., and Poole, B. Categorical reparameterization with Gumbel-softmax. In 5th International Conference on Learning Representations, ICLR 2017, 2016.