COMIC: Towards A Compact Image Captioning Model with Attention

Jia Huei Tan; Chee Seng Chan; Joon Huang Chuah

arXiv:1903.01072·cs.CV·June 13, 2019


TL;DR

COMIC introduces a compact image captioning model that maintains high performance while significantly reducing vocabulary size, making it suitable for embedded systems.

Contribution

The paper proposes a novel compact image captioning model, COMIC, which achieves comparable results to state-of-the-art methods with a much smaller vocabulary size.

Findings

01

Achieves similar performance to state-of-the-art models on MS-COCO and InstaPIC-1.1M datasets.

02

Embedding vocabulary size is reduced by 39x to 99x with only a small loss (roughly 1-2%) on the reported metrics.

03

Demonstrates the feasibility of deploying image captioning models on resource-limited devices.


Tables

TABLE I: Comparison of models with different tokenisation and encoding schemes on MS-COCO

Tokens	# params.	B-1	B-2	B-3	B-4	M	R	C	S
Character	4.5 M	0.670	0.498	0.364	0.266	0.220	0.495	0.770	0.149
Word	12.2 M	0.704	0.533	0.397	0.295	0.235	0.517	0.880	0.165
Radix, base-64	4.5 M	0.693	0.517	0.380	0.280	0.229	0.507	0.824	0.155
Radix, base-128	4.6 M	0.694	0.522	0.386	0.287	0.233	0.509	0.848	0.159

Equations

$\log p(S\,|\,I)=\sum_{t=0}^{L}\log p(S_{t}\,|\,I,\,S_{0:t-1},\,c_{t})$

$h_{t=-1}=W_{I}\tanh(LN(I_{embed}))$

$p_{t}=Softmax(E_{o}h_{t})$

$h_{t},m_{t}=LSTM(x_{t},h_{t-1},m_{t-1})$

$x_{t}=[E_{w}S_{t-1},\,c_{t}]$

$c_{t}=\sum_{j}^{|F|}(\alpha_{tj}\odot f_{j})$

$\alpha_{tj}=\frac{\exp(MLP(f_{j},h_{t-1})/\epsilon)}{\sum_{j}^{|F|}\exp(MLP(f_{j},h_{t-1})/\epsilon)}$

$MLP(f_{j},h_{t-1})=W_{M2}\tanh(LN(W_{M0}f_{j}+W_{M1}h_{t-1}))$

$c_{t}=\sum_{j}^{|F|}(\alpha_{tj}\odot W_{f}f_{j})$

$c_{t}=\sum_{j}^{|F|}(\alpha_{tj}\odot W_{M0}f_{j})$

$L(I,S)=-\sum_{t}^{L}\log p_{t}(S_{t})+\sum_{j}^{|F|}\left(1-\sum_{t}^{L}\alpha_{tj}\right)^{2}+\lambda\cdot\|\theta\|_{2}^{2}$


Full text


Jia Huei Tan, Chee Seng Chan, and Joon Huang Chuah

Manuscript received July 05, 2018; revised December 22, 2018; accepted on February 22, 2019. This research is supported by the UM Frontier Research Grant FG002-17AFR, from University of Malaya. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. El Saddik, Abdulmotaleb. *(Corresponding author: Chee Seng Chan)*

J.H. Tan and C.S. Chan are with the Center of Image and Signal Processing, Department of Artificial Intelligence, Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, 50603 MALAYSIA. e-mail: {[email protected]; [email protected]}

J.H. Chuah is with the Department of Electrical Engineering, Faculty of Engineering, University of Malaya, Kuala Lumpur, 50603 MALAYSIA. e-mail: {[email protected]}

Abstract

Recent works in image captioning have shown very promising raw performance. However, we realize that most of these encoder-decoder style networks with attention do not scale naturally to large vocabulary sizes, making them difficult to deploy on embedded systems with limited hardware resources. This is because the sizes of the word and output embedding matrices grow proportionally with the size of the vocabulary, adversely affecting the compactness of these networks. To address this limitation, this paper introduces a brand new idea in the domain of image captioning: we tackle the compactness of image captioning models, a problem that is hitherto unexplored. We show that our proposed model, named COMIC for COMpact Image Captioning, achieves comparable results in five common evaluation metrics with state-of-the-art approaches on both MS-COCO and InstaPIC-1.1M datasets despite having an embedding vocabulary size that is $39\times$ to $99\times$ smaller. The source code and models are available at: https://github.com/jiahuei/COMIC-Compact-Image-Captioning-with-Attention

Index Terms:

image captioning, deep compression network, deep learning

I Introduction

Automatically generating a caption that describes an image, a problem known as image captioning, is a challenging problem where computer vision meets natural language processing. Compared to image classification and object recognition tasks, image captioning requires a higher level of scene understanding as well as language modelling. A well performing model not only has to identify the objects in the image, but also capture the semantic relationship between them, general context and the activities that they are involved in. Furthermore, the model has to map the visual representation into a fully-formed English sentence.

Given the many similarities shared between image captioning and neural translation, many recent approaches in the image captioning domain have been inspired by the advances in neural translation [1, 2, 3]. A common framework is to use a word embedding matrix to produce a word embedding vector to serve as the input, and a separate output projection matrix to produce a probability distribution over all the words. However, we found that as the datasets used to train the models grow larger, so does the vocabulary size. These huge embedding matrices in turn inflate the model, adversely affecting its compactness and making it difficult to deploy on embedded systems with limited hardware resources. For example, the Recurrent Neural Network (RNN) decoder in the Show, Attend and Tell framework [4] has a vocabulary size of $9,962$. The resulting model has $12.2M$ parameters, of which $7.7M$ belong to the word embedding and output projection matrices. Even with embedding weight sharing [5], the model still has $7.3M$ parameters, of which $2.6M$ belong to the embedding matrices. On the other hand, character-based models, although compact thanks to their small vocabulary size, suffer from poor performance. This is because character-based text sequences are usually much longer, which exacerbates the difficulty of learning long-range dependencies.
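The embedding figures quoted above can be checked with simple arithmetic. The sketch below assumes the word size $m=256$ and LSTM size $n=512$ reported later in Section V-A; the function name is hypothetical, and the treatment of weight sharing (counting only the shared matrix) is a simplifying assumption rather than the exact scheme of [5].

```python
def embedding_params(vocab_size: int, word_size: int, hidden_size: int,
                     shared: bool = False) -> int:
    """Count parameters in the input embedding and output projection matrices."""
    input_embedding = word_size * vocab_size          # E_w has shape m x v
    if shared:
        # Assumption: with weight sharing the output side reuses E_w, so only
        # the shared matrix is counted (any small extra projection is ignored).
        return input_embedding
    output_projection = vocab_size * hidden_size      # E_o has shape v x n
    return input_embedding + output_projection


untied = embedding_params(9962, 256, 512)
shared = embedding_params(9962, 256, 512, shared=True)
print(f"{untied / 1e6:.1f}M")   # 7.7M
print(f"{shared / 1e6:.1f}M")   # 2.6M
```

Both results match the $7.7M$ and $2.6M$ quoted in the text, which shows how strongly the vocabulary size dominates the decoder size.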

In this paper, our goals are to i) reduce the complexity of image captioning models without compromising performance; and ii) improve model performance with an attention module without incurring additional computational cost, paving the way for possible real-time deployment on resource-constrained hardware such as embedded and mobile devices. To achieve these goals, we present a simple yet effective framework named COMIC that reduces model complexity while preserving the original accuracy, and at the same time increases model accuracy with an attention module while preserving the model complexity.

Firstly, Radix Encoding is employed as a pre-processing step that allows us to encode a vocabulary of size $v^{d}$ using $v$ symbols. The encoding scheme is designed so that it can be deployed without requiring any changes to existing image captioning models. Secondly, the attention module has become the de facto standard for image captioning, largely due to the success of [4, 7, 8, 9, 10]. However, the attention module usually operates on high-level Convolutional Neural Network (CNN) feature maps that come with a large channel dimension, which increases the model complexity in terms of the RNN input size and weight matrices. To combat this, we refine the feature map projection with weight tying as a down-projection, so that the projected feature map has fewer channels and thus provides a more compact representation via attention. Finally, we adopt multi-head additive attention to improve the effective resolution of the attention module without affecting the original model complexity. With this, COMIC can jointly attend to information from different representation subspaces at different positions. Technically, this is achieved by separating the feature map channels into groups, so that each attention head can attend to different parts of the image separately depending on the channel group. To prevent an increase in computational cost due to the multi-head module, the dimensionality of each attention head is reduced by a factor of $g$, where $g$ is the number of attention heads.

In summary, the core contributions of this work are twofold. Firstly, we propose COMIC, a COMpact Image Captioning model with a vastly reduced vocabulary size (up to $99\times$ smaller) and a multi-head attention module (see Section IV). This is the first such attempt in the image captioning domain, and it opens up a new research angle in this domain. Secondly, we demonstrate the effectiveness of COMIC on two benchmark datasets: MS-COCO [11] and InstaPIC-1.1M [12] (see Section VI-A). We show that COMIC achieves comparable results (a loss of at most $1$ to $2\%$) on BLEU [13], METEOR [14], ROUGE-L [15], CIDEr [16] and SPICE [17] against state-of-the-art (SOTA) methods despite having an embedding vocabulary size that is $39\times$ to $99\times$ smaller. We discuss the technical differences compared to some related works in the next section.

II Related Works

Our work is mostly related to current research on image captioning and compact models. This section reviews the most relevant works on these two topics.

Image captioning. [18] proposed a multimodal log-bilinear model to generate image captions, while [6] used a Bidirectional RNN (BRNN) and Region CNN (R-CNN) to learn a multimodal embedding which is then used by an RNN to generate sentences. [19] proposed to map image features from a CNN to a common word embedding space, and generated sentences using a Long Short-Term Memory (LSTM) network. Their work is extended by Xu et al. [4], who incorporated an attention mechanism, allowing the network to focus on salient objects. Following this, [20] further extended this framework by adding a reviewer stage between the encoder and decoder. Tan and Chan [21, 22] proposed a phrase LSTM model, which has two levels of LSTMs, one to model the sentence composed of phrases, and another to generate words in a phrase. [23] used a multi-instance learning framework to learn 1000 visual detectors as the conditional inputs to a language model, and You et al. [7] enhanced the performance by learning semantic attention on visual attributes. More examples of attribute models include [24, 25]. Park et al. [12] proposed to use context memory to personalise the captions for Instagram images. Wang et al. [26] proposed a deep bidirectional LSTM model to harness history and future context information, which is extended by [27] with the integration of multi-task learning. Dai and Lin [28] proposed Contrastive Learning to encourage distinctiveness of the generated captions. Although most of the aforementioned approaches achieved very promising results, none of these models scales naturally to a large vocabulary size. Most if not all recent image captioning works focus on raw performance by building exotic encoder-decoder style networks with attention, and place little emphasis on reducing the computational cost of their models. In this paper, we introduce to the community a new research direction: a compact model with attention named COMIC.

Compact model. Building a compact model is an ongoing effort in the domain of deep learning [29, 30]. In this paper, we focus on efforts in the field of neural natural language processing, as it is closer to image captioning. There are many existing works involving the use of encoding as a pre-processing step. For instance, Nakagawa [31] proposed a hybrid method for Chinese and Japanese word segmentation, using word-level information for known words, and character-level information for unknown words. [32] studied numerous encoding methods for text classification in Chinese, English, Japanese and Korean. [33] encoded rare words into subword symbols using Huffman encoding. Similarly, [34] proposed using Byte-Pair Encoding (BPE) to segment rare words into subword units. However, it can only be used on English or other languages written with Latin characters. Gillick et al. [35] treated the text as a sequence of variable-length UTF-8 bytes for text sequence annotation. While it does not involve natural language sequence generation, the work allows for a more compact representation of the word sequence. [36] proposed using a CNN as an encoder to produce radical-embeddings for Chinese and Japanese, resulting in a reduced embedding vocabulary size. Li et al. [37] proposed building a word embedding table to factorise each word prediction into a 2-step process. The word embedding table is optimised separately using the minimum cost maximum flow (MCMF) algorithm. In our work, we show that it is possible to achieve good performance (i.e. generate a decent image caption) despite having an extremely small embedding vocabulary size.

Summary. Compared to regular image captioning models, COMIC has vastly fewer learnable parameters, leading to reduced requirements on GPU memory and storage. The work most closely related to ours is LightRNN [37], with a few differences: i) COMIC requires only a single word embedding matrix (as opposed to two in LightRNN); ii) COMIC does not necessitate any changes in the model architecture (LightRNN requires a word embedding table); and iii) LightRNN is applied to language modelling only. On the other hand, our proposed method is orthogonal to compression and pruning based methods such as [38, 39]. Compression methods encode the trained weights of a full CNN into a smaller representation, while pruning methods are applied only after the full dense model has started the training process. In contrast, our method directly reduces the number of learnable parameters in the first place, thus producing a compact model. Moreover, [38, 39] are applied to image classification instead of image captioning. We believe that the aforementioned methods can be applied on top of COMIC to achieve even higher savings in terms of storage and parameters.

III Overall Architecture

Following recent works, we formulate the image captioning task as a translation problem, where a probabilistic model is used to “translate” an image with fixed-size representation into a fully-formed English sentence. As such, we adopt a modified version of the Show, Attend and Tell [4] framework as our model architecture, since it provides good performance on the image captioning task. This model will also serve as the baseline for our experiments. For clarity, we use the terms output projection and output embedding interchangeably, and likewise embedding dimension and word size. All the model size calculations in this paper include only the decoder and attention module (the encoder, i.e. the CNN, is excluded).

Suppose $\{S_{0},\cdots,S_{L}\}$ is a sequence of words. Our model directly maximises the probability of the correct description given an image $I$ using the following formulation:

$\log p(S\,|\,I)=\sum_{t=0}^{L}\log p(S_{t}\,|\,I,\,S_{0:t-1},\,c_{t})$

where $p(S_{t}\,|\,I,\,S_{0:t-1},\,c_{t})$ is the probability of generating the word $S_{t}$ given an image $I$, previous words $S_{0:t-1}$, and context vector $c_{t}$.

Although in principle any RNN can be used, the LSTM cell [40] (with forget bias) is chosen as it has shown SOTA performance on sequential tasks such as translation [41] and image captioning. For an LSTM network with $n$ units, we initialise the hidden state of the LSTM with the image embedding vector through a pre-activation weight layer with layer normalisation (LN) [42]:

$h_{t=-1}=W_{I}\tanh(LN(I_{embed}))$

where $W_{I}\in\mathbb{R}^{n\times z}$ is a weight matrix, $I_{embed}\in\mathbb{R}^{z}$ is the image embedding vector and $LN(\cdot)$ is the LN function.

The attention function used in this paper is the “soft-attention” introduced by [1] and used in [4], where a multilayer perceptron (MLP) with a single hidden layer is employed to calculate the attention weights on a particular feature map. The context vector $c_{t}$ is then concatenated with the embedding of the previously predicted word to serve as input to the LSTM. Finally, a probability distribution over the vocabulary is produced from the hidden state $h_{t}$:

$MLP(f_{j},h_{t-1})=W_{M2}\tanh(LN(W_{M0}f_{j}+W_{M1}h_{t-1}))$

$\alpha_{tj}=\frac{\exp(MLP(f_{j},h_{t-1})/\epsilon)}{\sum_{j}^{|F|}\exp(MLP(f_{j},h_{t-1})/\epsilon)}$

$c_{t}=\sum_{j}^{|F|}(\alpha_{tj}\odot f_{j})$

$x_{t}=[E_{w}S_{t-1},\,c_{t}]$

$h_{t},m_{t}=LSTM(x_{t},h_{t-1},m_{t-1})$

$p_{t}=Softmax(E_{o}h_{t})$

where $E_{w}\in\mathbb{R}^{m\times v}$ and $E_{o}\in\mathbb{R}^{v\times n}$ are the input and output embedding matrices respectively; $W_{M0}\in\mathbb{R}^{k\times r}$, $W_{M1}\in\mathbb{R}^{k\times n}$ and $W_{M2}\in\mathbb{R}^{1\times k}$ are weight matrices; and $[\,,\,]$ is the concatenation operator. $p_{t}$ is the probability distribution over the vocabulary; $m_{t}$ is the memory state; $x_{t}$ is the current input to the LSTM; $c_{t}\in\mathbb{R}^{q}$ is the context vector; $f\in\mathbb{R}^{|F|\times r}$ is the feature map and $f_{j}\in\mathbb{R}^{r}$ is the vector extracted from location $j$; $\alpha_{tj}$ is the attention weight at time step $t$ and location $j$; $S_{t-1}\in\mathbb{R}^{v}$ is the one-hot vector of the previous word; and $\epsilon$ is the softmax temperature.

IV Towards Compact Image Captioning Model

This paper introduces COMIC, a simple yet effective framework that consists of radix encoding, feature map projection weight tying and multi-head additive attention, which work together to build a compact image captioning model with attention without affecting the original accuracy.

IV-A Radix Encoding

The idea of radix encoding is to transform the word indices to a higher base, splitting every word token into $d$ tokens where $d\geq 2$. Although it is possible to reduce the vocabulary size using BPE [34], it can only be used on English and other languages using Latin characters. On the other hand, radix encoding can in theory be used on all languages, including Chinese, Japanese and Korean. For example in Fig. 2, with a base of $v=128$, the word token “a” with an index of $i=0$ is encoded using the two tokens $\hat{i_{0}}=0$ and $\hat{i_{1}}=0$, while the word token “asphalt” with an index of $i=2118$ is encoded using the two tokens $\hat{i_{0}}=16$ and $\hat{i_{1}}=70$ (since $16\times 128+70=2118$). We also define two special tokens, where $\langle$GO$\rangle$ marks the start of a sequence and $\langle$EOS$\rangle$ marks the end of the sequence. For easy decoding, each special token is represented using only one token. $\langle$GO$\rangle$ is assigned an index of $\hat{i}=v$ (one-hot vector $e^{(v)}$) and $\langle$EOS$\rangle$ is assigned an index of $\hat{i}=v+1$ (one-hot vector $e^{(v+1)}$). This enables radix encoding to be used without any modification to existing model architectures. To generate a sequence, one simply runs inference using beam search as usual and applies post-processing to the output tokens. The post-processing can be done either by converting the encoded indices $\hat{i_{0}},\hat{i_{1}}$ back to the base-10 index $i$, or by constructing a decoding tree dictionary.
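The encoding and its inverse can be sketched in a few lines. This is a minimal illustration of the scheme described above, not the authors' implementation; the function names are hypothetical, and the most-significant-token-first ordering is inferred from the “asphalt” example.

```python
from typing import List


def radix_encode(index: int, base: int, digits: int = 2) -> List[int]:
    """Split a word index into `digits` base-`base` tokens, most significant first."""
    if not 0 <= index < base ** digits:
        raise ValueError(f"index {index} does not fit in {digits} base-{base} tokens")
    tokens = []
    for _ in range(digits):
        index, remainder = divmod(index, base)
        tokens.append(remainder)
    return tokens[::-1]


def radix_decode(tokens: List[int], base: int) -> int:
    """Convert a group of base-`base` tokens back to the base-10 word index."""
    index = 0
    for token in tokens:
        index = index * base + token
    return index


def encode_caption(word_indices: List[int], base: int, digits: int = 2) -> List[int]:
    """Wrap the encoded words with single-token <GO> (= base) and <EOS> (= base + 1)."""
    encoded = [base]
    for index in word_indices:
        encoded.extend(radix_encode(index, base, digits))
    encoded.append(base + 1)
    return encoded


print(radix_encode(2118, 128))          # [16, 70], the "asphalt" example
print(encode_caption([0, 2118], 128))   # [128, 0, 0, 16, 70, 129]
```

Because every word uses exactly `digits` tokens and the special tokens lie outside the range `0..base-1`, a decoded sequence can be split back into words without any separator token.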

With this encoding scheme, we re-represent the original corpus vocabulary $V_{o}$ of size $v^{d}$ using an encoded embedding vocabulary $V_{e}$ of size $v$. This leads to a huge reduction in model complexity. For example, the popular MS-COCO dataset often yields a vocabulary of $8,000$ to $10,000$ words, while the InstaPIC-1.1M dataset yields around $22,000$ to $40,000$ words. With the proposed radix encoding, $V_{e}$ can be set to a much smaller size such as $v=128$ or $v=256$, a reduction of almost $39\times$ on the MS-COCO dataset and $99\times$ on the InstaPIC-1.1M dataset. Results are given in Section V-C1, Table I.

IV-B Feature Map Projection Weight Tying

For most image captioning models with visual attention [4, 7, 8, 9, 10], the attention function operates on a higher-level feature map of the CNN in order for the context vector $c_{t}$ to capture a higher-level representation. Such feature maps usually come with a large channel dimension, such as $r=512$ for VGG-16 and $r=832$ for GoogLeNet. This in turn increases the RNN’s input size and weight matrices. To combat this issue, we propose a down-projection on the feature map such that the projected map has fewer channels, given by

$c_{t}=\sum_{j}^{|F|}(\alpha_{tj}\odot W_{f}f_{j})$

where $W_{f}\in\mathbb{R}^{q\times r}$ is a weight matrix, $q$ is the number of channels of the projected feature map and $q\ll r$. As shown in Table II (untied), a small projection size such that $q\ll r$ can reduce the model complexity, and at the same time it provides extra representation power to the language model.

However, the extra projection layer naturally incurs additional computational cost. To alleviate this, we introduce weight sharing between the feature map projection $W_{f}$ and the attention MLP weights $W_{M0}$, given by

$c_{t}=\sum_{j}^{|F|}(\alpha_{tj}\odot W_{M0}f_{j})$

With this, the projected feature map $W_{M0}\,f_{j}$ can be calculated in advance and shared with the attention MLP, so the visual attention module can be introduced in COMIC without incurring extra computational cost relative to conventional approaches. Table II shows that the attention module can thus be included at a lower computational cost without compromising accuracy.
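A minimal pure-Python sketch of single-head additive attention with the tied projection may help make the reuse of $W_{M0}f_{j}$ concrete. All names and the toy dimensions are hypothetical, and the layer normalisation here omits the learned gain and bias, which is a simplification rather than the paper's exact layer.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def matvec(weights: Matrix, x: Vector) -> Vector:
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def layer_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """Layer normalisation without learned gain and bias (a simplification)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def softmax(scores: Vector, temperature: float = 1.0) -> Vector:
    scaled = [s / temperature for s in scores]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def tied_attention(features: Matrix, h_prev: Vector, w_m0: Matrix, w_m1: Matrix,
                   w_m2: Vector, temperature: float = 1.0) -> Tuple[Vector, Vector]:
    # W_M0 f_j does not depend on the time step, so a real decoder could cache it
    # once per image; it is recomputed on every call here only for brevity.
    projected = [matvec(w_m0, f) for f in features]
    h_proj = matvec(w_m1, h_prev)
    scores = []
    for p in projected:
        hidden = [math.tanh(v) for v in layer_norm([a + b for a, b in zip(p, h_proj)])]
        scores.append(sum(w * v for w, v in zip(w_m2, hidden)))
    alpha = softmax(scores, temperature)
    # The same projected vectors feed the context sum (the tied projection).
    context = [sum(a * p[c] for a, p in zip(alpha, projected)) for c in range(len(h_proj))]
    return alpha, context


rng = random.Random(0)
rand = lambda rows, cols: [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]
features = rand(5, 6)   # toy sizes: |F| = 5 locations, r = 6 channels, k = q = 4, n = 3
alpha, context = tied_attention(features, rand(1, 3)[0], rand(4, 6), rand(4, 3), rand(1, 4)[0])
print(round(sum(alpha), 6), len(context))   # 1.0 4
```

Note that tying requires the projected channel count $q$ to equal the MLP hidden size $k$, since one matrix plays both roles.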

IV-C Multi-Head Additive Attention

Multi-head attention [43] separates the feature map channels into groups, where each attention head can attend to different areas of the image separately depending on the channel group. In other words, each location of each channel group is assigned an attention weight by an MLP. This is in contrast to regular single-head attention, which applies attention weights equally across all of the feature map channels, leading to averaging of contextual information from multiple regions.

Technically, for single-head attention, we use a single MLP with hidden size $k$ and obtain a $q$-dimensional context vector $c_{t}$. For multi-head attention with $g$ heads, we use $g$ separately learned MLPs with hidden size $k/g$, with each head producing a $q/g$-dimensional output vector. The output vectors are then concatenated to produce the final $q$-dimensional context vector $c_{t}$. Due to the dimension reduction of each head, the total computational cost is the same as that of single-head attention with full dimensionality.

To take advantage of this, we adapt the multi-head formulation, originally proposed for dot-product attention [43], to additive (MLP) attention for image captioning, increasing the effective resolution of the attention module via an ensemble of attention heads. In practice, we combine the separate MLPs into a single MLP to maximise parallelism. Experiments on the MS-COCO dataset (Table II) show that multi-head attention can improve on the original accuracy at the same model complexity. This is the first time multi-head additive attention is used in image captioning. (Footnote 1: [44] employed multi-head dot-product attention; however, no result on image captioning is provided.)
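The grouping of channels into heads can be sketched as follows. This is one plausible reading of the description above, under stated assumptions: the inputs are the already-projected feature vectors $W_{M0}f_{j}$ and hidden state $W_{M1}h_{t-1}$ (with $k=q$, as in the tied setting), each head owns a contiguous slice of $k/g$ channels, and layer normalisation is applied per head without learned parameters. The function names are hypothetical.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-6) -> Vector:
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def softmax(scores: Vector) -> Vector:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_attention(projected: List[Vector], h_proj: Vector, w_m2: Vector,
                         heads: int) -> Tuple[List[Vector], Vector]:
    """projected: |F| vectors of size k; h_proj, w_m2: vectors of size k."""
    k = len(h_proj)
    if k % heads != 0:
        raise ValueError("the channel count must be divisible by the number of heads")
    size = k // heads
    alphas: List[Vector] = []
    context: Vector = []
    for head in range(heads):
        lo, hi = head * size, (head + 1) * size
        scores = []
        for p in projected:
            pre = [p[c] + h_proj[c] for c in range(lo, hi)]
            hidden = [math.tanh(v) for v in layer_norm(pre)]
            scores.append(sum(w * v for w, v in zip(w_m2[lo:hi], hidden)))
        alpha = softmax(scores)             # one distribution over locations per head
        alphas.append(alpha)
        # Each head contributes k/g channels of the final context vector.
        context.extend(sum(a * p[c] for a, p in zip(alpha, projected)) for c in range(lo, hi))
    return alphas, context


rng = random.Random(0)
projected = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(6)]  # |F| = 6, k = 8
h_proj = [rng.uniform(-1, 1) for _ in range(8)]
w_m2 = [rng.uniform(-1, 1) for _ in range(8)]
alphas, context = multi_head_attention(projected, h_proj, w_m2, heads=4)
print(len(alphas), len(context))   # 4 8
```

The total number of weights touched is the same for any number of heads: the $k$ entries of `w_m2` are simply partitioned among the heads, which is consistent with the claim that the extra heads add no model complexity.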

V Experiments and Discussion

V-A Model Details

The LSTM model is trained in an unrolled form to predict each word of the sentence after it has seen the image, the current context vector and all the preceding words, as given by $p(S_{t}\,|\,I,\,S_{0:t-1},\,c_{t})$. As usual, each word is represented as a one-hot vector $S_{t}$ of dimension equal to the size of the dictionary. The training is performed by minimising the loss w.r.t. all the parameters except those of the image model. To tackle overfitting, we employ dropout at the input and output of the LSTM. Our loss function is the sum of the negative log likelihood of the correct word at each time step, the doubly stochastic attention regularisation employed in [4], and an L2 weight loss, as given below:

$L(I,S)=-\sum_{t}^{L}\log p_{t}(S_{t})+\sum_{j}^{|F|}\left(1-\sum_{t}^{L}\alpha_{tj}\right)^{2}+\lambda\cdot\|\theta\|_{2}^{2}$
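The three terms of the loss described above (negative log likelihood, doubly stochastic attention regularisation, and L2 weight penalty) can be illustrated on toy values. The function name and argument layout are hypothetical, and the weights are passed as a flat list purely for illustration.

```python
import math
from typing import List


def caption_loss(word_probs: List[float], attention: List[List[float]],
                 weights: List[float], weight_decay: float = 1e-5) -> float:
    """word_probs[t] is p_t(S_t) for the correct word; attention[t][j] is alpha_tj."""
    nll = -sum(math.log(p) for p in word_probs)
    num_locations = len(attention[0])
    # Encourage each location's attention weights to sum to 1 over all time steps.
    attention_reg = sum(
        (1.0 - sum(step[j] for step in attention)) ** 2 for j in range(num_locations)
    )
    l2 = weight_decay * sum(w * w for w in weights)
    return nll + attention_reg + l2


loss = caption_loss(
    word_probs=[0.5, 0.5],
    attention=[[0.5, 0.5], [0.5, 0.5]],   # every location receives total weight 1
    weights=[1.0, 2.0],
)
print(round(loss, 6))   # 1.386344
```

In this toy case the attention term vanishes because each location's weights already sum to one across time, leaving $2\ln 2$ plus the small L2 term.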

Unless stated otherwise, all the models used in our experiments have the following basic configuration. All models are implemented using TensorFlow. The image model used in our work is GoogLeNet (InceptionV1) with batch normalisation [45, 46], pre-trained on ImageNet. The input images are resized to $256\times 256$, then randomly flipped and cropped to $224\times 224$ before being fed to the CNN. The image embedding size is $z=1024$. The attention function operates on the “Mixed-4f” map $f\in\mathbb{R}^{196\times 832}$, with an MLP size of $k=512$. The projected feature map for untied models in Table II has $q=512$ channels. The LSTM network consists of a single layer with a hidden state size of $n=512$. The word size is set to $m=256$ dimensions. The optimiser used for training is Adam [47], with a batch size of $32$.

The initial learning rate is set to $1\times 10^{-3}$, and is halved every $4$ epochs until it reaches a minimum of $2\times 10^{-4}$. All models are trained for $20$ epochs. The input and output dropout rates for the LSTM are both set to $0.35$. The weight decay rate is set to $\lambda=1\times 10^{-5}$. All trainable parameters are initialised randomly using Xavier initialisation [48]. For inference, we use beam search in order to better approximate $S=\arg\max_{S'}\,p(S'\,|\,I)$. We use beam size $b=3$ with no length normalisation for all experiments unless noted otherwise. All hyperparameters are chosen based on educated guesses due to limited computational resources.
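The learning-rate schedule described above can be written as a small function. This is a sketch that assumes zero-based epoch numbering and a step-wise halving at epoch boundaries; the paper does not specify these details, and the function name is hypothetical.

```python
def learning_rate(epoch: int, initial: float = 1e-3, halve_every: int = 4,
                  minimum: float = 2e-4) -> float:
    """Halve the rate every `halve_every` epochs, never going below `minimum`."""
    return max(initial * 0.5 ** (epoch // halve_every), minimum)


schedule = [learning_rate(epoch) for epoch in range(20)]
print(schedule[0], schedule[4], schedule[8], schedule[12])   # 0.001 0.0005 0.00025 0.0002
```

Under these assumptions the floor of $2\times 10^{-4}$ is reached from the thirteenth epoch onwards, since the next halving would give $1.25\times 10^{-4}$.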

V-B Experiment Setup

We conducted our experiments on two public English captioning datasets, namely MS-COCO [11] and InstaPIC-1.1M [12]. The MS-COCO dataset contains $123,287$ images, and each image is given at least $5$ captions by different AMT workers. We use the publicly available split (footnote 2: http://cs.stanford.edu/people/karpathy/deepimagesent/) from the work of [6], which uses $5,000$ images for validation and another $5,000$ for testing. The InstaPIC dataset contains $648,761$ images for training and $5,000$ images for testing. Each Instagram image is paired with one user caption. This dataset is challenging, as its captions are natural posts with varying formats. Following [28], we reserved $2,000$ images randomly from the training set for validation.

All the scores are obtained using the publicly available MS-COCO evaluation toolkit (footnote 3: https://github.com/peteanderson80/coco-caption), which computes BLEU [13], METEOR [14], ROUGE-L [15], CIDEr [16] and SPICE [17]. For the sake of brevity, we label BLEU-1 to BLEU-4 as B-1 to B-4, and METEOR, ROUGE-L, CIDEr and SPICE as M, R, C and S respectively. For MS-COCO, we use the publicly available tokenised captions (see footnote 2) [6], filtering out words that occur fewer than $5$ times and truncating sentences longer than $20$ words. For InstaPIC, we use the publicly available tokenisation script (footnote 4: https://github.com/cesc-park/attend2u), and select the $25,595$ most frequently used words as our vocabulary. We also truncate captions longer than $18$ words.
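The MS-COCO caption filtering described above can be sketched as below. The helper names are hypothetical, and two details are assumptions not stated in the text: filtered words are replaced by an `<UNK>` placeholder, and word counts are taken before truncation.

```python
from collections import Counter
from typing import List, Set


def build_vocab(captions: List[List[str]], min_count: int = 5) -> Set[str]:
    """Keep the words that occur at least `min_count` times across all captions."""
    counts = Counter(word for caption in captions for word in caption)
    return {word for word, count in counts.items() if count >= min_count}


def preprocess(captions: List[List[str]], min_count: int = 5, max_len: int = 20,
               unk: str = "<UNK>") -> List[List[str]]:
    """Truncate each caption to `max_len` words and mask out-of-vocabulary words."""
    vocab = build_vocab(captions, min_count)
    return [[w if w in vocab else unk for w in caption[:max_len]] for caption in captions]


toy = [["a", "cat", "sat", "down"], ["a", "cat", "ran"], ["a", "dog"]]
print(preprocess(toy, min_count=2, max_len=3))
# [['a', 'cat', '<UNK>'], ['a', 'cat', '<UNK>'], ['a', '<UNK>']]
```

With the paper's settings one would call `preprocess(captions, min_count=5, max_len=20)` for MS-COCO; the InstaPIC pipeline instead keeps a fixed number of the most frequent words and is not covered by this sketch.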

V-C Ablation Study

V-C1 Tokenisation and encoding

In this section, we examine the effect of introducing the Radix Encoding scheme. From Table I, it can be seen that the regular word-based model performs the best, followed by the Radix models using base-128 and base-64, and finally the character-based model. This ordering can be attributed to the much shorter sequence length when using word tokens, which alleviates the difficulty of learning long-term dependencies. The performance degradation is an expected trade-off of parameter reduction, and we believe the result is still comparable. For instance, the performance gap between the word model and the base-128 radix encoding model is moderate ($2.3\%$ on average), while the number of parameters is reduced drastically (by $62\%$, from $12.2$ M to $4.6$ M). The remaining parameter count is a little over one-third of the original and comparable to that of the character model, yet the radix model obtains better performance than the character model. This shows that radix encoding is able to reduce the complexity of image captioning models without much effect on the original accuracy.
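The two summary figures in this paragraph can be reproduced directly from the Word and Radix base-128 rows of Table I. The snippet below is only a check of that arithmetic, with the scores copied from the table.

```python
METRICS = ["B-1", "B-2", "B-3", "B-4", "M", "R", "C", "S"]
word = [0.704, 0.533, 0.397, 0.295, 0.235, 0.517, 0.880, 0.165]
radix_128 = [0.694, 0.522, 0.386, 0.287, 0.233, 0.509, 0.848, 0.159]

# Relative drop of the radix model with respect to the word model, per metric.
gaps = [100 * (w - r) / w for w, r in zip(word, radix_128)]
mean_gap = sum(gaps) / len(gaps)
param_reduction = 100 * (12.2 - 4.6) / 12.2   # parameter counts in millions

for name, gap in zip(METRICS, gaps):
    print(f"{name}: {gap:.2f}%")
print(f"mean gap: {mean_gap:.1f}%")                      # mean gap: 2.3%
print(f"parameter reduction: {param_reduction:.0f}%")    # parameter reduction: 62%
```

The per-metric view also shows that the drop is not uniform: it is smallest on METEOR and largest on CIDEr and SPICE.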

V-C2 Attention module

In this section, we investigate the effect of different attention configurations by varying the number of attention heads with and without projection weight tying. The models used are as described in Section V-A, but with the word size set to $m=64$. Table II shows that it is possible to introduce a visual attention module in a more compact way without compromising the original accuracy. For instance, when the feature map projection is employed (i.e. tied) in the Radix base-128 model, we found that even with fewer parameters, the extra projection layer contributes a slight improvement in overall performance. This is more obvious when multi-head attention is applied. The same phenomenon is observed in the regular word-based model. Without the projection, multi-head attention often provides little to no benefit compared to regular single-head attention. This can be attributed to the extra projection giving the model the ability to group channels that are relevant to each attention head together, forming contiguous groups.

From our further investigation into the type of projection, we notice two opposite trends. When using single-head attention, the tied models generally perform slightly worse than their untied counterparts. This shows that the benefit obtained from the extra projection is counteracted by the reduction in parameter count. On the other hand, the tied models generally perform better than their untied counterparts when using multi-head attention, despite having fewer parameters. This can be understood as the tied projection layer receiving extra gradient information during training, via weight sharing, from both the multi-head attention module and the RNN.

In terms of the inclusion of multi-head additive attention, we notice that, compared to a single head, models using $8$ heads yield improvements of up to $+4\%$ on the CIDEr score. Furthermore, as shown in Table II, the other metric scores also improve across the board as the number of attention heads increases when the tied projection is used. This is consistent with the findings of [43], where performance on the WMT 2014 English-to-German translation task improves as the number of heads increases (up to $16$ heads). Note that, as mentioned in Section IV-C, this overall performance improvement is essentially free, as each individual head operates on a reduced dimension compared to the single-head case.

VI Quantitative Results

VI-A State-of-the-art comparison

In this section, our COMIC model is a radix encoding model with $8$ attention heads and tied feature map projection, the rest being identical to the baseline. To provide a fair comparison, we trained two sets of baseline models. The first set consists of the standard baselines, named “Baseline” and “Baseline-8”, the latter with $8$ attention heads and without feature map projection; the second set consists of a pair of slim baseline models named “Baseline-S”, whose parameter counts are reduced to match the COMIC models. “Baseline-SC” models have $n=k=160$ and $m=128$ , and “Baseline-SI” models have $n=k=80$ and $m=64$ . We trained all the baselines and COMIC-$v$ models for $30$ epochs, where $v$ denotes the choice of base number. The base number of COMIC is chosen so that the number of tokens needed to encode a word token is $d=2$ and $v^{2}\geq|V_{o}|$ . As the MS-COCO word model has a vocabulary size of $|V_{o}|=9,962$ , a base number of $v=128$ or $v=256$ is sufficient to encode the entire $V_{o}$ while minimising the increase in sequence lengths. On the other hand, the InstaPIC word model has $|V_{o}|=25,598$ , hence larger base numbers $v=160$ and $256$ are used. We would like to note that our metric scores are obtained using a single model instead of an ensemble.
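The constraint $v^{2}\geq|V_{o}|$ can be checked with a short sketch. The helper names below are hypothetical, and the digit order (most significant first) and the direct use of the word index are assumptions made for illustration; the actual encoding scheme is the one defined earlier in the paper.

```python
from typing import List


def min_base(vocab_size: int, d: int = 2) -> int:
    """Smallest integer base v such that v**d >= vocab_size."""
    v = 1
    while v ** d < vocab_size:
        v += 1
    return v


def radix_encode(word_id: int, v: int, d: int = 2) -> List[int]:
    """Write word_id as d base-v digits, most significant digit first."""
    if not 0 <= word_id < v ** d:
        raise ValueError("word_id does not fit in d base-v digits")
    digits = []
    for _ in range(d):
        word_id, remainder = divmod(word_id, v)
        digits.append(remainder)
    return digits[::-1]


def radix_decode(digits: List[int], v: int) -> int:
    """Inverse of radix_encode."""
    word_id = 0
    for digit in digits:
        word_id = word_id * v + digit
    return word_id


# Vocabulary sizes quoted in the text.
coco_min = min_base(9962)       # 100, so v=128 and v=256 both satisfy v**2 >= 9962
insta_min = min_base(25598)     # 160, since 159**2 = 25281 < 25598 <= 160**2 = 25600

tokens = radix_encode(9961, v=128)      # [77, 105]
restored = radix_decode(tokens, v=128)  # 9961
```

Under these assumptions, $v=160$ is the smallest base that covers the InstaPIC vocabulary with $d=2$ tokens per word, which agrees with the base numbers chosen above.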

Tables III and IV show the metric scores achieved by our baselines, COMIC and SOTA methods. On both datasets, our COMIC models perform on par with the baselines despite having a much lower parameter count and vocabulary size. For example, on the MS-COCO dataset, the loss in performance of COMIC-256 is merely $0.45\%$ on CIDEr when compared to the baselines, despite having only $33\%$ of the parameters and a vocabulary size of $258$ ( $39\times$ reduction). On the InstaPIC dataset, the complexity reduction is even more drastic. Despite having far fewer parameters ( $16.7\%$ of baseline) and a vocabulary size of only $258$ ( $99\times$ reduction), COMIC-256 still performs on par with the baseline models and even outperforms them on certain metrics. When compared to the slim baseline models with comparable parameter counts, our COMIC models again perform better. This shows the effectiveness of the proposed methods in reducing model complexity while minimising the impact on overall performance across five different evaluation metrics. Our COMIC models also compare favourably to SOTA approaches, losing moderately to attribute-based approaches on the MS-COCO dataset, and only to the latest CSMN [12] and AACL [28] approaches on the InstaPIC dataset, despite operating on a much more condensed vocabulary.
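As a quick check, the quoted reduction factors follow from the word vocabulary sizes given in Section VI-A and the COMIC-256 vocabulary size of $258$ (rounded to the nearest integer):

$$\frac{9{,}962}{258}\approx 38.6\approx 39\times,\qquad \frac{25{,}598}{258}\approx 99.2\approx 99\times.$$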

In summary, although there is a slight performance drop in some of the metric scores when comparing COMIC against the baselines in Tables III and IV, this degradation is an expected trade-off of parameter reduction, and we believe the results are still comparable. When compared to the SOTA methods (with the exception of ACVT [24], which implemented an attribute dictionary), the overall performance of our proposed model is very competitive on both datasets, in particular against those with architectures similar to ours (i.e. Soft and Hard Attention [4], and Review Net [20]).

VI-B Uniqueness and length of captions

It has been pointed out that multimodal RNN-based approaches tend to reconstruct previously seen captions [51]. Hence, we compare our model with the baselines in terms of the uniqueness and length of the generated captions in Table V. A caption is considered unique if it does not appear in the pre-processed training corpus. From the results, we can see that although COMIC uses an encoded vocabulary, it still generates considerably more unseen (unique) captions than the baselines. The average length of the captions generated by COMIC is also longer than that of the baselines.
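The two statistics described above can be sketched as follows. This is a minimal illustration, assuming captions are already pre-processed strings and that length is counted in whitespace-separated words; the function name and the toy captions are hypothetical and are not taken from Table V.

```python
def caption_stats(generated, training_corpus):
    """Return (percentage of unique captions, average caption length in words).

    A generated caption counts as unique if it does not appear verbatim
    in the pre-processed training corpus.
    """
    seen = set(training_corpus)
    num_unique = sum(1 for caption in generated if caption not in seen)
    total_words = sum(len(caption.split()) for caption in generated)
    return 100.0 * num_unique / len(generated), total_words / len(generated)


# Toy data for illustration only.
train = ["a cat sitting on top of a car", "a dog"]
outputs = [
    "a man standing next to a zebra in a field",  # not in train: unique
    "a cat sitting on top of a car",              # seen in train
]
unique_pct, avg_len = caption_stats(outputs, train)  # 50.0, 9.0
```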

This trend is due to the decoding noise introduced by the radix encoding. In other words, in addition to the long-term dependencies between words, successful generation of captions relies strongly on accurately modelling the short-term dependencies between tokens. This added difficulty, together with exposure bias, accounts for the increased uniqueness and the increased length of the captions generated by our models.

VII Qualitative Results

In this section, we provide some examples of the captions generated by our model in Figs. 3 and 4 for both the MS-COCO and InstaPIC datasets (more results can be found in our supplementary material). We can see that the captions generated by COMIC-256 are grammatically correct and are not affected by the vocabulary encoding scheme. In many cases, COMIC-256 even provides finer details than the baseline when describing the images. For instance, in the first image of Fig. 3, our model properly describes the image content as “a man standing next to a zebra in a field”, while the baseline model is only able to generate “a man is standing next to a zebra”.

To better understand our model, Fig. 5 visualises the multi-head attention maps for different words in the generated caption. Going through each of the attention maps, we can see that our proposed model effectively delegates each attention head to different locations. In other words, each head learns to focus on subjects, objects or background separately. For example, in the first image, we can see that the 3rd head ( $g_{2}$ ) generally attends to the car. Meanwhile, the 4th head ( $g_{3}$ ) is focused on the cat at the beginning, and fades out when the model moves on to the other words. The 5th head ( $g_{4}$ ) attends to the space around the roof of the car, aiding in predicting “on top of”. Finally, the 7th head ( $g_{6}$ ) attends to the boundary between the cat and the car while the model is predicting the word “sitting”. The second image shows similar task assignments. In the third image, which has multiple subjects, we can see that each head can separately attend to the background, the elephant and the people sitting on top.

VIII Conclusion

This paper studied the image captioning problem from a new perspective by presenting COMIC, a compact image captioning model with an attention module. Experiments were conducted on the MS-COCO and InstaPIC-1.1M datasets, and the results showed that the overall performance of COMIC was not affected despite a reduction of $39\times$ to $99\times$ in the vocabulary size. In future work, we would like to investigate the impact of different encoders (i.e. CNN models) such as MobileNets [30] on the overall performance, and to train the radix encoding models in a greedy decoding setting using reinforcement learning methods such as Policy Gradient [52] to avoid the “exposure bias” problem.

Acknowledgement

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

Appendix

In this supplementary material we provide additional visualisations of the attention heads in our Compact Image Captioning (COMIC) model on MS-COCO (in Sec. IX-A) and InstaPIC-1.1M (in Sec. IX-B) datasets. Furthermore, we also show some randomly sampled images with qualitative results in Sec. X.

IX Multi-head Attention Maps

In Figs. 6 to 10, the attention maps of different heads are denoted by $g_{a}$ where $a\in\{0,\dots,7\}$ . Attention maps with the most activity are selected for better visualisation. Going through each of the attention maps, we can see that the models have effectively learned how to delegate each attention head to different tasks. In other words, each head has learned to focus on subjects, objects or background separately.

IX-A MS-COCO Dataset

Fig. 6: We can see that both $g_{1}$ and $g_{2}$ attend to the surroundings of the zebras. While both $g_{5}$ and $g_{6}$ attend to the group of zebras, they provide different context to the language model as they switch on alternately. $g_{6}$ likely provides context for the noun “zebra” while $g_{5}$ provides context for the verb “standing”.

Fig. 7 (top): Here, we can see that both $g_{0}$ and $g_{1}$ attend to the surfboard, but $g_{0}$ also attends to the surrounding ocean, which might provide the general context. In contrast, $g_{1}$ is strongly focused on the surfboard. $g_{2}$ attends to the waves. Lastly, both $g_{5}$ and $g_{6}$ attend to the subject, with $g_{5}$ focusing on the lower body and $g_{6}$ focusing on the head and torso.

Fig. 7 (bottom): Although the model misidentifies the player as male, the attention is focused on the correct regions. $g_{0}$ attends to the player and the court, which provide the general context. $g_{1}$ attends to the cap, racquet, shoes and clothing. This provides the cue on the type of sport, which our model predicted correctly. $g_{2}$ again attends to the background, in particular the court lines. Similar to the surfing example above, both $g_{5}$ and $g_{6}$ attend to the subject, with $g_{5}$ focusing on the lower body and $g_{6}$ focusing on the head and torso.

IX-B InstaPIC-1.1M Dataset

Fig. 8: We can see that $g_{0}$ attends mainly to the sky region, especially when the model is predicting “top” and “world”. $g_{2}$ attends to essentially the entire image, which provides the general context. Lastly, $g_{6}$ attends to both the foreground and faraway regions, which provide cues that the image is a bird’s-eye view of the bay region.

Fig. 9 (Top): It can be seen that $g_{0}$ attends to the background. $g_{5}$ attends to essentially the entire image, which provides the general context. Both $g_{1}$ and $g_{6}$ attend to the dog, with $g_{1}$ being more focused than $g_{6}$ .

Fig. 9 (Bottom): We can see that $g_{1}$ attends to the hair and face of the subject, while both $g_{3}$ and $g_{6}$ attend strongly to the facial regions. $g_{5}$ attends to the entire image.

Fig. 10 (Top): We can clearly observe that $g_{0}$ attends to the sky regions, while $g_{2}$ attends to the tree, road and sun. Both $g_{3}$ and $g_{6}$ attend to the sun, with $g_{3}$ being more focused than $g_{6}$ .

Fig. 10 (Bottom): Here, we can see that $g_{0}$ mainly attends to the plate, while $g_{5}$ attends to the food as a whole. Both $g_{1}$ and $g_{6}$ attend to different regions or food items on the plate.

X Generated Captions

For the generated captions, we provide results from both our COMIC-256 model and the baseline word model in Figures 11 to 15. Captions inside the solid green box are generated by the COMIC-256 model, and captions inside the dashed blue box are generated by the baseline method. We can see that for most images, our proposed method matches or in some cases outperforms the baseline method. For instance, the captions generated by the COMIC-256 model are grammatically correct, which shows that it is not affected by the vocabulary encoding scheme. In some cases, the COMIC-256 model provides finer details than the baseline when describing the images. Finally, we demonstrate the ability of our proposed method to generate variable-length captions.

We also explicitly chose some failure examples in which the COMIC-256 model performs no better than the baseline method, shown in Figure 12 for the MS-COCO dataset and Figure 15 for the InstaPIC-1.1M dataset. We can see that incorrect recognition of objects, or missing the main objects in the image, is still the dominant cause of error.

X-A MS-COCO Dataset

X-B InstaPIC-1.1M Dataset

Bibliography


[1] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proceedings of the International Conference on Learning Representations (ICLR), 2015.
[2] K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” in Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
[3] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 3104–3112.
[4] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in International Conference on Machine Learning (ICML), 2015, pp. 2048–2057.
[5] O. Press and L. Wolf, “Using the output embedding to improve language models,” arXiv preprint arXiv:1608.05859, 2016.
[6] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3128–3137.
[7] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, “Image captioning with semantic attention,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4651–4659.
[8] K. Fu, J. Jin, R. Cui, F. Sha, and C. Zhang, “Aligning Where to See and What to Tell: Image Captioning with Region-based Attention and Scene-specific Contexts,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2321–2334, 2017.