Scene-based Factored Attention for Image Captioning

Chen Shen; Rongrong Ji; Fuhai Chen; Xiaoshuai Sun; Xiangming Li

arXiv:1908.02632 · cs.CV · September 4, 2019

TL;DR

This paper introduces a scene-based factored attention module for image captioning that leverages scene concepts to improve caption quality, outperforming existing methods on the Microsoft COCO benchmark.

Contribution

It proposes a novel scene-based factored attention mechanism that explicitly incorporates scene concepts into image captioning models, enhancing their semantic understanding.

Findings

1. Significant performance improvement on the Microsoft COCO benchmark
2. Outperforms state-of-the-art approaches across multiple metrics
3. Effective integration of scene concepts into attention mechanisms

Tables

Table 1: Ablation study results on the MS COCO Karpathy test split. "VC" denotes adding traditional visual concepts attention, and "Scene" denotes adding the factored attention module. "[VC, Scene]" denotes concatenating the visual concepts and scene concepts and attending to them in the same way as "VC".

Model	Bleu1	Bleu2	Bleu3	Bleu4	METEOR	ROUGE	CIDEr	SPICE
Baseline	0.764	0.602	0.460	0.349	0.269	0.559	1.088	0.201
Baseline + VC	0.765	0.605	0.468	0.359	0.274	0.564	1.131	0.205
Baseline + Scene	0.776	0.616	0.473	0.359	0.271	0.568	1.124	0.205
Baseline + [VC, Scene]	0.776	0.618	0.476	0.361	0.272	0.567	1.132	0.208
Baseline + VC + Scene	0.776	0.618	0.477	0.367	0.277	0.570	1.147	0.209

Table 2: Single-model image captioning performance on the MS COCO Karpathy test split. Results are reported for models trained with the standard MLE loss (top) and for RL-based methods (bottom). Numbers in boldface are the best known results and underlined numbers are the second best.

Model	Bleu1	Bleu2	Bleu3	Bleu4	METEOR	ROUGE	CIDEr	SPICE
NIC[45]	0.663	0.423	0.277	0.183	0.237	-	0.855	-
Soft-Attention[51]	0.707	0.492	0.344	0.243	0.239	-	-	-
Hard-Attention[51]	0.718	0.504	0.357	0.250	0.230	-	-	-
ATT[54]	0.709	0.537	0.402	0.304	0.243	-	-	-
LSTM-A5[53]	0.730	0.565	0.429	0.325	0.251	0.538	0.986	-
ARNet[10]	0.740	0.576	0.440	0.335	0.261	0.546	1.034	0.190
LTG-Review-Net[21]	0.743	0.579	0.442	0.336	0.261	0.548	1.039	-
Up-Down[2]	0.772	-	-	0.362	0.270	0.564	1.135	0.203
DA[16]	0.758	-	-	0.357	0.274	0.562	1.119	0.205
Ours	0.776	0.618	0.477	0.367	0.277	0.570	1.147	0.209
SCST:Att2in[37]	-	-	-	0.313	0.260	0.543	1.013	-
SCST:Att2all[37]	-	-	-	0.300	0.259	0.534	0.994	-
BAM[6]	-	-	-	0.350	0.262	0.559	1.111	-
ATTN+C+D(1)[30]	-	-	-	0.363	0.273	0.571	1.141	0.211
Up-Down[2]	0.798	-	-	0.363	0.277	0.569	1.201	0.214
DA[16]	0.799	-	-	0.375	0.285	0.582	1.256	0.223
Ours	0.803	0.646	0.601	0.381	0.285	0.582	1.268	0.220

Table 3: Quantitative comparison with state-of-the-art image captioning methods under the c5 and c40 settings, evaluated on the online MS COCO server. Both SCST:Att2all and Up-Down are ensembles of 4 models, while ours is a single model. LSTM-A3 uses ResNet-152 based visual features. Numbers in bold are the best and underlined numbers are the second best.

Model	Bleu1		Bleu2		Bleu3		Bleu4		METEOR		ROUGE		CIDEr
Model	c5	c40	c5	c40	c5	c40	c5	c40	c5	c40	c5	c40	c5	c40
Google NIC[45]	0.713	0.895	0.542	0.802	0.407	0.694	0.309	0.587	0.254	0.346	0.530	0.682	0.943	0.946
ATT[54]	0.731	0.901	0.565	0.816	0.424	0.710	0.316	0.600	0.251	0.336	0.535	0.683	0.944	0.959
Review Net[52]	0.720	0.900	0.550	0.812	0.414	0.705	0.311	0.597	0.256	0.347	0.535	0.686	0.965	0.969
Adaptive[29]	0.748	0.920	0.584	0.845	0.444	0.744	0.336	0.637	0.264	0.359	0.555	0.705	1.042	1.059
PG-BCMR[28]	0.754	0.918	0.591	0.841	0.445	0.738	0.332	0.624	0.257	0.340	0.550	0.695	1.013	1.031
SCST:Att2all[37]	0.781	0.937	0.619	0.860	0.470	0.759	0.352	0.645	0.270	0.355	0.563	0.707	1.147	1.167
LSTM-A3[53]	0.787	0.937	0.627	0.867	0.476	0.765	0.356	0.652	0.270	0.354	0.564	0.705	1.160	1.180
DA[16]	0.794	0.944	0.635	0.880	0.487	0.784	0.368	0.674	0.282	0.370	0.577	0.722	1.205	1.220
Up-Down[2]	0.802	0.952	0.641	0.888	0.491	0.794	0.369	0.685	0.276	0.367	0.571	0.724	1.179	1.205
Ours	0.803	0.947	0.647	0.887	0.500	0.797	0.379	0.690	0.282	0.372	0.581	0.730	1.235	1.256

Equations

$$h_{t}=\mathrm{LSTM}(x_{t},h_{t-1}), \tag{1}$$

$$p^{1}(y_{t}\mid y_{1:t-1})=\mathrm{Softmax}(W_{y}h^{1}_{t}), \tag{10}$$

$$p^{2}(y_{1:T})=\prod_{t=1}^{T}p^{2}(y_{t}\mid y_{1:t-1}), \tag{14}$$

$$\hat{v}_{t}=[\hat{v}_{conv,t},\hat{v}_{obj,t}], \tag{23}$$

$$L_{MLE}(\theta)=-\sum_{t=1}^{T}\log p(\bar{y}_{t}\mid\bar{y}_{1:t-1}), \tag{24}$$

$$\begin{aligned}L_{MLE}(\theta)&=\gamma\cdot L^{1}_{MLE}(\theta)+(1-\gamma)\cdot L^{2}_{MLE}(\theta)\\&=-\gamma\cdot\sum_{t=1}^{T}\log p^{1}(\bar{y}_{t}\mid\bar{y}_{1:t-1})-(1-\gamma)\cdot\sum_{t=1}^{T}\log p^{2}(\bar{y}_{t}\mid\bar{y}_{1:t-1}),\end{aligned} \tag{25}$$

$$L_{R}(\theta)=-\mathbb{E}_{y^{s}_{1:T}\sim p^{2}}[r(y^{s}_{1:T})], \tag{26}$$

$$\nabla_{\theta}L_{R}(\theta)\approx-(r(y^{s}_{1:T})-r(\hat{y}_{1:T}))\nabla_{\theta}\log p^{2}(y^{s}_{1:T}). \tag{27}$$

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory

Full text

Scene-based Factored Attention for Image Captioning

Chen Shen (1), Rongrong Ji (1, 2), Fuhai Chen (1), Xiaoshuai Sun (1), Xiangming Li (1)

(1) Media Analytics and Computing Lab, Department of Artificial Intelligence, School of Informatics, Xiamen University, 361005, China.

(2) Peng Cheng Laboratory, Shenzhen, China.

[email protected], [email protected],

{cfh3c.xmu, xiaoshuaisun.hit}@gmail.com, [email protected] Corresponding author.

Abstract

Image captioning has attracted ever-increasing research attention in the multimedia community. Most cutting-edge works rely on an encoder-decoder framework with attention mechanisms, which has achieved remarkable progress. However, such a framework does not use scene concepts when attending to visual information, which leads to sentence bias in caption generation and degrades performance accordingly. We argue that scene concepts capture higher-level visual semantics and serve as an important cue in describing images. In this paper, we propose a novel scene-based factored attention module for image captioning. Specifically, the proposed module first embeds the scene concepts into factored weights explicitly and attends to the visual information extracted from the input image. Then, an adaptive LSTM is used to generate captions for specific scene types. Experimental results on the Microsoft COCO benchmark show that the proposed scene-based attention module substantially improves model performance and outperforms the state-of-the-art approaches under various evaluation metrics.

1 Introduction

Describing what is in an image, known as image captioning, is a challenging task that attracts increasing attention in multimedia research. To translate images into sentences, an encoder-decoder architecture is typically adopted for image captioning [45, 51, 46], and it has achieved promising performance. Recent works in image captioning favor attention mechanisms, which let the captioning model dynamically focus on different regional features as needed, rather than being locked to a static image representation. Since object-centered visual concepts have proven effective in visual recognition [35], some captioning methods [49, 53, 15] also selectively attend to a set of detected object-centered visual concepts. These concepts are then combined into the hidden states of a recurrent neural network (RNN) for dynamic caption generation.

Despite the exciting recent progress, those works model attention based on either regional features or object-centered visual concepts. Attention driven by scene concepts has never been explicitly considered, although it plays a very important role in determining the major keywords of captions. As shown in the left case of Fig. 1 (top), it is better to say "a person is laying" than "a person is sleeping" when the scene is obviously outdoors. (Footnote 1: Due to the variance of crowdsourced labeling, the word "laying" is used more frequently than "lying" in the captions of the MS COCO dataset.) By contrast, in the right case of Fig. 1 (top), when the photo is taken in a room with a man lying down, a caption such as "a man is sleeping" is more likely. Clearly, scene concepts have a considerable influence on word generation.

It is intuitive to introduce scene cues into image captioning. A possible way to leverage the scene cues is to apply semantic concept attention. For example, one can follow You et al. [54] and attend to scene cues as semantic concepts. Nevertheless, visual information is hierarchical [25], which makes this approach suboptimal. As the word probability distributions in Fig. 1 (middle) show, after a partial sentence has been generated for the images in Fig. 1 (top), the model with scene semantic attention is still uncertain whether to choose the word "laying" or "sleeping". We argue that scene concepts and object-centered visual concepts should not be treated equally, since scene concepts contain more global and macroscopic context information than object-centered visual concepts. The attention module therefore needs a more explicit mechanism that uses scene concepts as core guidance.

In this paper, we argue that the fundamental issue lies in the explicit and separate modeling of scene concepts, object-centered visual concepts and sentence generation. On the one hand, scene concepts usually correspond to the attribute keywords in captions. On the other hand, the context of scene concepts can guide attention over object-centered visual concepts while a sentence is generated. Driven by these insights, we propose a novel scene-based factored attention module for image captioning. The framework of the proposed method is illustrated in Fig. 2. To fully encode the input image, we first integrate the hierarchical visual information (including regional features, object-centered visual concepts and scene concepts) to enrich keywords and details in caption generation. Then, we design a scene-based factored attention module to attend to the hierarchical visual information. Specifically, we embed scene concepts into the hidden feature of an LSTM [20]. Conditioned on the scene-embedded hidden feature, the module determines which regional features and object-centered visual concepts are more important by assigning the corresponding weights. Finally, the outputs of the factored attention module are fed into a second LSTM to generate the next word. As shown in Fig. 1 (bottom), our model with scene-based factored attention is more confident about the chosen words.

The contributions of this paper are summarized as follows: (1) We are the first to explicitly embed scene concepts in image captioning. We are also the first to explicitly model relevance among scene concepts, object-centered visual concepts and caption generation. (2) We propose a factored attention module to better perceive the hierarchical visual information. Quantitative comparisons to the state-of-the-art demonstrate our merits.

2 Related Work

Our work relates to three topics: image captioning, tensor factorization and scene understanding. In this section, we categorize and review related work as follows.

2.1 Image Captioning

Most existing image captioning methods rely on the encoder-decoder framework inspired by machine translation [3, 41]. The framework is used to "translate" an image into a sentence, where the visual features are extracted by a convolutional neural network (CNN) and fed into a Long Short-Term Memory (LSTM) network to generate captions. Image captioning techniques have been extensively explored in [22, 45, 31, 9, 4, 5]. A few models [51, 54, 2] apply attention mechanisms to bridge the gap between visual understanding and language processing. Prior attention mechanisms rely on either regional convolutional features or object-centered visual concepts extracted from images. The former allows the model to dynamically select regional features during sentence generation. The latter, such as semantic attention [54, 49], applies top-down attention to detected object-centered visual concepts. However, these object-centered visual concepts have two major drawbacks. First, they do not retain spatial information and scene guidance, which may make captions miss scene keywords and scene details. Second, they do not take the hierarchy of semantics into account, which may lead to sentence bias. As demonstrated in our experiments, considering the hierarchical semantic concepts at the scene and object levels can better guide attention selection and caption generation.

2.2 Tensor Factorization

Tensor factorization has been used in many multimedia tasks, such as attributes learning [32], motion style modeling [43], image transformations [40] and sequence learning [39, 50]. Recently, tensor factorization has been widely used in [24, 13, 15, 14], which can further improve the model performance. More specifically, Kiros et al. [24] used factored tensor to guide word embedding with visual features. Fu et al. [13] inferred a topic vector (named scene vector) for tensor factorization in LSTM. Gan et al. [15] used factorization to remedy dimension explosion. Gan et al. [14] introduced factored LSTM to learn different style captions. In contrast to these works, we use tensor factorization not only to explicitly model the relevance among visual information and sentence generation, but also to guide the attention selection mechanism.

2.3 Scene Understanding

In the last few years, CNNs have emerged as powerful image representations for scene classification [33, 48, 38, 26, 55, 47, 17]. Thanks to the development of Scene-15, MIT Indoor-67, SUN-397 and Place datasets [56], the well-known scene classification task has been pushed forward with great progress and gradually weeded out hand-crafted features. Recently, deep convolutional networks have been exploited for scene classification by Zhou et al. [56]. We take full advantage of the recent scene understanding methods to help improve the quality of caption generation.

3 The Proposed Model

First, a set of hierarchical visual information, i.e., regional features $v_{conv}$, object-centered visual concepts $v_{obj}$ and scene concepts $v_{scene}$, is extracted from the input image. Second, the scene-based factored attention module embeds the scene concepts $v_{scene}$ into the current hidden feature $h^{1}_{t}$ of the first LSTM to attend to the regional features $v_{conv}$ and the object-centered visual concepts $v_{obj}$. Finally, the weighted visual information is fed into the second LSTM to generate the next word.

In Sec. 3.1, we briefly introduce the basic architecture of our proposed image captioning method. Then, in Sec. 3.2, we introduce the factored attention module in detail. Finally, in Sec. 3.3, we introduce the objective function used in our work.

3.1 Caption Generation

Long Short-Term Memory (LSTM) [20] is a widely-used Recurrent Neural Network (RNN), which is known to learn patterns with long-term temporal dependencies. We briefly refer to the operation of the LSTM over a single time step using the following notation:

$$h_{t}=\mathrm{LSTM}(x_{t},h_{t-1}), \tag{1}$$

where $x_{t}$ is the input vector of LSTM, and $h_{t}$ is the hidden feature of LSTM.

The hidden feature at time step $t$ can be calculated via Eq. 1, formulated as follows:

[Eqs. 2 to 7: $i_{t}$, $f_{t}$, $o_{t}$, $g_{t}$, $m_{t}$, $h_{t}$]

where $i_{t},f_{t},o_{t},m_{t}$ and $h_{t}$ are the input gate, forget gate, output gate, memory cell and hidden feature, respectively. $\sigma$ and $\odot$ denote the sigmoid function and the element-wise Hadamard product, respectively. For brevity, we omit all bias terms in the rest of the paper.

The core of the LSTM is a memory cell $m_{t}$ that maintains the multi-modal knowledge of the inputs $x_{t}$ observed up to time step $t$. Updates to the memory cell $m_{t}$ are modulated by three gates, i.e., the input gate $i_{t}$, the output gate $o_{t}$ and the forget gate $f_{t}$, which determine when and how the information flows. Specifically, the input gate $i_{t}$ controls the input of the LSTM. The output gate $o_{t}$ manages the transfer of the memory $m_{t}$ to the hidden feature $h_{t}$ of the LSTM, which is used to generate the next word. The forget gate $f_{t}$ decides whether to forget the previous memory $m_{t-1}$.
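The gate description above can be sketched as a single bias-free LSTM step. This is a minimal illustration that assumes the standard LSTM formulation; the weight names `W` and `U`, the `params` layout, and the use of `g_t` as the tanh candidate update are assumptions for illustration, not the authors' exact parameterization.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def lstm_step(x_t, h_prev, m_prev, params):
    """One bias-free LSTM step (standard form, assumed here).

    params maps each of 'i', 'f', 'o', 'g' to a pair (W, U), where W acts
    on the input x_t and U acts on the previous hidden feature h_{t-1}.
    """
    def pre(name):
        W, U = params[name]
        return [a + b for a, b in zip(matvec(W, x_t), matvec(U, h_prev))]

    i_t = [sigmoid(z) for z in pre("i")]    # input gate
    f_t = [sigmoid(z) for z in pre("f")]    # forget gate
    o_t = [sigmoid(z) for z in pre("o")]    # output gate
    g_t = [math.tanh(z) for z in pre("g")]  # candidate update
    # Memory cell: keep part of the old memory, add part of the candidate.
    m_t = [f * m + i * g for f, m, i, g in zip(f_t, m_prev, i_t, g_t)]
    # Hidden feature: output gate applied to the squashed memory.
    h_t = [o * math.tanh(m) for o, m in zip(o_t, m_t)]
    return h_t, m_t

# Toy check: with all-zero weights every gate is 0.5 and the candidate is 0.
zeros = ([[0.0, 0.0]], [[0.0]])  # W is 1x2 (input size 2), U is 1x1 (hidden size 1)
params = {name: zeros for name in ("i", "f", "o", "g")}
h_t, m_t = lstm_step([1.0, 2.0], [0.0], [1.0], params)
print(h_t, m_t)
```

In the toy check the forget gate halves the previous memory, which makes the role of each gate easy to verify by hand.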

Our captioning model consists of two LSTM layers, referred to as the first LSTM and the second LSTM. The superscript of a variable in the equations indicates which LSTM layer it belongs to. The first LSTM generates a hidden feature $h^{1}_{t}$ for the current sequence based on its input, which contains the partial sequence generated so far, the current input word and the context information of the second LSTM. It is formulated as follows:

[Eqs. 8 to 9: $x^{1}_{t}$, $h^{1}_{t}$]

where $W_{e}\in\mathbb{R}^{E\times Q}$ is a word embedding matrix for a vocabulary of size $Q$, and $z_{t}$ is the one-hot vector of the input word at time step $t$.

We define the notation $y_{1:T}$ as a sequence of words $(y_{1},y_{2},...,y_{T})$, and obtain the first conditional word probability distribution at time step $t$ as follows:

$$p^{1}(y_{t}\mid y_{1:t-1})=\mathrm{Softmax}(W_{y}h^{1}_{t}), \tag{10}$$

where $W_{y}\in\mathbb{R}^{Q\times H}$ is a learned weight matrix. Note that the output $p^{1}(y_{t}|y_{1:t-1})$ is a distribution of words only for loss optimization in training. The details will be described in Sec. 3.3.

In our proposed scene-based factored attention module, at each time step $t$ , we use the current hidden feature $h^{1}_{t}$ to get the attentive weighted visual information $\hat{v}_{t}$ , where the details will be described in Sec. 3.2.

We devise the second LSTM layer to make use of the weighted visual information $\hat{v}_{t}$ to generate a word at each time step $t$, which is formulated as:

[Eqs. 11 to 13: $x^{2}_{t}$, $h^{2}_{t}$, $p^{2}(y_{t}\mid y_{1:t-1})$]

where $W_{y}\in\mathbb{R}^{Q\times H}$ is a learned weight matrix. The output $p^{2}(y_{t}|y_{1:t-1})$ is the second distribution over words, which not only participates in loss optimization during training but is also used on its own to sample words at test time. The distribution of the whole generated caption $y_{1:T}$ is calculated as the product of the conditional distributions:

$$p^{2}(y_{1:T})=\prod_{t=1}^{T}p^{2}(y_{t}\mid y_{1:t-1}), \tag{14}$$

3.2 Scene-based Factored Attention Module

In order to take full advantage of scene concepts and to model hierarchical semantic concepts, we further propose a factorization method that embeds scene concepts into the attention mechanism.

We first obtain a diagonal matrix $S\in\mathbb{R}^{s\times s}$ by direct diagonalization of the scene concepts $v_{scene}\in\mathbb{R}^{s}$. This diagonal scene matrix $S$ is then embedded into the LSTM hidden feature $h^{1}_{t}$ by factorizing the parameters $W_{h}$ of the traditional attention mechanism [51, 2] into three matrices $U_{h}$, $S$, $V_{h}$:

$$S=\mathrm{diag}(v_{scene}), \tag{15}$$

$$W_{h}=U_{h}SV_{h}, \tag{16}$$

where $U_{h}\in\mathbb{R}^{M\times s}$ and $V_{h}\in\mathbb{R}^{s\times H}$ are learned weight matrices that are shared by all images and scene concepts.
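As a hedged sketch of the factorization described above, the following code builds $W_{h}$ from $U_{h}$, the scene vector and $V_{h}$, assuming the product form $U_{h}\,\mathrm{diag}(v_{scene})\,V_{h}$ implied by the stated shapes. The function names and the toy values are hypothetical. It also shows the equivalent, cheaper route of scaling $V_{h}h$ element-wise by the scene vector instead of materializing the $M\times H$ matrix.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def factored_weight(U_h, v_scene, V_h):
    """Return W_h = U_h @ diag(v_scene) @ V_h as a list of rows.

    Shapes: U_h is M x s, v_scene has length s, V_h is s x H.
    """
    s = len(v_scene)
    H = len(V_h[0])
    return [[sum(U_row[k] * v_scene[k] * V_h[k][j] for k in range(s))
             for j in range(H)]
            for U_row in U_h]

def apply_factored(U_h, v_scene, V_h, h):
    """Compute W_h @ h without forming W_h: U_h @ (v_scene * (V_h @ h))."""
    projected = matvec(V_h, h)                             # length s
    scaled = [c * p for c, p in zip(v_scene, projected)]   # scene gating
    return matvec(U_h, scaled)                             # length M

# Toy values (hypothetical): M = 3, s = 2 scene concepts, H = 2.
U_h = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.0]]
V_h = [[1.0, 0.0], [0.0, 1.0]]
v_scene = [0.9, 0.1]   # e.g. scene classifier scores
h = [1.0, 2.0]

W_h = factored_weight(U_h, v_scene, V_h)
direct = matvec(W_h, h)
cheap = apply_factored(U_h, v_scene, V_h, h)
print(W_h, direct, cheap)
```

Because $S$ is diagonal, each scene score simply rescales one of the $s$ shared factors, so a different scene vector yields a different effective $W_{h}$ while $U_{h}$ and $V_{h}$ stay fixed.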

The factored $W_{h}$ is used to transform the hidden feature $h^{1}_{t}$, which injects the context of the scene concepts directly. In this way, the hidden feature $h^{1}_{t}$ obtains the context of the scene. Given the regional features $v_{conv}=\{v_{1},...,v_{L}\},v_{i}\in\mathbb{R}^{C}$, we generate the first normalized attention weight $\alpha_{t}$ as follows:

[Eqs. 17 to 19: $a_{i,t}$, $\alpha_{t}$, $\hat{v}_{conv,t}$]

where $W_{va}\in\mathbb{R}^{H\times V}$ and $W_{a}\in\mathbb{R}^{H}$ are the learned weight matrices.

Similarly, given the object-centered visual concepts $v_{obj}\in\mathbb{R}^{V}$ , the second normalized attention weight $\beta_{t}$ is generated as follows:

[Eqs. 20 to 22: $b_{t}$, $\beta_{t}$, $\hat{v}_{obj,t}$]

where $W_{vb}\in\mathbb{R}^{H\times V}$ and $W_{b}\in\mathbb{R}^{H}$ are the learned weight matrices.

Finally, the weighted regional features $\hat{v}_{conv,t}$ and the weighted object-centered visual concepts $\hat{v}_{obj,t}$ are concatenated via Eq. 23 and fed into the second LSTM in Eq. 11 and Eq. 12.

$$\hat{v}_{t}=[\hat{v}_{conv,t},\hat{v}_{obj,t}], \tag{23}$$
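The regional attention branch described in this section can be sketched as follows. The score function is assumed here to be the usual additive (tanh) attention of [51, 2], with the hidden feature transformed by $W_{h}$; the exact score and dimension conventions of the paper are not reproduced in the text above, so the names `W_v`, `W_h`, `w_a` and their shapes are illustrative assumptions.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(features, h1, W_v, W_h, w_a):
    """Additive attention (assumed form): a_i = w_a . tanh(W_v v_i + W_h h1).

    Returns the normalized weights and the weighted sum of the features.
    """
    query = matvec(W_h, h1)  # scene-aware when W_h is the factored matrix
    scores = []
    for v in features:
        key = matvec(W_v, v)
        scores.append(sum(w * math.tanh(k + q)
                          for w, k, q in zip(w_a, key, query)))
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(a * v[d] for a, v in zip(weights, features))
              for d in range(dim)]
    return weights, pooled

# Toy example (hypothetical values): two identical regions get equal weight.
regions = [[1.0, 2.0], [1.0, 2.0]]
W_v = [[0.5, 0.0], [0.0, 0.5]]
W_h = [[1.0], [0.0]]   # stands in for the factored U_h S V_h
w_a = [1.0, -1.0]
alpha, v_conv_hat = attend(regions, [0.3], W_v, W_h, w_a)
print(alpha, v_conv_hat)
```

Per the text, the object-centered branch is computed analogously with its own weights, and the two weighted results are concatenated before being fed to the second LSTM.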

3.3 Objective Function

Given a target ground-truth sequence $\bar{y}_{1:T}$ and a model with parameters $\theta$, we minimize the following maximum likelihood estimation (MLE) loss:

$$L_{MLE}(\theta)=-\sum_{t=1}^{T}\log p(\bar{y}_{t}\mid\bar{y}_{1:t-1}), \tag{24}$$

In order to regularize the first LSTM more directly, we calculate the loss for both LSTMs as:

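$$\begin{aligned}L_{MLE}(\theta)&=\gamma\cdot L^{1}_{MLE}(\theta)+(1-\gamma)\cdot L^{2}_{MLE}(\theta)\\&=-\gamma\cdot\sum_{t=1}^{T}\log p^{1}(\bar{y}_{t}\mid\bar{y}_{1:t-1})-(1-\gamma)\cdot\sum_{t=1}^{T}\log p^{2}(\bar{y}_{t}\mid\bar{y}_{1:t-1}),\end{aligned} \tag{25}$$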

where $\gamma$ is a hyper-parameter between 0 and 1.
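A small sketch of the weighted two-LSTM objective: the loss is a convex combination of the negative log-likelihoods under the first and second word distributions. The per-step distributions below are made-up toy values, and the function names are hypothetical.

```python
import math

def nll(step_probs, target_ids):
    """Negative log-likelihood of the target words.

    step_probs[t] is the predicted distribution over the vocabulary at
    step t; target_ids[t] is the index of the ground-truth word.
    """
    return -sum(math.log(p[y]) for p, y in zip(step_probs, target_ids))

def joint_mle_loss(probs1, probs2, target_ids, gamma):
    """gamma * L^1_MLE + (1 - gamma) * L^2_MLE for one caption."""
    return (gamma * nll(probs1, target_ids)
            + (1.0 - gamma) * nll(probs2, target_ids))

# Toy two-word caption over a two-word vocabulary (values are illustrative).
probs1 = [[0.5, 0.5], [0.25, 0.75]]   # from the first LSTM
probs2 = [[0.8, 0.2], [0.1, 0.9]]     # from the second LSTM
targets = [0, 1]

loss = joint_mle_loss(probs1, probs2, targets, gamma=0.5)
print(loss)
```

Setting `gamma` to 0 recovers the loss of the second LSTM alone, and setting it to 1 trains only through the first distribution.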

Finally, we also introduce the reinforcement learning (RL) method into our framework for a fair comparison with recent RL-based works such as [37, 2, 6, 30, 16]. (Footnote 2: It should be noted that our scene-based factored attention module can be broadly used in other RL-based or GAN-based methods [11, 7].) We minimize the negative expected reward after MLE training:

$$L_{R}(\theta)=-\mathbb{E}_{y^{s}_{1:T}\sim p^{2}}[r(y^{s}_{1:T})], \tag{26}$$

where $y^{s}_{1:T}$ is a sampled caption and $r$ is the CIDEr [44] reward function. Similar negative expected reward functions have proven effective in other works [19, 37, 2].

Following the Self-critical Sequence Training (SCST) [37], the gradient of $L_{R}(\theta)$ can be approximated:

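$$\nabla_{\theta}L_{R}(\theta)\approx-(r(y^{s}_{1:T})-r(\hat{y}_{1:T}))\nabla_{\theta}\log p^{2}(y^{s}_{1:T}). \tag{27}$$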

where $y^{s}_{1:T}$ is a sampled caption and $\hat{y}_{1:T}$ is the caption obtained by greedy decoding, whose reward serves as the baseline.
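The self-critical update can be read as minimizing a surrogate whose gradient matches the approximation above when the two rewards are treated as constants. The sketch below is illustrative only: the function name is hypothetical, and the reward values stand in for CIDEr scores rather than being computed.

```python
def scst_surrogate(sample_logprob, sample_reward, greedy_reward):
    """Surrogate loss for one sampled caption.

    Differentiating this with respect to the model parameters (through
    sample_logprob only) gives
    -(r(y^s) - r(y_hat)) * grad log p^2(y^s).
    """
    advantage = sample_reward - greedy_reward  # reward relative to the greedy baseline
    return -advantage * sample_logprob

# Toy numbers: the sampled caption beats the greedy one.
better = scst_surrogate(sample_logprob=-2.0, sample_reward=1.2, greedy_reward=1.0)
# Toy numbers: the sampled caption is worse than the greedy one.
worse = scst_surrogate(sample_logprob=-2.0, sample_reward=0.8, greedy_reward=1.0)
print(better, worse)
```

When the sample scores above the greedy baseline, lowering the surrogate raises the sample's log-probability; when it scores below, the same step lowers it.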

4 Experiments

In this section, we conduct extensive experiments to validate the effectiveness of the scene-based factored attention module. In Section 4.1, we briefly introduce the dataset, the pre-processing of images and captions, the evaluation metrics used in the experiments and the implementation details. Next, in Section 4.2, we discuss the ablation study of the proposed model. Then, in Section 4.3, we compare the results of the proposed model with other state-of-the-art image captioning models, both offline and online. Finally, in Section 4.4, we qualitatively analyze our merits in detail.

4.1 Experimental Settings

4.1.1 Dataset

In this paper, we utilize the MS COCO dataset [8], which has been widely used for training and evaluating image captioning models. The MS COCO dataset contains 123,287 images. Each image in the dataset is given at least five captions by different Amazon Mechanical Turk (AMT) workers. Following the Karpathy split (https://github.com/karpathy/neuraltalk) in [22], we use a set of 113,287 images for training, 5K images for validation and 5K for testing.

4.1.2 Images and Captions Pre-processing

In the encoder-decoder framework, the image encoder is an essential part of image captioning, which extracts the visual information of images. To fully understand the input image $I$, we design three kinds of visual information at hierarchical levels. The low level is the regional features $V_{conv}=\{v_{1},...,v_{k}\},v_{i}\in\mathbb{R}^{C}$, extracted from the output of a Faster R-CNN [36] with ResNet-101 [18], as in [2, 30, 16]. Note that the number of regional features varies from image to image. The middle level is the object-centered visual concepts $V_{obj}\in\mathbb{R}^{V}$, which are extracted by a visual concept extractor CNN trained on the MS COCO dataset [8]. We take nouns from the captions as our visual semantic concepts and treat their prediction as a multi-label classification problem, minimizing an element-wise logistic loss with label smoothing [42]. The high level is the scene concepts $V_{scene}\in\mathbb{R}^{S}$, which are extracted by a scene classifier CNN pretrained on the Place dataset [56].

We follow standard practice and perform only minimal text pre-processing. All sentences in the training set are truncated to 16 words, converted to lower case and tokenized on white space; words that occur fewer than 5 times are filtered out, resulting in a model vocabulary of 9,487 words.
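The caption pre-processing steps above can be sketched in a few lines. The function names are hypothetical, and the order of operations (truncate first, then count words) is an assumption, since the text does not specify it.

```python
from collections import Counter

def tokenize(caption, max_len=16):
    """Lower-case, split on white space, and keep at most max_len tokens."""
    return caption.lower().split()[:max_len]

def build_vocab(captions, min_count=5, max_len=16):
    """Keep words that occur at least min_count times in the training captions."""
    counts = Counter(tok for cap in captions for tok in tokenize(cap, max_len))
    return sorted(word for word, n in counts.items() if n >= min_count)

# Toy corpus: "dog" and "runs" appear only once and are filtered out.
captions = ["A man is sleeping"] * 5 + ["A dog runs"]
vocab = build_vocab(captions)
print(vocab)
```

In a full pipeline, filtered words would typically be mapped to a special unknown token, which is also an assumption not stated in the text.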

4.1.3 Evaluation Metric

To evaluate the quantitative performance of the captions generated by our proposed model, we use five metrics that are commonly used in image captioning: BLEU [34], METEOR [12], ROUGE [27], CIDEr [44] and SPICE [1]. All results are computed with the Microsoft COCO caption evaluation tool (https://github.com/tylin/coco-caption), where a larger score means better performance for all five metrics.

4.1.4 Implementation Details

We set the number of hidden units in each LSTM to 1,000, the number of hidden units in the attention layer to 512, and the size of the input word embedding to 1,000. In training, we use the Adam optimizer [23] with a learning rate initialized to 5e-4 and decayed by a factor of 0.8 every three epochs. The batch size is 100. In testing, beam search is used to sample captions and the beam size is set to 2.
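The learning-rate schedule described above can be written as a step decay. This sketch assumes zero-based epoch indices and that the decay is applied at epoch boundaries; neither detail is stated in the text, and the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=5e-4, decay=0.8, every=3):
    """Step decay: multiply the rate by `decay` once every `every` epochs."""
    return base_lr * decay ** (epoch // every)

for epoch in (0, 2, 3, 6, 9):
    print(epoch, learning_rate(epoch))
```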

4.2 Ablation Study

In order to figure out the contribution of each component, we conduct the following ablation studies on the MS COCO dataset with Karpathy test split. Specifically, we remove the visual concepts (VC) and the proposed factored attention module (Scene) respectively from our model.

We summarize the experimental results in Tab. 1. The baseline is a re-implementation of the Up-Down method proposed in [2]. "VC" denotes adding traditional visual concepts attention, and "Scene" denotes adding the factored attention module. "[VC, Scene]" denotes concatenating the visual concepts and scene concepts for semantic attention. "Baseline + VC + Scene" is our full model, i.e., the baseline model with our scene-based factored attention module.

From the results in Tab. 1, we can see that our model performs better than the baseline model, with relative improvements ranging from 1.6% to 6.3%. With the guidance of scene concepts, the model makes better use of visual information. In addition, the comparison with "Baseline + [VC, Scene]" shows that although adding scene cues to visual concepts attention helps the model choose words, it is not the optimal solution: "Baseline + VC + Scene" obtains higher performance on all 5 metrics. This verifies the importance of our scene-based factored attention module.
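As a worked check of how such relative improvements are computed, the sketch below uses the "Baseline" and "Baseline + VC + Scene" rows exactly as they appear in Table 1. With these rows the smallest gain (Bleu1, about 1.6%) matches the lower end of the quoted range; the upper end depends on which rows and metrics the authors compared, which the text does not specify.

```python
# Rows copied from Table 1: Baseline and the full model (Baseline + VC + Scene).
METRICS = ["Bleu1", "Bleu2", "Bleu3", "Bleu4", "METEOR", "ROUGE", "CIDEr", "SPICE"]
BASELINE = [0.764, 0.602, 0.460, 0.349, 0.269, 0.559, 1.088, 0.201]
FULL = [0.776, 0.618, 0.477, 0.367, 0.277, 0.570, 1.147, 0.209]

def relative_improvement(new, old):
    """Relative gain of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old

gains = {m: relative_improvement(f, b)
         for m, f, b in zip(METRICS, FULL, BASELINE)}
for name, gain in gains.items():
    print(f"{name}: {gain:.1f}%")
```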

To determine the hyper-parameter $\gamma$ in Eq. 25, we run a controlled experiment in which only $\gamma$ is varied. The results on the Karpathy test split with different $\gamma$ values are shown in Fig. 3. Notice that the evaluation results achieve their optimal scores when $\gamma=3$.

4.3 Comparing with State-of-the-Arts

In Tab. 2, we report the performance of our framework in comparison with existing state-of-the-art methods on the test portion of the Karpathy split. For a fair comparison, results are reported for models trained with the standard MLE loss in Tab. 2 (top) and for models optimized for the CIDEr score in Tab. 2 (bottom). For offline evaluation, all image captioning models are single models with no fine-tuning of the input ResNet / R-CNN model. It is clear that our model performs the best on the commonly used evaluation metrics, e.g., BLEU, ROUGE and CIDEr. The experimental results demonstrate that our proposed scene-based factored attention module can significantly boost the scores compared with existing state-of-the-art methods.

We also compare our model with recent results on the official MS COCO evaluation by uploading results to the online MS COCO test server. The online server provides "C5" and "C40" metrics, which use 5 and 40 reference captions, respectively. The results are summarized in Tab. 3. Our single model trained with CIDEr optimization achieves the best performance on most metrics among the published state-of-the-art image captioning models on the blind test split.

4.4 Qualitative analysis

Here, we show some qualitative results in Fig. 4 for a better understanding of our proposed model. "Detected" denotes the scene concepts detected from the image. "Ours w scene" and "Ours wo scene" denote our proposed model with and without the scene-based factored attention module. We can see that the model with the proposed module pays more attention to the details of the scenes and is more inclined to mention the scene keywords in the generated descriptions.

We further visualize heatmaps of the attended regions for words generated with and without the scene-based factored attention module on the same image in Fig. 5. It is common practice [51, 2] to directly visualize the attention weights $\alpha_{t}$ in Eq. 18 associated with the word emitted at the same time step $t$. We find that the attended area is clearer when the scene semantic concepts are used as guidance. In a complex scene, as shown at the top of Fig. 5, the model attends to regional features more clearly and discriminately and tends to describe the scene more. In a relatively simple scene, as shown at the bottom of Fig. 5, the attention weights generated by our model are more reasonable, indicating that regional features are used more accurately. For both example images, the attention weights shift appropriately as the words of the caption are generated.

5 Conclusions

In this work, we propose a novel scene-based factored attention module for image captioning. Different from previous works based on either regional feature attention or object-centered visual concept attention, our model takes scene concepts into account. As far as we know, we are the first to take scene concepts into consideration in image captioning and to model the relevance among scene concepts, object-centered visual concepts and caption generation. In our proposed scene-based factored attention module, we explicitly embed scene concepts, in the form of a factored tensor, into the LSTM hidden feature. Conditioned on the scene-embedded hidden feature, we obtain the relative importance of regional features and object-centered visual concepts. The real power of our proposed module lies in its ability to attend to hierarchical visual information for better captions. Experiments conducted on the MS COCO captioning dataset validate the superiority of the proposed approach.

Acknowledgement

This work is supported by the National Key R&D Program (No.2017YFC0113000, and No.2016YFB1001503), Nature Science Foundation of China (No.U1705262, No.61772443, and No.61572410), Post Doctoral Innovative Talent Support Program under Grant BX201600094, China Post-Doctoral Science Foundation under Grant 2017M612134, Scientific Research Project of National Language Committee of China (Grant No. YB135-49), and Nature Science Foundation of Fujian Province, China (No. 2017J01125 and No. 2018J01106).

Bibliography

[1] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Spice: Semantic propositional image caption evaluation. In European Conference on Computer Vision, pages 382–398. Springer, 2016.
[2] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086, 2018.
[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[4] Fuhai Chen, Rongrong Ji, Jinsong Su, Yongjian Wu, and Yunsheng Wu. Structcap: Structured semantic embedding for image captioning. In Proceedings of the 25th ACM international conference on Multimedia, pages 46–54. ACM, 2017.
[5] Fuhai Chen, Rongrong Ji, Xiaoshuai Sun, Yongjian Wu, and Jinsong Su. Groupcap: Group-based image captioning with structured relevance and diversity constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1345–1353, 2018.
[6] Shi Chen and Qi Zhao. Boosted attention: Leveraging human attention for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 68–84, 2018.
[7] Tseng-Hung Chen, Yuan-Hong Liao, Ching-Yao Chuang, Wan-Ting Hsu, Jianlong Fu, and Min Sun. Show, adapt and tell: Adversarial training of cross-domain image captioner. In Proceedings of the IEEE International Conference on Computer Vision, pages 521–530, 2017.
[8] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.