Finding It at Another Side: A Viewpoint-Adapted Matching Encoder for Change Captioning

Xiangxi Shi; Xu Yang; Jiuxiang Gu; Shafiq Joty; and Jianfei Cai

arXiv:2009.14352 · cs.CV · October 1, 2020 · 6 cites

TL;DR

This paper introduces a viewpoint-adapted matching encoder for change captioning that effectively distinguishes viewpoint changes from semantic differences, improving accuracy in describing image changes.

Contribution

The paper proposes a novel visual encoder that explicitly separates viewpoint changes from semantic changes, and a reinforcement learning process that fine-tunes attention directly with language evaluation rewards.

Findings

01. Outperforms state-of-the-art on Spot-the-Diff dataset

02. Outperforms state-of-the-art on CLEVR-Change dataset

03. Effectively distinguishes viewpoint from semantic changes

Abstract

Change captioning is the task of describing the difference between two images in natural language. Most existing methods treat the problem as a difference judgment and assume there are no distractors, such as viewpoint changes. In practice, however, viewpoint changes happen often and can overwhelm the semantic difference to be described. In this paper, we propose a novel visual encoder that explicitly distinguishes viewpoint changes from semantic changes in the change captioning task. We further simulate the attention preference of humans and propose a novel reinforcement learning process that fine-tunes the attention directly with language evaluation rewards. Extensive experimental results show that our method outperforms the state-of-the-art approaches by a large margin on both the Spot-the-Diff and CLEVR-Change datasets.
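
The abstract does not spell out the reinforcement learning procedure, so the following is only a generic illustration of how a reward can adjust attention, not the authors' method. It sketches a REINFORCE-style update with a baseline, under the assumption that attention over image regions is treated as a sampled choice. The function names, the three-region setup, the learning rate, and the reward and baseline values are all hypothetical.

```python
import math


def softmax(scores):
    """Turn raw attention scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_grad(scores, chosen, reward, baseline):
    """Gradient of -(reward - baseline) * log p(chosen) w.r.t. the scores.

    Generic policy-gradient formula; the reward stands in for a
    language-evaluation score (hypothetical value, not from the paper).
    """
    probs = softmax(scores)
    advantage = reward - baseline
    # d/ds_i of -log softmax(s)[chosen] is p_i - 1[i == chosen]
    return [
        advantage * (p - (1.0 if i == chosen else 0.0))
        for i, p in enumerate(probs)
    ]


# Three hypothetical image regions with equal initial attention scores.
scores = [0.0, 0.0, 0.0]

# Attending to region 1 led to a caption scoring above the baseline.
grad = reinforce_grad(scores, chosen=1, reward=0.8, baseline=0.5)

# One gradient-descent step with a hypothetical learning rate of 0.1.
updated = [s - 0.1 * g for s, g in zip(scores, grad)]
```

Because the reward exceeds the baseline, the step raises the score of the attended region and lowers the others; a reward below the baseline would push in the opposite direction.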

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques