Self-supervised Cross-view Representation Reconstruction for Change Captioning

Yunbin Tu; Liang Li; Li Su; Zheng-Jun Zha; Chenggang Yan; Qingming Huang

arXiv:2309.16283·cs.CV·September 29, 2023

TL;DR

This paper introduces SCORER, a self-supervised network that learns stable, view-invariant difference representations for change captioning by cross-view contrastive learning and cross-attention reconstruction, achieving state-of-the-art results.

Contribution

The paper proposes a self-supervised framework that combines multi-head token-wise matching with cross-view contrastive learning to learn a stable difference representation for change captioning.
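To make the two named ingredients concrete, here is a minimal sketch of token-wise matching followed by a symmetric contrastive (InfoNCE-style) loss over a batch of image pairs. It is not the paper's formulation: the max-over-tokens cosine score, the head splitting by slicing, the temperature value, and all function names (`token_wise_score`, `contrastive_loss`) are assumptions chosen for illustration, and it uses plain Python lists instead of learned features.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    # Zero vectors are left as-is to avoid division by zero.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def split_heads(tokens, num_heads):
    """Slice every token vector into num_heads chunks, L2-normalised per head."""
    dim = len(tokens[0])
    assert dim % num_heads == 0, "token dimension must be divisible by num_heads"
    step = dim // num_heads
    return [
        [l2_normalize(t[h * step:(h + 1) * step]) for t in tokens]
        for h in range(num_heads)
    ]


def token_wise_score(tokens_a, tokens_b, num_heads=2):
    """Assumed matching rule: per head, each token of A takes its best cosine
    match among the tokens of B; scores are averaged over tokens and heads."""
    heads_a = split_heads(tokens_a, num_heads)
    heads_b = split_heads(tokens_b, num_heads)
    total = 0.0
    for ha, hb in zip(heads_a, heads_b):
        total += sum(max(dot(ta, tb) for tb in hb) for ta in ha) / len(ha)
    return total / num_heads


def contrastive_loss(before_batch, after_batch, temperature=0.5):
    """Symmetric InfoNCE: the i-th 'before' image should score highest with
    the i-th 'after' image, read along both rows and columns of the matrix."""
    n = len(before_batch)
    sims = [
        [token_wise_score(b, a) / temperature for a in after_batch]
        for b in before_batch
    ]
    loss = 0.0
    for i in range(n):
        row = sims[i]
        col = [sims[j][i] for j in range(n)]
        for logits in (row, col):
            log_z = math.log(sum(math.exp(x) for x in logits))
            loss += log_z - logits[i]
    return loss / (2 * n)
```

Because the score takes a maximum over tokens, it does not depend on token order, which is one simple way a matching function can tolerate the spatial shifts that a viewpoint change introduces.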

Findings

01 Achieves state-of-the-art results on four datasets.

02 Effectively learns view-invariant representations.

03 Improves caption quality through backward reasoning.

Abstract

Change captioning aims to describe the difference between a pair of similar images. Its key challenge is how to learn a stable difference representation under pseudo changes caused by viewpoint change. In this paper, we address this by proposing a self-supervised cross-view representation reconstruction (SCORER) network. Concretely, we first design a multi-head token-wise matching to model relationships between cross-view features from similar/dissimilar images. Then, by maximizing cross-view contrastive alignment of two similar images, SCORER learns two view-invariant image representations in a self-supervised way. Based on these, we reconstruct the representations of unchanged objects by cross-attention, thus learning a stable difference representation for caption generation. Further, we devise a cross-modal backward reasoning to improve the quality of the caption. This module reversely…
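The reconstruction step in the abstract can be illustrated with a small sketch: tokens of the "before" image act as queries over the tokens of the "after" image, so content that survives in both views is reconstructed, and what cannot be reconstructed is treated as the change. Several parts are assumptions rather than details from the paper: there are no learned query/key/value projections, the difference is taken by plain subtraction, and the names `cross_attention` and `difference_representation` are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def cross_attention(queries, keys_values, scale=None):
    """Scaled dot-product attention with keys == values and no projections."""
    dim = len(queries[0])
    if scale is None:
        scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(logits)
        out.append([
            sum(w * kv[d] for w, kv in zip(weights, keys_values))
            for d in range(dim)
        ])
    return out


def difference_representation(before, after, scale=None):
    """Assumed rule: difference = before tokens minus their reconstruction
    from the after tokens. Near-zero rows mean the token was unchanged."""
    recon = cross_attention(before, after, scale)
    return [[b - r for b, r in zip(bt, rt)] for bt, rt in zip(before, recon)]


if __name__ == "__main__":
    before = [[4.0, 0.0], [0.0, 4.0]]
    shifted = [[0.0, 4.0], [4.0, 0.0]]   # same objects, different positions
    changed = [[0.0, 4.0], [0.0, 4.0]]   # first object no longer present
    print(difference_representation(before, shifted, scale=5.0))
    print(difference_representation(before, changed, scale=5.0))
```

In this toy setup, a pure reordering of tokens (standing in for a viewpoint shift) yields a difference close to zero, while a token with no counterpart in the other image leaves a large residual.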

Code & Models

Repositories

tuyunbin/scorer (PyTorch, official)

Videos

Self-supervised Cross-view Representation Reconstruction for Change Captioning · YouTube

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization