Describing and Localizing Multiple Changes with Transformers
Yue Qiu, Shintaro Yamamoto, Kodai Nakashima, Ryota Suzuki, Kenji Iwata, Hirokatsu Kataoka, Yutaka Satoh

TL;DR
This paper introduces MCCFormers, a transformer-based model for detecting and describing multiple changes in image pairs, supported by a new dataset and benchmark results showing significant improvements over existing methods.
Contribution
The paper presents a novel multi-change captioning transformer model, a simulation dataset, and benchmarks that demonstrate superior performance in multi-change detection and description tasks.
Findings
MCCFormers achieved the highest scores on four evaluation metrics.
The method effectively separates attention maps for each change.
MCCFormers outperformed the previous state of the art on the CLEVR-Change benchmark.
Abstract
Change captioning tasks aim to detect changes in image pairs observed before and after a scene change and generate a natural language description of the changes. Existing change captioning studies have mainly focused on a single change. However, detecting and describing multiple changed parts in image pairs is essential for enhancing adaptability to complex scenarios. We solve the above issues from three aspects: (i) We propose a simulation-based multi-change captioning dataset; (ii) We benchmark existing state-of-the-art methods of single change captioning on multi-change captioning; (iii) We further propose Multi-Change Captioning transformers (MCCFormers) that identify change regions by densely correlating different regions in image pairs and dynamically determine the related change regions with words in sentences. The proposed method obtained the highest scores on four conventional…
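To make "densely correlating different regions in image pairs" concrete, here is a minimal sketch of scaled dot-product cross-attention between region features of a "before" and an "after" image. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy 2-dimensional features, and the residual-based change score are all hypothetical and do not come from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys):
    """Correlate every query region with every key region.

    Returns (weights, attended): weights[i][j] is how strongly query
    region i attends to key region j; attended[i] is the weighted sum
    of key features for query region i.
    """
    dim = len(queries[0])
    weights = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights.append(softmax(scores))
    attended = [
        [sum(w * k[d] for w, k in zip(row, keys)) for d in range(dim)]
        for row in weights
    ]
    return weights, attended


# Hypothetical toy features: 3 regions per image, 2 dimensions each.
before = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
after = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # third region differs

weights, attended = cross_attention(before, after)

# Illustrative heuristic only (an assumption, not the paper's method):
# score each "before" region by its distance to the attended "after" summary.
change_scores = [
    math.sqrt(sum((b - a) ** 2 for b, a in zip(bf, at)))
    for bf, at in zip(before, attended)
]
most_changed = max(range(len(change_scores)), key=change_scores.__getitem__)
```

Each row of `weights` is a small attention map over the regions of the other image. In this toy setup the third region, the only one whose features differ between the two images, receives the largest residual score.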
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
