Viewpoint Integration and Registration with Vision Language Foundation Model for Image Change Understanding
Xiaonan Lu, Jianlong Yuan, Ruigang Niu, Yuan Hu, Fan Wang

TL;DR
This paper introduces a novel method combining viewpoint registration and adapter-based encoding to enhance vision language models for image change understanding, effectively capturing nuances between images despite viewpoint variations.
Contribution
The paper proposes a viewpoint integration and registration approach with a fused adapter encoder, improving the image change understanding (ICU) performance of vision language foundation models (VLFMs) under viewpoint changes.
Findings
Achieves state-of-the-art results on CLEVR-Change and Spot-the-Diff datasets.
Effectively reduces viewpoint variation impact on change detection.
Enhances nuanced understanding between image pairs.
Abstract
Recently, the development of pre-trained vision language foundation models (VLFMs) has led to remarkable performance in many tasks. However, these models tend to have strong single-image understanding capability but lack the ability to understand multiple images. Therefore, they cannot be directly applied to image change understanding (ICU), which requires models to capture actual changes between multiple images and describe them in language. In this paper, we discover that existing VLFMs perform poorly when applied directly to ICU because of the following problems: (1) VLFMs generally learn the global representation of a single image, while ICU requires capturing nuances between multiple images. (2) The ICU performance of VLFMs is significantly affected by viewpoint variations, because the relationships between objects are altered when the viewpoint changes. To address…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Adapter
