Leveraging Textual Compositional Reasoning for Robust Change Captioning
Kyu Ri Park, Jiyoung Park, Seong Tae Kim, Hong Joo Lee, Jung Uk Kim

TL;DR
This paper introduces CORTEX, a framework that combines visual features with textual cues drawn from Vision Language Models (VLMs) so that change captions capture subtle and compositional differences more accurately.
Contribution
CORTEX is the first framework to integrate scene-level textual knowledge with visual features for robust change captioning, enhancing reasoning over subtle changes.
Findings
Improved change captioning accuracy over baseline models.
Effective integration of textual cues from VLMs enhances reasoning.
Better detection of subtle and compositional changes.
Abstract
Change captioning aims to describe changes between a pair of images. However, existing works rely on visual features alone, which often fail to capture subtle but meaningful changes because they lack the ability to represent explicitly structured information such as object relationships and compositional semantics. To alleviate this, we present CORTEX (COmpositional Reasoning-aware TEXt-guided), a novel framework that integrates complementary textual cues to enhance change understanding. In addition to capturing cues from pixel-level differences, CORTEX utilizes scene-level textual knowledge provided by Vision Language Models (VLMs) to extract richer image text signals that reveal underlying compositional reasoning. CORTEX consists of three key modules: (i) an Image-level Change Detector that identifies low-level visual differences between paired images, (ii) a Reasoning-aware Text…
Taxonomy
Topics: Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques · Language, Metaphor, and Cognition
