Revisiting Change VQA in Remote Sensing with Structured and Native Multimodal Qwen Models
Yakoub Bazi, Mohamad M. Al Rahhal, Mansour Zuair, and Faroun Mohamed

TL;DR
This paper evaluates recent multimodal vision-language models on Change VQA in remote sensing, finding that native multimodal models outperform structured pipelines and that larger models do not always yield better results.
Contribution
The paper provides a comparative analysis of Qwen3-VL and Qwen3.5 models on the CDVQA benchmark under a unified LoRA setting, highlighting the effectiveness of native multimodal architectures over structured pipelines.
Findings
Native multimodal models outperform structured vision-language pipelines.
Model size does not monotonically improve performance.
Tightly integrated multimodal backbones are more effective for semantic change reasoning.
Abstract
Change visual question answering (Change VQA) addresses the problem of answering natural-language questions about semantic changes between bi-temporal remote sensing (RS) images. Although vision-language models (VLMs) have recently been studied for temporal RS image understanding, Change VQA remains underexplored in the context of modern multimodal models. In this letter, we revisit the CDVQA benchmark using recent Qwen models under a unified low-rank adaptation (LoRA) setting. We compare Qwen3-VL, which follows a structured vision-language pipeline with multi-depth visual conditioning and a full-attention decoder, with Qwen3.5, a native multimodal model that combines a single-stage alignment with a hybrid decoder backbone. Experimental results on the official CDVQA test splits show that recent VLMs improve over earlier specialized baselines. They further show that performance does not…
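The abstract mentions a unified low-rank adaptation (LoRA) setting without describing its configuration here. As a rough illustration of the general idea only, the sketch below shows a LoRA-style linear layer in plain Python: a frozen weight matrix `W` is augmented with a trainable low-rank update `B @ A` scaled by `alpha / rank`. The class name `LoRALinear`, the helper `matvec`, and the rank and alpha values are hypothetical choices for this example, not the authors' implementation or hyperparameters.

```python
import random


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


class LoRALinear:
    """Illustrative LoRA layer: y = W x + (alpha / rank) * B (A x).

    W stays frozen; only A and B would be updated during fine-tuning.
    All names and values here are assumptions made for this sketch.
    """

    def __init__(self, W, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W                      # frozen pretrained weight
        self.scale = alpha / rank       # standard LoRA scaling factor
        # A starts with small random values, B starts at zero, so the
        # adapted layer initially reproduces the frozen layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.W, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        # Only the low-rank factors count as trainable.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


W = [[1.0, 2.0], [3.0, 4.0]]
layer = LoRALinear(W, rank=1, alpha=2.0)
y = layer.forward([1.0, 1.0])
```

Because `B` is initialized to zero, the layer's output matches the frozen base layer before any training step, and the number of trainable parameters grows with the rank rather than with the full weight size. This is the property that makes it practical to fine-tune models of different sizes under one shared setting.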
