SAMChat: Introducing Chain of Thought Reasoning and GRPO to a Multimodal Small Language Model for Small Scale Remote Sensing

Aybora Koksal; A. Aydin Alatan

arXiv:2505.07984·cs.CV·December 1, 2025

TL;DR

SAMChat is a lightweight multimodal model tailored to remote sensing imagery analysis. It uses chain-of-thought reasoning and GRPO to detect military sites accurately and to produce interpretable explanations.

Contribution

The paper introduces SAMChat, a resource-efficient multimodal model that combines a specialized dataset, chain-of-thought reasoning, and GRPO to advance remote sensing analysis in resource-constrained settings.

Findings

01

Achieved over 80% recall and 98% precision on the SAMData benchmark.

02

Outperformed larger general-purpose models in captioning and classification tasks.

03

Demonstrated effectiveness of fine-tuning and reinforcement learning in domain-specific applications.
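
As a reminder of what the recall and precision figures in the first finding measure, here is a minimal sketch of how both metrics are computed from detection counts. The function name `precision_recall` and the counts passed to it are hypothetical and chosen only for illustration; they are not taken from the paper or from SAMData.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) from true-positive, false-positive,
    and false-negative counts. Empty denominators yield 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts (assumption, not reported in the paper):
# 80 sites correctly flagged, 1 false alarm, 20 sites missed.
p, r = precision_recall(tp=80, fp=1, fn=20)
print(f"precision={p:.3f} recall={r:.3f}")
```

High precision with lower recall, as in this illustrative case, means that almost every flagged site is a real one while some real sites still go undetected.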

Abstract

Remarkable capabilities in understanding and generating text-image content have been demonstrated by recent advancements in multimodal large language models (MLLMs). However, their effectiveness in specialized domains, particularly those requiring resource-efficient and domain-specific adaptations, has remained limited. In this work, a lightweight multimodal language model termed SAMChat is introduced, specifically adapted to analyze remote sensing imagery in secluded areas, including challenging missile launch sites. A new dataset, SAMData, was compiled by verifying hundreds of aerial images through expert review, and subtle military installations were highlighted via detailed captions. Supervised fine-tuning on a 2B-parameter open-source MLLM with chain-of-thought (CoT) reasoning annotations was performed, enabling more accurate and interpretable explanations. Additionally, Group…
