Lightweight Multimodal Adaptation of Vision Language Models for Species Recognition and Habitat Context Interpretation in Drone Thermal Imagery

Hao Chen; Fang Qiu; Fangchao Dong; Defei Yang; Eve Bohnett; Li An

arXiv:2604.06124·cs.CV·April 8, 2026

TL;DR

This paper introduces a lightweight multimodal adaptation framework that effectively transfers RGB-pretrained vision language models to thermal drone imagery, enabling species recognition and habitat context interpretation.

Contribution

The study adapts RGB-pretrained vision language models to thermal imagery by fine-tuning them through multimodal projector alignment on a drone-collected thermal dataset, and benchmarks three models on species recognition and instance enumeration.

Findings

01

Qwen3-VL-8B-Instruct with open-set prompting achieved the best overall performance among the three tested models, with F1 scores of 0.935 for deer, 0.915 for rhino, and 0.968 for elephant.

02

For instance enumeration, the models reached within-1 accuracy of up to 1.0.

03

Combining thermal and RGB imagery enables habitat and landscape feature interpretation.
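
The two metrics named in the findings can be illustrated with a minimal Python sketch. It assumes that within-1 accuracy means the share of images whose predicted count differs from the true count by at most one; the listing does not define the metric, so that reading is an assumption. The function names and the toy labels and counts below are hypothetical and are not data from the paper.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 score for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        # No true positives: precision and recall are both zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def within_k_accuracy(true_counts, pred_counts, k=1):
    """Share of images whose predicted count is within k of the true count.

    Assumed definition; the paper may compute this metric differently.
    """
    if not true_counts or len(true_counts) != len(pred_counts):
        raise ValueError("count lists must be non-empty and the same length")
    hits = sum(1 for t, p in zip(true_counts, pred_counts) if abs(t - p) <= k)
    return hits / len(true_counts)


# Toy data for illustration only (not results from the paper).
toy_true = ["deer", "deer", "rhino", "elephant", "rhino", "deer"]
toy_pred = ["deer", "rhino", "rhino", "elephant", "rhino", "deer"]
toy_true_counts = [3, 5, 2, 7]
toy_pred_counts = [3, 4, 4, 7]

for species in ("deer", "rhino", "elephant"):
    print(species, round(per_class_f1(toy_true, toy_pred, species), 3))
print("within-1:", within_k_accuracy(toy_true_counts, toy_pred_counts))
```

Under this definition, a within-1 accuracy of 1.0 would mean that every predicted count was off by at most one animal, not that every count was exact.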

Abstract

This study proposes a lightweight multimodal adaptation framework to bridge the representation gap between RGB-pretrained VLMs and thermal infrared imagery, and demonstrates its practical utility using a real drone-collected dataset. A thermal dataset was developed from drone-collected imagery and was used to fine-tune VLMs through multimodal projector alignment, enabling the transfer of information from RGB-based visual representations to thermal radiometric inputs. Three representative models, including InternVL3-8B-Instruct, Qwen2.5-VL-7B-Instruct, and Qwen3-VL-8B-Instruct, were benchmarked under both closed-set and open-set prompting conditions for species recognition and instance enumeration. Among the tested models, Qwen3-VL-8B-Instruct with open-set prompting achieved the best overall performance, with F1 scores of 0.935 for deer, 0.915 for rhino, and 0.968 for elephant, and…
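
The difference between the closed-set and open-set prompting conditions mentioned in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' prompts: it assumes closed-set prompting lists the candidate species in the prompt while open-set prompting does not. The prompt wording, the `species: count` reply format, and the helper names `build_prompt` and `parse_response` are all hypothetical.

```python
import re

# Species named in the abstract; used only for the closed-set condition.
CANDIDATE_SPECIES = ["deer", "rhino", "elephant"]


def build_prompt(closed_set: bool) -> str:
    """Build a hypothetical text prompt for one thermal drone image."""
    context = "This is a thermal infrared image captured from a drone. "
    reply_format = "Answer with one line per species in the form 'species: count'."
    if closed_set:
        # Closed-set (assumed): the model chooses among listed candidates.
        options = ", ".join(CANDIDATE_SPECIES)
        return context + f"Which of these species are visible: {options}? " + reply_format
    # Open-set (assumed): no candidate list is given.
    return context + "Which animal species are visible? " + reply_format


def parse_response(text: str) -> dict:
    """Parse lines such as 'deer: 3' into {'deer': 3}."""
    matches = re.finditer(r"([A-Za-z]+)\s*:\s*(\d+)", text)
    return {m.group(1).lower(): int(m.group(2)) for m in matches}


print(build_prompt(closed_set=True))
print(build_prompt(closed_set=False))
print(parse_response("Deer: 3\nelephant: 1"))
```

Parsing the reply into a species-to-count mapping lets the same output feed both the species recognition and the instance enumeration evaluations.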
