# DR-CLIP: A Deformable Vision–Language Model for Scale-Invariant Object Counting in Remote Sensing Images

**Authors:** Jingzhe Nie, Qun Liu, Tianze Li, Xu Lu, Liang Zhang

PMC · DOI: 10.3390/s26061863 · Sensors (Basel, Switzerland) · 2026-03-16

## TL;DR

DR-CLIP is a vision–language model that improves object counting in remote sensing images. It handles scale variation with a deformable attention mechanism and handles diverse annotation types with a unified image–text training format.

## Contribution

Introduces DR-CLIP, which combines a Region-to-Instruction (R2I) mechanism and a Multi-scale Deformable Attention (MSDA) module for scalable, open-vocabulary object counting in remote sensing images.

## Key findings

- DR-CLIP achieves a MAE of 2.34 and RMSE of 3.89 on DOTA-v2.0, outperforming baselines by 19.0% in MAE.
- The MSDA module increases Small-Object Recall (SOR) to 0.824, improving dense and small object counting.
- DR-CLIP shows strong cross-domain generalization with only 8.7% performance degradation, compared to 23.4% in baselines.
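
The MAE and RMSE figures above are the usual per-image counting errors. As a minimal sketch of how such numbers are computed, assuming one predicted and one ground-truth count per image: the counts below are made up for illustration and are not from the paper, and `relative_reduction` assumes that a percentage such as the 19.0% is a relative reduction in error, which this summary does not state explicitly.

```python
import math

def mae(pred, true):
    """Mean Absolute Error over per-image counts."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def rmse(pred, true):
    """Root Mean Squared Error over per-image counts."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

def relative_reduction(baseline_err, model_err):
    """Fraction by which model_err is lower than baseline_err (assumed definition)."""
    return (baseline_err - model_err) / baseline_err

# Illustrative counts only (hypothetical, not taken from the paper).
predicted = [12, 7, 30, 4]
ground_truth = [10, 7, 33, 5]

print(mae(predicted, ground_truth))             # 1.5
print(round(rmse(predicted, ground_truth), 3))  # 1.871
print(relative_reduction(2.0, 1.5))             # 0.25
```

RMSE penalizes large misses more heavily than MAE, so the two are typically reported together for counting tasks.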

## Abstract

What are the main findings?

- Proposed a Region-to-Instruction (R2I) mechanism that unifies heterogeneous annotations (points, boxes, polygons) into a standardized image–text format for scalable vision–language training.
- Developed a Multi-scale Deformable Attention (MSDA) module that dynamically adjusts receptive fields to enhance feature extraction across extreme scale variations and cluttered backgrounds in remote sensing images.
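
The idea of dynamically adjusted receptive fields can be illustrated with a generic deformable-attention step: each query samples a few feature-map locations around a reference point, at offsets chosen per query, and combines them with softmax weights across all scales. The sketch below is a plain-Python illustration of that general technique, not the paper's MSDA implementation, whose details are not given in this summary. The function names are hypothetical, and the offsets and logits are passed in directly, whereas a real model would predict them from the query features.

```python
import math

def bilinear(fmap, x, y):
    """Sample a 2D map (list of rows) at fractional (x, y), clamped to the border."""
    h, w = len(fmap), len(fmap[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = fmap[y0][x0] * (1 - fx) + fmap[y0][x1] * fx
    bottom = fmap[y1][x0] * (1 - fx) + fmap[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def deformable_attend(levels, ref_xy, offsets, logits):
    """Weighted sum of sampled features across scales.

    levels:  list of 2D feature maps at different resolutions
    ref_xy:  reference point in normalized [0, 1] coordinates
    offsets: offsets[l][k] = (dx, dy) in pixels of level l
    logits:  logits[l][k] = unnormalized attention score for that sample
    """
    # Attention weights are normalized jointly over all levels and samples.
    weights = softmax([s for level in logits for s in level])
    out, i = 0.0, 0
    for fmap, level_offsets in zip(levels, offsets):
        h, w = len(fmap), len(fmap[0])
        cx, cy = ref_xy[0] * (w - 1), ref_xy[1] * (h - 1)
        for dx, dy in level_offsets:
            out += weights[i] * bilinear(fmap, cx + dx, cy + dy)
            i += 1
    return out

# Toy single-level example: a 3x3 map whose value equals its x coordinate.
ramp = [[0.0, 1.0, 2.0] for _ in range(3)]
value = deformable_attend([ramp], (0.5, 0.5), [[(0.5, 0.0), (-0.5, 0.0)]], [[0.0, 0.0]])
print(value)  # samples at x=1.5 and x=0.5 with equal weights
```

Because the sampling locations are not tied to a fixed grid, small offsets on a fine level suit small objects while larger offsets or coarser levels cover large ones.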

What are the implications of the main findings?

- The DR-CLIP framework achieves robust cross-modal alignment and open-vocabulary counting capability, enabling flexible object quantification from natural language queries without category-specific retraining.
- The method demonstrates strong cross-domain generalization and maintains practical inference efficiency, making it suitable for deployment in diverse and complex remote sensing scenarios.

Object counting in remote sensing images is valuable for applications such as urban planning and environmental monitoring. However, it remains challenging because of heterogeneous annotations, semantic ambiguity in open-vocabulary queries, and degraded performance on small targets. To address these limitations, we propose DR-CLIP (Deformable Remote CLIP), a vision–language model for remote sensing image counting that combines deformable visual feature extraction with text-guided prediction. DR-CLIP includes (1) a Region-to-Instruction (R2I) mechanism that converts points, bounding boxes, and polygons into a unified image–text training representation, (2) a Multi-scale Deformable Attention (MSDA) module that enhances discriminative feature extraction across extreme scale variations and cluttered backgrounds, and (3) a Text-Guided Counting Head that establishes robust cross-modal alignment through contrastive learning, enabling open-vocabulary counting without category-specific retraining. On DOTA-v2.0, DR-CLIP achieves a Mean Absolute Error (MAE) of 2.34 and a Root Mean Squared Error (RMSE) of 3.89, outperforming baselines by 19.0% in MAE. The MSDA module significantly increases Small-Object Recall (SOR) to 0.824, which is especially beneficial when counting dense and small objects. In cross-modal retrieval, DR-CLIP attains R@1 scores of 68.3% (image-to-text) and 72.1% (text-to-image) on the Remote Sensing Image Captioning Dataset (RSICD). The framework generalizes robustly, with only 8.7% performance degradation in cross-domain tests, significantly lower than the 23.4% drop observed in baseline methods.
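
To make the R2I idea concrete, the following is a minimal sketch of how point, box, and polygon annotations might be reduced to one image–text training sample. Everything here is an assumption for illustration: the dictionary layout, the function names `to_box` and `to_instruction`, the choice of an axis-aligned box as the common region type, and the prompt wording are hypothetical, since this summary does not give the paper's actual template.

```python
def to_box(ann):
    """Reduce any annotation to an axis-aligned box (xmin, ymin, xmax, ymax).

    ann is a dict with keys "category", "kind", and "coords", where coords holds
    1 point, 2 box corners, or 3+ polygon vertices as (x, y) tuples.
    A point becomes a zero-area box at that location.
    """
    if ann["kind"] not in ("point", "box", "polygon"):
        raise ValueError(f"unsupported annotation kind: {ann['kind']}")
    xs = [x for x, _ in ann["coords"]]
    ys = [y for _, y in ann["coords"]]
    return (min(xs), min(ys), max(xs), max(ys))

def to_instruction(annotations, category):
    """Build one hypothetical image-text counting sample for a category."""
    regions = [to_box(a) for a in annotations if a["category"] == category]
    return {
        "instruction": f"Count the objects of category '{category}' in this image.",
        "response": str(len(regions)),
        "regions": regions,
    }

# Mixed annotation types for the same image (made-up coordinates).
annotations = [
    {"category": "ship", "kind": "point", "coords": [(10, 12)]},
    {"category": "ship", "kind": "box", "coords": [(30, 40), (50, 60)]},
    {"category": "ship", "kind": "polygon", "coords": [(5, 5), (9, 4), (8, 11)]},
    {"category": "plane", "kind": "box", "coords": [(0, 0), (20, 10)]},
]

sample = to_instruction(annotations, "ship")
print(sample["instruction"])
print(sample["response"])  # 3
print(sample["regions"])
```

Whatever the exact template, the point of such a conversion is that datasets labeled with different annotation types can feed the same image–text training pipeline.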

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC13030177/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13030177/full.md

## References

59 references — full list in the complete paper: https://tomesphere.com/paper/PMC13030177/full.md

---
Source: https://tomesphere.com/paper/PMC13030177