# HCCM: Hierarchical Cross-Granularity Contrastive and Matching Learning for Natural Language-Guided Drones

**Authors:** Hao Ruan, Jinliang Lin, Yingxin Lai, Zhiming Luo, and Shaozi Li

arXiv: 2508.21539 · 2025-09-01

## TL;DR

HCCM is a hierarchical contrastive and matching learning framework for natural language-guided drones that strengthens fine-grained vision-language alignment and compositional reasoning in dynamic environments. It reports state-of-the-art retrieval results on GeoText-1652 and strong zero-shot generalization on the unseen ERA dataset.

## Contribution

The paper proposes HCCM, a novel hierarchical contrastive and matching learning framework that captures local-to-global semantics without strict scene partitioning, improving robustness and zero-shot generalization.

## Key findings

- Achieves state-of-the-art Recall@1 on GeoText-1652: 28.8% for image retrieval and 14.7% for text retrieval.
- Demonstrates strong zero-shot generalization on the unseen ERA dataset, with 39.93% mean recall (mR).
- Outperforms fine-tuned baselines on ERA while being evaluated zero-shot.
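
For readers unfamiliar with the metrics quoted above, the following is a minimal sketch of how Recall@K and a mean recall are commonly computed for retrieval. It is not taken from the paper: the function names, the toy similarity matrix, and the choice of averaging over K = 1, 5, 10 are assumptions, and the paper's exact mR definition may differ.

```python
from typing import Sequence, Set


def recall_at_k(similarity: Sequence[Sequence[float]],
                positives: Sequence[Set[int]],
                k: int) -> float:
    """Fraction of queries whose top-k candidates contain at least one positive.

    similarity[i][j] is the score between query i and candidate j.
    positives[i] is the set of candidate indices that are correct for query i.
    """
    hits = 0
    for scores, pos in zip(similarity, positives):
        # Rank candidate indices from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if pos & set(ranked[:k]):
            hits += 1
    return hits / len(similarity)


def mean_recall(similarity, positives, ks=(1, 5, 10)) -> float:
    """Average of Recall@K over the given cutoffs (assumed definition of mR)."""
    return sum(recall_at_k(similarity, positives, k) for k in ks) / len(ks)


# Hypothetical toy data: 3 text queries scored against 3 candidate images.
toy_similarity = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.1],
]
toy_positives = [{0}, {1}, {2}]

if __name__ == "__main__":
    print(recall_at_k(toy_similarity, toy_positives, k=1))
    print(mean_recall(toy_similarity, toy_positives, ks=(1, 2, 3)))
```

Using a set of positives per query keeps the sketch usable when one query has several correct matches, which is common in cross-view retrieval benchmarks.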

## Abstract

Natural Language-Guided Drones (NLGD) provide a novel paradigm for tasks such as target matching and navigation. However, the wide field of view and complex compositional semantics in drone scenarios pose challenges for vision-language understanding. Mainstream Vision-Language Models (VLMs) emphasize global alignment while lacking fine-grained semantics, and existing hierarchical methods depend on precise entity partitioning and strict containment, limiting effectiveness in dynamic environments. To address this, we propose the Hierarchical Cross-Granularity Contrastive and Matching learning (HCCM) framework with two components: (1) Region-Global Image-Text Contrastive Learning (RG-ITC), which avoids precise scene partitioning and captures hierarchical local-to-global semantics by contrasting local visual regions with global text and vice versa; (2) Region-Global Image-Text Matching (RG-ITM), which dispenses with rigid constraints and instead evaluates local semantic consistency within global cross-modal representations, enhancing compositional reasoning. Moreover, drone text descriptions are often incomplete or ambiguous, destabilizing alignment. HCCM introduces a Momentum Contrast and Distillation (MCD) mechanism to improve robustness. Experiments on GeoText-1652 show HCCM achieves state-of-the-art Recall@1 of 28.8% (image retrieval) and 14.7% (text retrieval). On the unseen ERA dataset, HCCM demonstrates strong zero-shot generalization with 39.93% mean recall (mR), outperforming fine-tuned baselines.
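
The cross-granularity idea behind RG-ITC, contrasting local visual regions with global text and vice versa, can be illustrated with a standard InfoNCE loss. The sketch below is an assumption-laden illustration, not the authors' implementation: it assumes one pre-extracted region feature and one local text feature per sample, a temperature of 0.07, and equal weighting of the two directions. All function names are hypothetical, and the momentum distillation (MCD) component is not modeled.

```python
import math


def _normalize(vector):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(queries, keys, temperature=0.07):
    """Standard InfoNCE: query i should match key i against all other keys."""
    q = [_normalize(v) for v in queries]
    k = [_normalize(v) for v in keys]
    total = 0.0
    for i, q_i in enumerate(q):
        logits = [_dot(q_i, k_j) / temperature for k_j in k]
        # Numerically stable log-sum-exp over all keys.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        # Negative log-probability of the matching key.
        total += log_denominator - logits[i]
    return total / len(q)


def region_global_contrastive(region_img, global_img, local_txt, global_txt,
                              temperature=0.07):
    """Assumed cross-granularity loss: local image vs. global text, and
    local text vs. global image, averaged with equal weight."""
    image_region_to_text = info_nce(region_img, global_txt, temperature)
    text_local_to_image = info_nce(local_txt, global_img, temperature)
    return 0.5 * (image_region_to_text + text_local_to_image)


if __name__ == "__main__":
    # Hypothetical 2-sample batch with 2-dimensional features.
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print(region_global_contrastive(aligned, aligned, aligned, aligned))
    print(region_global_contrastive(aligned, swapped, aligned, swapped))
```

In this toy setup the loss is near zero when each local feature points in the same direction as its paired global feature and becomes large when the pairs are swapped, which is the behavior a contrastive objective is meant to produce.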

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.21539/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/2508.21539/full.md

## References

50 references — full list in the complete paper: https://tomesphere.com/paper/2508.21539/full.md

---
Source: https://tomesphere.com/paper/2508.21539