HCCM: Hierarchical Cross-Granularity Contrastive and Matching Learning for Natural Language-Guided Drones

Hao Ruan; Jinliang Lin; Yingxin Lai; Zhiming Luo; and Shaozi Li

arXiv:2508.21539 · cs.CV · September 1, 2025

TL;DR

HCCM introduces a hierarchical learning framework for natural language-guided drones that enhances vision-language understanding and compositional reasoning in dynamic environments, outperforming existing models on multiple benchmarks.

Contribution

The paper proposes HCCM, a novel hierarchical contrastive and matching learning framework that captures local-to-global semantics without strict scene partitioning, improving robustness and zero-shot generalization.

Findings

01

Achieves state-of-the-art Recall@1 of 28.8% in image retrieval.

02

Demonstrates strong zero-shot generalization with 39.93% mean recall on the ERA dataset.

03

Surpasses the fine-tuned baselines reported on ERA despite being evaluated there zero-shot.

Abstract

Natural Language-Guided Drones (NLGD) provide a novel paradigm for tasks such as target matching and navigation. However, the wide field of view and complex compositional semantics in drone scenarios pose challenges for vision-language understanding. Mainstream Vision-Language Models (VLMs) emphasize global alignment while lacking fine-grained semantics, and existing hierarchical methods depend on precise entity partitioning and strict containment, limiting effectiveness in dynamic environments. To address this, we propose the Hierarchical Cross-Granularity Contrastive and Matching learning (HCCM) framework with two components: (1) Region-Global Image-Text Contrastive Learning (RG-ITC), which avoids precise scene partitioning and captures hierarchical local-to-global semantics by contrasting local visual regions with global text and vice versa; (2) Region-Global Image-Text Matching…

Tables

Table 1. Comparative performance evaluation of cross-modal retrieval methods on the GeoText-1652 benchmark. Results are presented using Recall@K (R@K) for both Image Query (Drone-view Geolocalization) and Text Query (Drone Navigation) tasks, under zero-shot and fine-tuned settings. † denotes results reproduced with the provided source code. The best performances are in bold.

Method	Params	Pretrained Images	Image Query R@1 (%)	Image Query R@5 (%)	Image Query R@10 (%)	Text Query R@1 (%)	Text Query R@5 (%)	Text Query R@10 (%)
Zero-Shot Evaluation on GeoText-1652
UNITER (Chen et al., 2020)	300M	4M	2.5	7.4	11.8	0.9	2.7	4.2
METER-Swin (Dou et al., 2022)	380M	4M	2.7	8.0	12.2	1.3	3.9	5.8
ALBEF (Li et al., 2021a)	210M	4M	2.9	8.1	12.4	1.8	4.8	7.1
ALBEF (Li et al., 2021a)	210M	14M	3.0	9.1	14.2	1.1	3.5	5.3
XVLM (Zeng et al., 2022)	216M	4M	4.9	14.2	21.1	4.3	9.1	13.2
XVLM (Zeng et al., 2022)	216M	16M	5.0	14.4	21.4	4.5	9.9	13.4
Fine-Tuned Evaluation on GeoText-1652
HyCoCLIP (Pal et al., 2025) †	216M	16M	15.3	33.6	43.2	8.7	15.8	20.0
UNITER (Chen et al., 2020)	300M	4M	21.4	43.4	59.5	10.6	20.4	26.1
METER-Swin (Dou et al., 2022)	380M	4M	22.7	46.3	60.7	11.3	21.5	27.3
ALBEF (Li et al., 2021a)	210M	4M	22.9	49.5	62.3	12.3	22.8	28.6
ALBEF (Li et al., 2021a)	210M	14M	23.2	49.7	62.4	12.5	22.8	28.5
XVLM (Zeng et al., 2022)	216M	4M	23.6	50.0	63.2	13.1	23.5	29.2
XVLM (Zeng et al., 2022)	216M	16M	25.0	52.3	65.1	13.2	23.7	29.6
GeoText-1652 (Chu et al., 2024)	217M	16M	26.3	53.7	66.9	13.6	24.6	31.2
HCCM	216M	16M	28.8	57.3	69.9	14.7	26.0	32.5

Table 2. Ablation study on the GeoText-1652 dataset evaluating the contribution of individual components of our proposed HCCM method. MC and MD denote the proposed momentum contrast and momentum distillation, while RG-ITC and RG-ITM represent our region-global image text contrastive learning and region-global image text matching learning. ✓ marks an enabled component and - a disabled one.

MC	MD	RG-ITC	RG-ITM	Image Query R@1 (%)	Image Query R@5 (%)	Image Query R@10 (%)	Text Query R@1 (%)	Text Query R@5 (%)	Text Query R@10 (%)
-	-	-	-	25.51	52.70	65.54	12.84	23.07	29.27
✓	-	-	-	26.48	54.10	66.78	13.75	24.91	31.49
✓	✓	-	-	26.86	54.96	67.83	14.11	25.13	31.77
-	-	✓	-	27.01	54.63	67.05	13.63	24.24	30.50
-	-	✓	✓	27.04	54.91	67.52	14.04	24.69	30.97
✓	✓	✓	-	27.32	55.78	68.53	14.15	25.21	31.81
✓	✓	-	✓	26.89	54.86	67.72	14.41	25.77	32.28
✓	✓	✓	✓	28.82	57.30	69.93	14.73	25.98	32.49

Table 3. Ablation study of internal components in RG-ITC and RG-ITM to explore the impact of removing cross-modal directional losses on the performance of models. Each row below HCCM removes the indicated directional loss.

Method	Image Query R@1 (%)	Image Query R@5 (%)	Image Query R@10 (%)	Text Query R@1 (%)	Text Query R@5 (%)	Text Query R@10 (%)
HCCM	28.82	57.30	69.93	14.73	25.98	32.49
$-\,\mathcal{L}_{RG-ITC}(I_{i,k}\to T_{i})$	27.62	56.22	68.90	14.43	25.46	32.02
$-\,\mathcal{L}_{RG-ITC}(T_{i,k}\to I_{i})$	26.89	54.86	67.72	14.41	25.77	32.28
$-\,\mathcal{L}_{RG-ITM}(I_{i,k}\leftrightarrow T_{i})$	27.19	55.37	68.33	14.47	25.37	31.80
$-\,\mathcal{L}_{RG-ITM}(T_{i,k}\leftrightarrow I_{i})$	26.86	54.96	67.83	14.11	25.13	31.77

Table 4. Comparison of fine-tuned results and zero-shot results on the ERA dataset. mR denotes mean Recall.

Method	Image Retrieval R@1 (%)	Image Retrieval R@5 (%)	Image Retrieval R@10 (%)	Text Retrieval R@1 (%)	Text Retrieval R@5 (%)	Text Retrieval R@10 (%)	mR (%)
Reported Fine-tuned Results on ERA Dataset
VSE++ (Faghri et al., 2018)	10.13	35.20	53.91	9.79	30.40	42.90	30.39
PVSE K=2 (Song and Soleymani, 2019)	11.04	35.57	51.65	11.31	32.60	46.95	31.52
PVSE K=1 (Song and Soleymani, 2019)	11.14	36.08	53.75	9.96	33.95	47.97	32.14
CLIP (Radford et al., 2021b)	12.73	37.33	51.52	11.31	31.92	43.91	31.45
PCME (Chun et al., 2021)	13.85	42.87	60.64	14.69	35.30	49.15	36.08
AMFMN-soft (Yuan et al., 2022a)	14.18	46.79	62.87	14.35	38.01	52.02	38.04
AMFMN-sim (Yuan et al., 2022a)	13.75	43.41	59.59	14.02	34.12	51.52	36.06
AMFMN-fusion (Yuan et al., 2022a)	11.62	42.26	60.51	15.20	36.99	50.33	36.15
GALR (Yuan et al., 2022b)	14.03	45.15	64.54	12.38	36.59	50.90	37.27
VCSR (Huang et al., 2024)	13.69	46.31	66.37	15.65	38.28	53.49	38.96
Zero-shot Results on ERA Dataset (Fine-tuned on GeoText-1652)
HyCoCLIP (Pal et al., 2025)	7.77	17.57	23.31	8.04	18.65	21.96	16.22
XVLM (Zeng et al., 2022)	14.19	36.15	50.34	14.05	39.46	53.99	34.70
GeoText-1652 (Chu et al., 2024)	17.91	39.19	54.73	17.09	42.30	56.76	38.00
HCCM	19.93	39.19	56.76	18.58	45.20	59.93	39.93

Equations

\mathcal{L}_{RG-ITC}=\cdots\Bigl[\,\cdots+\log\frac{\exp(s(z_{t}^{(i,k)},z_{v}^{m(i)})/\tau)}{\sum_{j=1}^{N}\exp(s(z_{t}^{(i,k)},z_{v}^{m(j)})/\tau)}\Bigr],

\mathcal{L}_{ITM}=\cdots\Bigl[\,\cdots+(1-y)\log p_{match}(I,T)[0]\Bigr],

h_{rv-gt}^{+(i,k)}=\cdots

h_{gv-rt}^{+(i,k)}=\cdots

p(j^{\prime}\mid i,k)=\cdots

p(l^{\prime}\mid i,k)=\cdots

h_{rv-gt}^{-(i,k)}=\cdots

h_{gv-rt}^{-(i,k)}=\cdots

p_{\text{match}}^{(h)}=\text{softmax}\bigl(H_{\text{match}}(h)\bigr)\in\mathbb{R}^{2}.

\mathcal{L}_{\text{RG-ITM}}=-\frac{1}{|\mathcal{H}_{\text{RG}}|}\sum_{h\in\mathcal{H}_{\text{RG}}}\Bigl[\,\cdots+(1-y^{(h)})\log p_{\text{match}}^{(h)}[0]\Bigr],

q_{i2t}^{(i)}=\cdots

q_{t2i}^{(i)}=\cdots

\mathcal{L}_{ITC_{MCD}}=\cdots\Bigl[\,\cdots+H\left(q_{t2i}^{(i)},\text{softmax}(z_{t}^{(i)\top}Z_{v}^{m}/\tau)\right)\Bigr]

\mathcal{L}_{Box}=\cdots

\mathcal{L}_{total}=\cdots+w_{rg-itc}\mathcal{L}_{RG-ITC}+w_{rg-itm}\mathcal{L}_{RG-ITM}+w_{box}\mathcal{L}_{Box}.

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications

Full text

HCCM: Hierarchical Cross-Granularity Contrastive and Matching Learning for Natural Language-Guided Drones

Hao Ruan, Department of Artificial Intelligence, Xiamen University, Xiamen, China

Jinliang Lin, Department of Artificial Intelligence, Xiamen University, Xiamen, China

Yingxin Lai, Department of Artificial Intelligence, Xiamen University, Xiamen, China

Zhiming Luo, Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Xiamen, China

Shaozi Li, Fujian Key Laboratory of Big Data Application and Intellectualization for Tea Industry, Wuyi University, Wuyishan, China

(2025)

Abstract.

Natural Language-Guided Drones (NLGD) offer a novel and flexible interaction paradigm for tasks such as target matching and navigation. However, the wide field of view and complex compositional semantic relationships inherent in drone scenarios place greater demands on vision-language understanding. First, mainstream Vision-Language Models (VLMs) primarily focus on global feature alignment and lack fine-grained semantic understanding. Second, existing hierarchical semantic modeling methods rely on precise entity partitioning and strict containment relationship constraints, which limits their effectiveness in complex drone environments. To address these challenges, we propose the Hierarchical Cross-Granularity Contrastive and Matching learning (HCCM) framework, comprising two core components: 1) Region-Global Image-Text Contrastive Learning (RG-ITC). Avoiding precise scene entity partitioning, RG-ITC models hierarchical local-to-global cross-modal semantics by contrasting local visual regions with global text semantics, and vice versa. 2) Region-Global Image-Text Matching Learning (RG-ITM). Instead of relying on strict relationship constraints, this component evaluates local semantic consistency within global cross-modal representations, improving the comprehension of complex compositional semantics. Furthermore, drone scenario textual descriptions are often incomplete or ambiguous, destabilizing global semantic alignment. To mitigate this, HCCM incorporates a Momentum Contrast and Momentum Distillation (MCD) mechanism, enhancing alignment robustness. Extensive experiments on the GeoText-1652 benchmark demonstrate that HCCM significantly outperforms existing methods, achieving state-of-the-art Recall@1 scores of 28.8% (image retrieval) and 14.7% (text retrieval). Moreover, HCCM exhibits strong zero-shot generalization on the unseen ERA dataset, achieving 39.93% mean recall (mR), surpassing evaluated fine-tuned models. These results highlight the effectiveness and robustness of HCCM across diverse scenarios. Our implementation is available at https://github.com/rhao-hur/HCCM.

Keywords: Natural Language-Guided Drones, Cross-Modal Retrieval, Compositional Semantics

Published in: Proceedings of the 33rd ACM International Conference on Multimedia (MM ’25), October 27–31, 2025, Dublin, Ireland. DOI: 10.1145/3746027.3755489. ISBN: 979-8-4007-2035-2/2025/10.

CCS Concepts: Information systems → Image search; Computing methodologies → Visual content-based indexing and retrieval.

1. Introduction

In recent years, the application of Unmanned Aerial Vehicles (UAVs) has expanded from basic image acquisition to complex tasks such as agricultural monitoring (Gago et al., 2015; Kim et al., 2019; Tripicchio et al., 2015; Wardihani et al., 2018), target tracking (Zhang et al., 2022; Chen et al., 2022; Nousi et al., 2019), and cross-view target matching (Zheng et al., 2020; Lin et al., 2022). Among these, cross-view target matching has emerged as a crucial task: it aims to locate targets by matching images captured from different perspectives (e.g., UAV, satellite, ground) and is often formulated as an image retrieval problem. However, relying solely on visual queries is fragile, because performance degrades under variations in illumination, weather, and viewpoint (Wang et al., 2024; Lin et al., 2024). Furthermore, visual queries may not always be available in practical applications. Consequently, leveraging Natural Language-Guided Drones (NLGD) for target matching has emerged as a significant research direction, owing to their flexible querying approach and integrated language understanding capabilities (Chu et al., 2024).

Recent research in Natural Language-Guided Drones (NLGD) shows significant progress. To support research in the NLGD task, Chu et al. (Chu et al., 2024) introduced a large-scale NLGD dataset, GeoText-1652, and defined two core subtasks: UAV Text Navigation (text-guided UAV positioning) and UAV View Target Localization (matching descriptions to UAV views for target identification). They employed Vision-Language Models (VLMs) (Radford et al., 2021a; Chen et al., 2020; Li et al., 2022, 2021a; Zeng et al., 2022; Zhang et al., 2025) with contrastive learning to align global image-text representations in a shared embedding space. Concurrently, Huang et al. (Huang et al., 2024) introduced the ERA and UDV datasets for NLGD, but explored non-VLM approaches. They utilized Convolutional Neural Networks (CNNs) (Wang et al., 2017; Radoi and Datcu, 2019; Zhang et al., 2019) and Bidirectional Gated Recurrent Units (Bi-GRUs) (Cho et al., 2014) for visual-language encoding and developed the Text-Guided Visual Information Reasoning (TGVIR) mechanism for fine-grained cross-modal semantic alignment.

However, the NLGD task requires handling queries with compositional semantics, while current methods (Chu et al., 2024; Huang et al., 2024) often exhibit poor generalization for compositional understanding and fail to grasp cross-granularity semantic hierarchies. As illustrated in Figure 1(a), let $I_{i},T_{i}$ be the global image/text, $I_{i,k},T_{i,k}$ the local region image/text (solid box/sentence), and $I_{i,k}^{e},T_{i,k}^{e}$ the fine-grained entity image/text (dashed box/phrase). A global image $I_{i}$ typically contains multiple entity-level semantic regions (e.g., $I_{i,1}^{e},I_{i,2}^{e},I_{i,3}^{e}$ ) corresponding to entity descriptions (e.g., $T_{i,1}^{e}$ , $T_{i,2}^{e}$ , $T_{i,3}^{e}$ ) within the global text $T_{i}$ . The compositional interplay of these regions defines the scene’s semantics. Accurate scene distinction or instruction execution requires understanding the compositional roles of local regions within the global context. VLMs that rely on global semantic alignment often lack this fine-grained understanding. Furthermore, traditional sequence models (Cho et al., 2014) generalize poorly when interpreting complex compositional relationships, particularly in longer texts (Liu et al., 2020).

Recognizing the need to model relationships across different granularities, some VLM-based methods explore specific cross-modal interactions. For instance, Pal et al. (Pal et al., 2025) proposed Compositional Entailment Learning (Fig. 1(b)), modeling part-whole hierarchies via cross-modal contrastive learning ( $I_{i,k}^{e}\to T_{i}$ , $T_{i,k}^{e}\to I_{i}$ ) within hyperbolic space. This approach leverages semantic entailment learning (Le et al., 2019), assuming more abstract entities ( $I_{i,k}^{e},T_{i,k}^{e}$ ) entail the global concrete concepts ( $I_{i},T_{i}$ ). Textual descriptions typically express more abstract concepts than images. Formally, “A entails B” is defined as $B\subset A$ , implying intra-modal relations $T_{i}\subset T_{i,k}^{e}$ and $I_{i}\subset I_{i,k}^{e}$ , and inter-modal relations $I_{i,k}^{e}\subset T_{i,k}^{e}$ and $I_{i}\subset T_{i}$ . However, applying such strict, entailment-based hierarchies proves challenging for UAV bird’s-eye views. UAV imagery frequently features complexly intertwined elements (e.g., the Z-shaped road in Fig. 1(a)) and widely distributed similar elements (e.g., trees), resisting clear delineation into discrete entities suitable for rigid decomposition. Moreover, UAV scene descriptions often prioritize element co-occurrence and composition over strict semantic entailment. Consequently, the geometric constraints imposed by entailment learning (e.g., entailment cone loss (Ganea et al., 2018)) may be overly restrictive for flexibly capturing the compositional semantics inherent in UAV scenarios.

To address the above issues, we propose the Hierarchical Cross-Granularity Contrastive and Matching Learning (HCCM) method. Building upon the standard cross-modal contrastive and matching learning framework (Zeng et al., 2022), HCCM introduces Region-Global Image Text Contrastive Learning (RG-ITC) (Figure 1(c)). Unlike methods relying on entity partitioning, RG-ITC models semantic associations across granularities, specifically linking unimodal local information (image region $I_{i,k}$ or text fragment $T_{i,k}$ ) with the corresponding global representation of the other modality (text $T_{i}$ or image $I_{i}$ , respectively). This aims to capture the local-to-global cross-modal semantic hierarchical relationships within UAV scenarios. Furthermore, distinct from approaches modeling strict parent-child or part-whole relationships, we introduce Region-Global Image Text Matching Learning (RG-ITM) (Figure 1(d)) to enhance the model’s ability to discern the semantic consistency between local details and the global context across modalities. Specifically, it assesses whether the semantic content derived from a unimodal local region ( $I_{i,k}$ or $T_{i,k}$ ) is consistent with the corresponding global representation of the other modality ( $T_{i}$ or $I_{i}$ ). This process improves the model’s comprehension and discrimination of complex spatial layouts and intertwined semantics.

However, directly applying this strategy in drone scenarios has a limitation. Large-scale views often yield incomplete or ambiguous $T_{i}$ , causing local-global alignment to amplify noise bias (Arpit et al., 2017) and impair global performance. To mitigate this, we introduce Momentum Contrast and Momentum Distillation (MCD), employing negative queues and soft targets respectively to stabilize global alignment and enhance interference resistance. By combining RG-ITC, RG-ITM and MCD, the proposed HCCM improves the performance of VLMs in UAV scenarios.

In summary, the main contributions of this paper are as follows:

(1) A Hierarchical Cross-Granularity Contrastive and Matching Learning (HCCM) framework is presented to address insufficient fine-grained feature alignment and difficulty in modeling hierarchical relationships in Natural Language-Guided Drone tasks.

(2) Region-Global Image Text Contrastive Learning (RG-ITC) is designed to model cross-granularity hierarchies, and Region-Global Image Text Matching Learning (RG-ITM) is proposed to enhance composite semantic understanding.

(3) A Momentum Contrast and Momentum Distillation (MCD) strategy is introduced to mitigate noise amplification from incomplete text descriptions.

(4) Experiments on GeoText-1652 and ERA datasets validate the effectiveness and robustness of the proposed method.

2. Related Work

2.1. Vision and Language Navigation

Using natural language descriptions for positioning and navigation can enhance navigation efficiency and has attracted growing research attention. For retrieving satellite images from scene text descriptions, Ye et al. (Ye et al., 2024) proposed a text-based localization method, CrossText2Loc, which excels in handling long texts and in interpretability. Xia et al. (Xia et al., 2024) proposed a Self-Attention Pooling (SAP) module that integrates data from multiple modalities, including natural language, images, and point clouds, to achieve cross-modal place recognition. To navigate drones through natural language commands, Chu et al. (Chu et al., 2024) introduced a natural language-guided UAV geolocalization benchmark, GeoText-1652, and proposed blending spatial matching for region-level spatial relation matching. In addition, Huang et al. (Huang et al., 2024) utilized textual cues through Contextual Region Learning (CRL) and Consistency Semantic Alignment (CSA) mechanisms to guide the model in overcoming challenges related to context understanding and alignment in UAV images. Unlike these methods, our approach focuses on the insufficient fine-grained alignment in drone scenarios, which prior work has overlooked.

2.2. Visual Language Model for Feature Alignment

Vision-Language Models (VLMs) aim to learn joint representations of images and text. CLIP (Radford et al., 2021a) laid the groundwork using large-scale contrastive learning, followed by advancements like UNITER’s image-text matching (ITM) (Chen et al., 2020), ALBEF’s “align-before-fuse” strategy with hard negative mining (Li et al., 2021a), and X-VLM’s focus on multi-level concept alignment (Zeng et al., 2022). Architecturally, METER (Dou et al., 2022) assessed various encoders and fusion strategies, while the BLIP series (Li et al., 2022, 2023) employed lightweight modules like Q-Former to integrate understanding and generation tasks efficiently.

Standard global image-text alignment fails to capture the hierarchical part-whole concepts inherent in visual and linguistic data. To overcome this, researchers have shifted towards fine-grained and hierarchical approaches. Early methods encoded semantic hierarchies in embedding spaces using partial order constraints or lexical entailment (Vendrov et al., 2015; Nguyen et al., 2017; Vulić and Mrkšić, 2018), while others utilized visual structures, aligning text with segmented image regions (Arbeláez et al., 2011; Zhang and Maire, 2020) or focusing on object-level contrastive learning (Xie et al., 2021). Recent developments have exploited hyperbolic geometric spaces for hierarchical representation. HyCoCLIP (Pal et al., 2025), for instance, models image-text relationships within this space, employing cross-modal contrastive learning between parts and wholes and using the entailment cone loss (Ganea et al., 2018) to enforce hierarchical constraints both within and across modalities. However, its effectiveness depends on clear part-whole structures or explicit semantic relationships in the data.

3. Methodology

This section details the proposed Hierarchical Cross-Granularity Contrastive and Matching Learning (HCCM) framework. We first outline the base vision-language encoding process (3.1). We then elaborate on the two core components: Region-to-Global Image-Text Contrastive Learning (RG-ITC) for hierarchical semantics learning (3.2) and Region-Global Image-Text Matching Learning (RG-ITM) for compositional semantics understanding (3.3). Next, we introduce the Momentum Contrast and Momentum Distillation (MCD) mechanism employed for stabilizing global alignment (3.4). Finally, the overall training objective combining these elements is presented (3.5). Figure 2 illustrates the evolution from standard alignment frameworks to the proposed HCCM approach.

3.1. Vision-Language Encoding

The model processes data in batches of $N$ samples, where each input consists of a global image-text pair $(I_{i},T_{i})$ . Each image $I_{i}$ is associated with region patches $I_{i,k}$ , defined by bounding boxes $b_{i,k}$ and extracted from $I_{i}$ using ROI Align, along with corresponding text fragments $T_{i,k}$ . We adopt XVLM (Zeng et al., 2022) as the fundamental architecture, which comprises an image encoder $E_{v}$ , a text encoder $E_{t}$ , and a cross-modal fusion encoder $E_{f}$ .

Global inputs $I_{i}$ and $T_{i}$ are encoded by $E_{v}$ and $E_{t}$ to produce feature sequences $f_{v}^{(i)}$ and $f_{t}^{(i)}$ , with their [CLS] token embeddings $f_{v,[\text{CLS}]}^{(i)}$ and $f_{t,[\text{CLS}]}^{(i)}$ serving as global aggregated features. Regional patches $I_{i,k}$ and text fragments $T_{i,k}$ are similarly encoded to yield regional [CLS] embeddings $f_{v,[\text{CLS}]}^{(i,k)}$ and $f_{t,[\text{CLS}]}^{(i,k)}$ .

For contrastive learning, all [CLS] token embeddings are mapped through modality-specific projection layers (online: $\phi_{v},\phi_{t}$ ; momentum: $\phi_{v}^{m},\phi_{t}^{m}$ ) and L2-normalized to generate similarity computation embeddings including global online $z_{v}^{(i)},z_{t}^{(i)}$ , global momentum $z_{v}^{m(i)},z_{t}^{m(i)}$ , and regional online $z_{v}^{(i,k)},z_{t}^{(i,k)}$ .

3.2. Region-to-Global Image-Text Contrastive Learning

Standard Image-Text Contrastive (ITC) learning aims to globally align the semantic representations of image-text pairs, but it overlooks fine-grained cross-modal semantic information. We introduce Region-to-Global Image-Text Contrastive Learning (RG-ITC) to explicitly model part-to-whole cross-modal semantic hierarchical relationships, as illustrated in Figure 3(a).

Achieving this hierarchical modeling involves contrasting regional online embeddings against global momentum embeddings within a data batch. Specifically, for a batch of $N$ samples, regional pairs $(I_{i,k},T_{i,k})$ (where $k$ indexes regions in sample $i$ ) are processed via online encoders ( $E_{v},E_{t}$ ), projection layers ( $\phi_{v},\phi_{t}$ ), and L2 normalization to yield regional online embeddings $z_{v}^{(i,k)}$ and $z_{t}^{(i,k)}$ . The contrastive learning objective is then applied: for a regional visual embedding $z_{v}^{(i,k)}$ , its positive counterpart is the global textual momentum embedding $z_{t}^{m(i)}$ from the same sample $i$ , while the negative counterparts are the global textual momentum embeddings $z_{t}^{m(j)}$ from all other samples $j\neq i$ . This process is symmetric for regional text embeddings $z_{t}^{(i,k)}$ , which are contrasted against the global visual momentum embedding $z_{v}^{m(i)}$ (positive) and all $z_{v}^{m(j)}$ where $j\neq i$ (negatives).

The RG-ITC loss $\mathcal{L}_{RG-ITC}$ over all valid region pairs $(i,k)\in\mathcal{R}_{N}$ in the batch is:

[TABLE]

where $s(\cdot,\cdot)$ denotes cosine similarity, $\tau$ is the temperature, and $N$ is batch size. Minimizing $\mathcal{L}_{RG-ITC}$ fosters learning of local-to-global cross-modal associations.
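The contrast described above can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, embeddings are plain lists, and averaging over both directions and all regions (the factor 1/(2|R_N|)) is an assumption.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_softmax_at(logits, target):
    """Numerically stable log-softmax evaluated at a single index."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return logits[target] - log_z


def rg_itc_loss(region_v, region_t, owner, global_v_m, global_t_m, tau=0.07):
    """Region-to-global contrastive loss over one batch (hypothetical helper).

    region_v[r], region_t[r]: online embeddings of region r (image / text).
    owner[r]: index i of the sample that region r belongs to.
    global_v_m[j], global_t_m[j]: momentum global embeddings of sample j.
    """
    g_v = [l2_normalize(g) for g in global_v_m]
    g_t = [l2_normalize(g) for g in global_t_m]
    total = 0.0
    for z_v, z_t, i in zip(region_v, region_t, owner):
        z_v, z_t = l2_normalize(z_v), l2_normalize(z_t)
        # Regional image vs. all global texts; the positive is sample i.
        v2t = [dot(z_v, g) / tau for g in g_t]
        # Regional text vs. all global images; the positive is sample i.
        t2v = [dot(z_t, g) / tau for g in g_v]
        total += log_softmax_at(v2t, i) + log_softmax_at(t2v, i)
    # Assumption: average over both directions and all regions.
    return -total / (2 * len(owner))


# Toy batch of two samples with orthogonal global embeddings.
g = [[1.0, 0.0], [0.0, 1.0]]
aligned = rg_itc_loss(g, g, [0, 1], g, g)   # regions match their own sample
swapped = rg_itc_loss(g, g, [1, 0], g, g)   # regions match the other sample
```

In this toy batch the aligned assignment gives a loss near zero, while the swapped assignment is penalized by roughly 1/τ.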

3.3. Region-Global Image-Text Matching Learning

Image-Text Matching (ITM) is a core task for fine-grained vision-language understanding, typically assessing if a global image-text pair $(I_{i},T_{i})$ matches. Standard ITM often employs a fusion encoder $E_{f}$ to combine global features $f_{v}^{(i)}$ and $f_{t}^{(i)}$ , feeding the fused representation (e.g., from the [CLS] token) into a classification head $H_{match}$ to predict the match probability $p_{match}$ . Training minimizes a binary cross-entropy (BCE) loss $\mathcal{L}_{ITM}$ over positive and negative pairs:

[TABLE]

where $\mathcal{P}$ and $\mathcal{N}$ are sets of positive and negative pairs, and $y\in\{0,1\}$ is the ground-truth label.

However, standard ITM primarily focuses on global alignment. To better capture local-global consistency and composite semantics (Figure 3(b)), we introduce Region-Global ITM (RG-ITM). RG-ITM evaluates the alignment between uni-modal local features (regions/fragments) and the cross-modal global representation. Using the shared fusion encoder $E_{f}$ with global features $f_{v}^{(i)},f_{t}^{(i)}$ and regional features $f_{v}^{(i,k)},f_{t}^{(i,k)}$ , we construct positive fused representations by pairing regional features with their corresponding cross-modal global features:

[TABLE]

The set of these positive examples is $\mathcal{H}_{\text{pos}}=\bigl{\{}h_{rv-gt}^{+(i,k)},\;h_{gv-rt}^{+(i,k)}\,\bigm{|}\,(i,k)\in\mathcal{R}_{N}\bigr{\}}$ , where $\mathcal{R}_{N}$ covers all valid region pairs $(i,k)$ in the batch.

To improve discrimination, we employ hard negative mining based on online embedding similarity (Figure 3(b)). For each region $(i,k)$ with online embeddings $z_{v}^{(i,k)},z_{t}^{(i,k)}$ , we sample hard negative global counterparts from other samples $j^{\prime},l^{\prime}(\neq i)$ with probabilities proportional to their embedding similarity:

[TABLE]

This yields hard negative indices $j^{-}_{i,k}$ and $l^{-}_{i,k}$ for each region $(i,k)$ . Negative fused representations are then generated using these sampled indices:

[TABLE]

The set of negative examples is $\mathcal{H}_{\text{neg}}=\bigl{\{}h_{rv-gt}^{-(i,k)},\;h_{gv-rt}^{-(i,k)}\,\bigm{|}\,(i,k)\in\mathcal{R}_{N}\bigr{\}}$ .
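The sampling step can be illustrated with the standard library. The sketch below handles one direction (one region against the global embeddings of the other modality) and would be called twice to obtain both indices. Weighting candidates by a temperature-scaled softmax of the similarities is an assumption, as are the helper names; the text only states that probabilities are proportional to embedding similarity.

```python
import math
import random


def hard_negative_probs(sims, own_index, tau=0.07):
    """Sampling distribution over candidate global indices for one region.

    sims[j]: similarity between the region's online embedding and the
    global embedding of sample j in the other modality.
    own_index: the region's own sample i, which must never be drawn.
    """
    others = [s for j, s in enumerate(sims) if j != own_index]
    peak = max(others)  # subtract the max for numerical stability
    weights = [
        0.0 if j == own_index else math.exp((s - peak) / tau)
        for j, s in enumerate(sims)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def sample_hard_negative(sims, own_index, rng, tau=0.07):
    """Draw one negative index; more similar (harder) samples are likelier."""
    probs = hard_negative_probs(sims, own_index, tau)
    return rng.choices(range(len(sims)), weights=probs, k=1)[0]


rng = random.Random(0)
sims = [0.9, 0.8, 0.1, 0.2]  # the region belongs to sample 0
draws = [sample_hard_negative(sims, 0, rng) for _ in range(200)]
```

Because the region's own sample gets zero weight, it can never be drawn as a negative, and the most similar other sample (index 1 here) dominates the draws.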

Finally, we combine positive and negative examples $\mathcal{H}_{\text{RG}}=\mathcal{H}_{\text{pos}}\cup\mathcal{H}_{\text{neg}}$ . Each representation $h\in\mathcal{H}_{\text{RG}}$ is processed by the shared matching head $H_{\text{match}}$ to get a probability $p_{\text{match}}^{(h)}$ :

$$p_{\text{match}}^{(h)}=\text{softmax}\bigl(H_{\text{match}}(h)\bigr)\in\mathbb{R}^{2}.$$

The RG-ITM loss $\mathcal{L}_{\text{RG-ITM}}$ minimizes the BCE over all examples in $\mathcal{H}_{\text{RG}}$ :

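$$\mathcal{L}_{\text{RG-ITM}}=-\frac{1}{|\mathcal{H}_{\text{RG}}|}\sum_{h\in\mathcal{H}_{\text{RG}}}\Bigl[\,y^{(h)}\log p_{\text{match}}^{(h)}[1]+(1-y^{(h)})\log p_{\text{match}}^{(h)}[0]\Bigr],$$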

where $y^{(h)}$ is 1 for positive examples and 0 for negative ones.
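As a minimal sketch of this objective, the code below takes the two logits that the matching head would output for each fused representation and averages the binary cross-entropy over all positive and negative examples. The function names are hypothetical, and treating index 1 as the "match" class is an assumption inferred from the label convention above.

```python
import math


def log_softmax_pair(logit_no_match, logit_match):
    """Stable log-probabilities for the two-way match / no-match softmax."""
    peak = max(logit_no_match, logit_match)
    log_z = peak + math.log(
        math.exp(logit_no_match - peak) + math.exp(logit_match - peak)
    )
    return logit_no_match - log_z, logit_match - log_z


def rg_itm_loss(match_logits, labels):
    """Mean BCE over all fused representations in H_RG.

    match_logits[n]: (logit_no_match, logit_match) from the matching head.
    labels[n]: 1 for a positive region-global pair, 0 for a hard negative.
    """
    total = 0.0
    for (l0, l1), y in zip(match_logits, labels):
        log_p0, log_p1 = log_softmax_pair(l0, l1)
        total += y * log_p1 + (1 - y) * log_p0
    return -total / len(labels)


undecided = rg_itm_loss([(0.0, 0.0), (0.0, 0.0)], [1, 0])      # equals log 2
confident = rg_itm_loss([(-20.0, 20.0), (20.0, -20.0)], [1, 0])  # near 0
```

An undecided head yields log 2 per example, which is a convenient sanity check at the start of training.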

3.4. Stabilizing Global Alignment with Momentum Contrast and Distillation

However, the hierarchical strategies alone have a limitation in drone scenarios. Because drone bird’s-eye views cover a large area, textual descriptions ( $T_{i}$ and $T_{i,k}$ ) are often incomplete or ambiguous. Cross-modal local-to-global alignment can then amplify the effect of inaccurate or missing local descriptions, which in turn degrades the model’s crucial global alignment.

To address this issue, we introduce a dual technique applied to the global contrastive learning between $I_{i}$ and $T_{i}$ : Momentum Contrast and Momentum Distillation (MCD) (Figure 3(a)). These utilize online ( $\theta$ ) and momentum ( $\theta^{m}$ , updated via Exponential Moving Average) encoders.

Momentum Contrast. Inspired by MoCo (He et al., 2020), we employ momentum queues ( $Q_{v},Q_{t}$ ) storing historical global momentum features ( $z_{v}^{m},z_{t}^{m}$ ). The resulting large, stable negative sets ( $Z_{v}^{m}=\{z_{v}^{m(j)}\}_{j=1}^{N}\cup Q_{v}$ , $Z_{t}^{m}=\{z_{t}^{m(j)}\}_{j=1}^{N}\cup Q_{t}$ ) enhance the model’s discriminative ability in noisy data, forcing it to focus on true global differences rather than potentially amplified local noise signals.
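The two mechanics involved, the exponential-moving-average update of the momentum encoder and the fixed-size feature queue, can be sketched as follows. The class and function names are hypothetical, parameters and features are plain Python lists, and β = 0.995 is the momentum value reported in the implementation details.

```python
from collections import deque


def ema_update(momentum_params, online_params, beta=0.995):
    """theta_m <- beta * theta_m + (1 - beta) * theta, element by element."""
    return [
        beta * m + (1.0 - beta) * o
        for m, o in zip(momentum_params, online_params)
    ]


class MomentumQueue:
    """FIFO store of past global momentum features (one per modality)."""

    def __init__(self, max_size):
        # deque(maxlen=...) silently drops the oldest items once it is full.
        self.items = deque(maxlen=max_size)

    def candidates(self, batch_momentum_feats):
        """Current batch features followed by the queued history."""
        return list(batch_momentum_feats) + list(self.items)

    def enqueue(self, batch_momentum_feats):
        self.items.extend(batch_momentum_feats)


queue = MomentumQueue(max_size=4)
queue.enqueue([[1.0], [2.0], [3.0]])
queue.enqueue([[4.0], [5.0], [6.0]])   # the two oldest features are evicted
pool = queue.candidates([[7.0], [8.0]])
```

With the reported queue size of 57,600 and batch size of 24, such a queue would hold features from the most recent 2,400 batches.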

Momentum Distillation. Inspired by ALBEF (Li et al., 2021b), we generate soft targets ( $q^{(i)}$ ) for global ITC by blending momentum model predictions with ground-truth labels ( $y^{(i)}$ ) using coefficient $\alpha$ and temperature parameter $\tau$ :

[TABLE]

Momentum distillation, through temporally smoothed model parameters, constructs a trend-based supervisory signal for the model’s learning process, guiding the online encoder to resist noise interference.
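A short sketch of how such soft targets could be formed is given below. It assumes the ALBEF-style blend, in which α weights the momentum model's softmax prediction and (1 − α) weights the one-hot label; the exact form used by HCCM may differ, and the function names are hypothetical.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_targets(momentum_sims, positive_index, alpha=0.4, tau=0.07):
    """Blend momentum predictions with the one-hot ground truth.

    momentum_sims[j]: similarity computed by the momentum encoders between
    sample i and candidate j; positive_index marks the true pair.
    Assumption: q = alpha * softmax(sims / tau) + (1 - alpha) * one_hot.
    """
    preds = softmax([s / tau for s in momentum_sims])
    return [
        alpha * p + (1.0 - alpha) * (1.0 if j == positive_index else 0.0)
        for j, p in enumerate(preds)
    ]


# With an undecided momentum model, 40% of the mass is spread uniformly.
q = soft_targets([0.5, 0.5, 0.5], positive_index=0)
```

The result is still a probability distribution, so it can be used directly as the target of a cross-entropy loss.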

The stabilized global ITC loss using MCD, denoted as $\mathcal{L}_{ITC_{MCD}}$ , employs cross-entropy ( $H(\cdot,\cdot)$ ) between online global predictions and these soft targets:

[TABLE]

This approach, by stabilizing the foundational global alignment, enhances model performance and robustness when handling local-to-global information in challenging drone environments.
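The cross-entropy against soft targets can be sketched as below. The inputs are assumed to be precomputed similarities between each online global embedding and the momentum candidates (current batch plus queue). The function names are hypothetical, and averaging the two directions over the batch is an assumption.

```python
import math


def log_softmax(logits):
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_z for v in logits]


def soft_cross_entropy(targets, logits):
    """H(q, softmax(logits)) for a soft target distribution q."""
    return -sum(q * lp for q, lp in zip(targets, log_softmax(logits)))


def itc_mcd_loss(q_i2t, q_t2i, sims_i2t, sims_t2i, tau=0.07):
    """Global ITC loss with soft targets (hypothetical helper).

    q_i2t[i], q_t2i[i]: soft target distributions for sample i.
    sims_i2t[i][j]: online image i vs. momentum text candidate j.
    sims_t2i[i][j]: online text i vs. momentum image candidate j.
    """
    total = 0.0
    for i in range(len(q_i2t)):
        total += soft_cross_entropy(q_i2t[i], [s / tau for s in sims_i2t[i]])
        total += soft_cross_entropy(q_t2i[i], [s / tau for s in sims_t2i[i]])
    # Assumption: average over both directions and the batch.
    return total / (2 * len(q_i2t))


one_hot = [[1.0, 0.0], [0.0, 1.0]]
good = itc_mcd_loss(one_hot, one_hot, one_hot, one_hot)       # aligned sims
flat = itc_mcd_loss(one_hot, one_hot, [[0.0, 0.0]] * 2, [[0.0, 0.0]] * 2)
```

With one-hot targets this reduces to the standard ITC loss; softer targets penalize the online model less when it assigns probability to candidates that the momentum model also considers plausible.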

3.5. Overall Training Objective

HCCM’s overall training objective $\mathcal{L}_{total}$ combines multiple losses to jointly optimize hierarchical, part-to-whole cross-modal alignment and learn complex compositional semantics, particularly those involving relational structures in visual scenes. Consistent with standard practices (Zeng et al., 2022; Chu et al., 2024), we incorporate a bounding box regression loss, $\mathcal{L}_{Box}$ , to further refine the model’s region localization capabilities. This loss penalizes the difference between the predicted bounding box $\hat{b}_{k}$ (regressed from fused features via head $H_{box}$ ) and the ground-truth box $b_{i,k}$ corresponding to text fragment $T_{i,k}$ , using a combination of L1 distance and GIoU loss (Rezatofighi et al., 2019):

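$$\mathcal{L}_{Box}=\lambda_{L1}\,\lVert\hat{b}_{k}-b_{i,k}\rVert_{1}+\lambda_{GIoU}\,\mathcal{L}_{GIoU}(\hat{b}_{k},b_{i,k})$$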

where $\lambda_{L1}$ and $\lambda_{GIoU}$ are respective weights.
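
A minimal Python sketch of such a box regression term is given below. It assumes boxes in corner format `(x1, y1, x2, y2)` with positive area, takes the GIoU loss as $1-\mathrm{GIoU}$ following Rezatofighi et al. (2019), and uses placeholder weights of 1.0 because the values of $\lambda_{L1}$ and $\lambda_{GIoU}$ are not stated here. The function names are hypothetical.

```python
def giou(box_a, box_b):
    """Generalized IoU of two (x1, y1, x2, y2) boxes with positive area."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    # Intersection rectangle, clamped to zero when the boxes do not overlap.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (hull - union) / hull


def box_loss(pred, target, lambda_l1=1.0, lambda_giou=1.0):
    """Weighted sum of L1 distance and GIoU loss for one predicted box."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target))
    return lambda_l1 * l1 + lambda_giou * (1.0 - giou(pred, target))


perfect = box_loss((0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 1.0))
disjoint = box_loss((0.0, 0.0, 1.0, 1.0), (2.0, 0.0, 3.0, 1.0))
```

Unlike plain IoU, the GIoU term still varies with the distance between non-overlapping boxes, which is why it is commonly paired with L1 for box regression.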

The complete training objective $\mathcal{L}_{total}$ for the HCCM framework is the weighted sum of all constituent losses:

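$$\mathcal{L}_{total}=w_{itc}\,\mathcal{L}_{ITC_{MCD}}+w_{itm}\,\mathcal{L}_{ITM}+w_{rg-itc}\,\mathcal{L}_{RG-ITC}+w_{rg-itm}\,\mathcal{L}_{RG-ITM}+w_{box}\,\mathcal{L}_{Box}$$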

This objective combines the stabilized global contrastive loss with MCD ( $\mathcal{L}_{ITC_{MCD}}$ , Eq. 12), the standard global matching loss ( $\mathcal{L}_{ITM}$ , Eq. 2), the proposed region-global contrastive ( $\mathcal{L}_{RG-ITC}$ , Eq. 1) and matching ( $\mathcal{L}_{RG-ITM}$ , Eq. 10) objectives, along with the bounding box regression term ( $\mathcal{L}_{Box}$ , Eq. 13). Each component’s contribution is balanced by its weight $w_{(\cdot)}$ . Minimizing $\mathcal{L}_{total}$ guides the HCCM model to effectively learn hierarchical, multi-granularity vision-language representations by integrating both global context and local details.

4. Experiments

This section evaluates the proposed HCCM method on the Natural Language-Guided Drone (NLGD) tasks of UAV text navigation (text-to-image retrieval) and UAV view target localization (image-to-text retrieval). We first outline the experimental setup, detailing the datasets, metrics (4.1), and implementation (4.2). We then present comparative results against state-of-the-art methods (4.3), followed by ablation studies (4.4) and an assessment of zero-shot generalization (4.5). Finally, visualization is provided (4.6).

4.1. Datasets and Evaluation Metrics

We conduct training and primary evaluation using the GeoText-1652 dataset (Chu et al., 2024), strictly adhering to its official data splits and evaluation protocols. For assessing zero-shot generalization, we additionally evaluate performance on the ERA dataset (Huang et al., 2024).

Across all experiments, performance is measured using Recall@K metrics, specifically Recall@1 (R@1), Recall@5 (R@5), and Recall@10 (R@10). In the zero-shot generalization evaluation, we additionally use mean Recall (mR).
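
For concreteness, the metrics can be sketched as follows. The sketch assumes that a query counts as a hit when at least one correct gallery item appears among its top-K results, and that mR is the arithmetic mean of the reported recall values (commonly R@1, R@5, and R@10 in both retrieval directions); the official protocols of the two datasets remain authoritative. Function names and the toy rankings are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Percentage of queries with at least one relevant item in the top k.

    ranked_ids: one ranked list of gallery ids per query (best match first).
    relevant_ids: one set of correct gallery ids per query.
    """
    hits = sum(
        1
        for ranking, relevant in zip(ranked_ids, relevant_ids)
        if any(item in relevant for item in ranking[:k])
    )
    return 100.0 * hits / len(ranked_ids)


def mean_recall(recalls):
    """Arithmetic mean of a collection of Recall@K values."""
    return sum(recalls) / len(recalls)


# Two toy queries over a three-item gallery; item 1 is correct for both.
rankings = [[3, 1, 2], [2, 3, 1]]
relevant = [{1}, {1}]
r1, r2, r3 = (recall_at_k(rankings, relevant, k) for k in (1, 2, 3))
m_r = mean_recall([r1, r2, r3])
```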

4.2. Implementation Details

For fair comparison, we follow the setup of GeoText-1652 (Chu et al., 2024), which uses a standard XVLM (Zeng et al., 2022) model pre-trained on 16 million image-text pairs as the backbone. We fine-tune our model on the GeoText-1652 dataset for 6 epochs with batch size 24, using the AdamW (Loshchilov and Hutter, 2019) optimizer (initial learning rate $3\times 10^{-5}$ , weight decay $0.01$ ). Key hyperparameters for our HCCM method are as follows: momentum $\beta=0.995$ , queue size $Q=57,600$ , distillation $\alpha=0.4$ , and temperature $\tau=0.07$ . Loss weights, determined via preliminary search, are $w_{itc}=0.25$ , $w_{itm}=1$ , $w_{rg-itc}=0.25$ , $w_{rg-itm}=0.5$ , and $w_{box}=0.1$ .
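
To make the weighting concrete, the loss weights listed above can be plugged into the weighted sum from Section 3.5. The snippet below is only a sketch: the dictionary keys, the `total_loss` helper, and the example loss values are hypothetical, while the weights are the ones reported in this section.

```python
# Loss weights as reported in Section 4.2; the key names are illustrative.
LOSS_WEIGHTS = {
    "itc_mcd": 0.25,  # w_itc, stabilized global contrastive loss
    "itm": 1.0,       # w_itm, global matching loss
    "rg_itc": 0.25,   # w_rg-itc, region-global contrastive loss
    "rg_itm": 0.5,    # w_rg-itm, region-global matching loss
    "box": 0.1,       # w_box, bounding box regression loss
}


def total_loss(losses, weights=LOSS_WEIGHTS):
    """Weighted sum of the component losses; every weighted term must be present."""
    missing = set(weights) - set(losses)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weights[name] * losses[name] for name in weights)


# Made-up per-batch loss values, used only to exercise the function.
example_losses = {"itc_mcd": 2.0, "itm": 0.5, "rg_itc": 2.0, "rg_itm": 0.4, "box": 1.0}
example_total = total_loss(example_losses)
```

With these weights the global matching term carries the largest coefficient, and the box regression term acts as a light auxiliary signal.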

4.3. Comparison with State-of-the-art Methods

In this experiment, we compare our HCCM with existing competitive methods on the GeoText-1652 dataset under both zero-shot and fine-tuned settings. We report the results of R@1, R@5, and R@10 across all methods, alongside the model parameters and the pretrained image size used. Notably, we reproduce HyCoCLIP (Pal et al., 2025) on the NLGD task using publicly available code.

From the results shown in Table 1, we can observe that:

(1) In both the Image Query and Text Query settings, our method achieves the best performance, significantly outperforming other methods across all metrics. Specifically, we reach 28.8% R@1 in the Image Query setting and 14.7% R@1 in the Text Query setting.
(2) Compared to GeoText-1652 (Chu et al., 2024), the previous state-of-the-art method for the NLGD task, our method models the hierarchical relationships across modalities and performs fine-grained feature alignment, leading to superior performance.
(3) HyCoCLIP (Pal et al., 2025) employs compositional entailment learning to model part-whole hierarchical relationships, which is greatly limited in drone scenarios. In contrast, our method utilizes the proposed RG-ITC and RG-ITM learning strategies, which are better at extracting the complex, intertwined spatial semantic information present in drone scenes.
(4) All methods perform better in the Image Query setting than in the Text Query setting, indicating that retrieval with text queries is more challenging.

4.4. Ablation Study

We perform ablation studies on the GeoText-1652 (Chu et al., 2024) dataset to evaluate the effectiveness of the individual components of the proposed HCCM. Following Chu et al. (2024), we adopt the standard XVLM (Zeng et al., 2022) as our baseline. Results are presented in Table 2 and Table 3.

Table 2 reports the contribution of the individual components of HCCM. Adding only momentum contrast (MC) and momentum distillation (MD) (row 3) improves the baseline's Image Query R@1 from 25.51% to 26.86% (+1.35 points), indicating that stabilizing the global representation is useful. Similarly, integrating only the cross-granularity learning components (i.e., RG-ITC and RG-ITM) raises R@1 to 27.04% (+1.53 points), confirming the benefit of capturing fine-grained information. Combining all components yields the highest performance, with an Image Query R@1 of 28.82% and a Text Query R@1 of 14.73%, which surpasses the momentum-only configuration (row 3) by 1.96 points and the cross-granularity-only setup (row 5) by 1.78 points in Image Query R@1. These results highlight a significant synergy: robust global alignment provides a stable foundation, while cross-granularity learning contributes essential fine-grained details, leading to optimal performance when combined.
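
The synergy claim can be checked with simple arithmetic on the Image Query R@1 values quoted in this paragraph. The short Python sketch below recomputes each gain over the baseline and compares the full model's gain with the sum of the two partial gains; the dictionary keys are illustrative labels, not names used in Table 2.

```python
# Image Query R@1 (%) as quoted in the ablation discussion.
r1 = {
    "baseline": 25.51,
    "mcd_only": 26.86,                # momentum contrast + distillation
    "cross_granularity_only": 27.04,  # RG-ITC + RG-ITM
    "full": 28.82,
}

# Gain of each configuration over the baseline, in percentage points.
gain = {
    name: round(score - r1["baseline"], 2)
    for name, score in r1.items()
    if name != "baseline"
}

# If the two component groups were purely additive, the full model would gain this much.
sum_of_parts = round(gain["mcd_only"] + gain["cross_granularity_only"], 2)
extra = round(gain["full"] - sum_of_parts, 2)
```

The full model's gain of 3.31 points exceeds the 2.88 points obtained by adding the two partial gains, so the combination is worth about 0.43 points more than the components in isolation would suggest.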

Table 3 further investigates the impact of the directional losses within RG-ITC and RG-ITM. Compared to the full HCCM model (row 1), removing any single directional loss component degrades performance. Notably, excluding the text-region to global image association in RG-ITC (i.e., $-\mathcal{L}_{RG-ITC}(T_{i,k}\rightarrow I_{i})$ in row 3) reduces the Image Query R@1 by 1.93 points to 26.89%. Similarly, removing the text-region to global image matching in RG-ITM (i.e., $-\mathcal{L}_{RG-ITM}(T_{i,k}\leftrightarrow I_{i})$ in row 5) causes a drop of 1.96 points to 26.86%. This suggests that learning text-to-image associations is particularly crucial, and it confirms that all proposed bidirectional cross-modal learning components contribute positively to overall performance.

4.5. Zero-shot Generalization Evaluation

We assess generalization via zero-shot cross-modal retrieval on the unseen ERA dataset (Huang et al., 2024), using models fine-tuned on GeoText-1652 (Chu et al., 2024). Table 4 compares HCCM with other competitive methods fine-tuned on GeoText-1652 (Chu et al., 2024; Pal et al., 2025; Zeng et al., 2022) and with benchmark models fine-tuned directly on ERA (Faghri et al., 2018; Song and Soleymani, 2019; Radford et al., 2021b; Chun et al., 2021; Yuan et al., 2022a, b; Huang et al., 2024).

In zero-shot evaluation (Table 4, bottom), HCCM surpasses all methods fine-tuned only on GeoText-1652, achieving the best R@1 (19.93% in image retrieval, 18.58% in text retrieval) and mR (39.93%). This exceeds the second-best method, GeoText-1652 (Chu et al., 2024), by 2.02 points in image R@1, 1.49 points in text R@1, and 1.93 points in mR. Crucially, the zero-shot performance of HCCM on ERA even exceeds that of models fine-tuned directly on ERA (Table 4, top). The zero-shot mR of HCCM (39.93%) surpasses the best fine-tuned mR (38.96%, by VCSR (Huang et al., 2024)). Likewise, the zero-shot R@1 scores of HCCM outperform the best respective fine-tuned R@1 results (14.18% image (Yuan et al., 2022a), 15.65% text (Huang et al., 2024)). These results demonstrate that the hierarchical cross-granularity learning strategy of HCCM leads to strong zero-shot generalization. By learning robust, transferable representations, HCCM models compositional semantics well enough to surpass models fine-tuned on the target domain.

4.6. Visualizing Attention for Semantic Grounding

We compare GradCAM (Selvaraju et al., 2017) activation maps (Figure 4) for HCCM (row d) and the previous SOTA method, GeoText-1652 (Chu et al., 2024) (row c), to assess fine-grained and compositional grounding against the text descriptions (row a).

The SOTA method struggles to accurately ground fine-grained entities (e.g., “blue dome” in col 2, “solar panels” in col 5) and compositional relationships (e.g., “building surrounded by” in col 1, the “parking lot”/“road” layout in col 6). Its diffuse attention in complex scenes (cols 3, 4) suggests difficulty parsing intricate arrangements, possibly from over-reliance on global matching. Conversely, HCCM shows significantly improved semantic grounding, localizing fine-grained details (“blue dome” in col 2, “solar panels” in col 5) and better capturing compositional semantics: relating fields to “surrounding buildings” (col 1), reflecting campus structure via “roads and pathways” (col 3), linking fields to “residential streets” (col 4), and grounding the building/“parking lot”/“road” context (col 6). This improved relational grounding, consistent with the RG-ITC and RG-ITM objectives, is likely aided by RG-ITM’s consistency evaluation. Notably, in scenes with large uniform areas (e.g., fields in cols 1, 3), neither model strongly activates the primary object. This might occur because distinguishing such scenes relies more on grounding discriminative textual descriptions of contrasting features, boundaries, or relationships than on the homogeneous central region.

In summary, HCCM’s hierarchical modeling (RG-ITC) and consistency evaluation (RG-ITM) yield more precise semantic grounding than global alignment methods alone. Its enhanced relational interpretation benefits complex NLGD tasks, especially in drone-view scenarios.

5. Conclusion

This paper introduces HCCM to enhance fine-grained and compositional understanding in Natural Language-Guided Drone (NLGD) tasks. By integrating Region-Global Contrastive (RG-ITC) and Matching (RG-ITM) learning, HCCM effectively models hierarchical cross-modal semantics and evaluates local-global consistency without requiring strict entity partitioning. Furthermore, a Momentum Contrast and Distillation (MCD) mechanism stabilizes global alignment against ambiguous descriptions common in drone views. Extensive experiments validate HCCM’s effectiveness, demonstrating state-of-the-art performance on the GeoText-1652 benchmark and superior zero-shot generalization on the ERA dataset, highlighting its robustness for complex drone vision-language applications.

Acknowledgements.

This work is supported by the National Natural Science Foundation of China under Grant Nos. 62276221 and 62376232, and by the Open Project Program of Fujian Key Laboratory of Big Data Application and Intellectualization for Tea Industry, Wuyi University (No. FKLBDAITI202401).

Bibliography

Arbeláez et al. (2011) Pablo Arbeláez, Michael Maire, Charless Fowlkes, and Jitendra Malik. 2011. Contour Detection and Hierarchical Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 5 (2011), 898–916.
Arpit et al. (2017) Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, and Simon Lacoste-Julien. 2017. A Closer Look at Memorization in Deep Networks. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70. 233–242.
Chen et al. (2022) Mei Chen, Xiaoyan Wang, Hong Wang, and Shufang Zhao. 2022. A UAV-Based Energy-Efficient and Real-Time Object Detection System with Multi-Source Image Fusion. Journal of Circuits, Systems and Computers 31, 09 (2022), 2250166.
Chen et al. (2020) Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. UNITER: Universal Image-Text Representation Learning. In European Conference on Computer Vision. 104–120.
Cho et al. (2014) Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. (2014), 1724–1734.
Chu et al. (2024) Meng Chu, Zhedong Zheng, Wei Ji, Tingyu Wang, and Tat-Seng Chua. 2024. Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching. In European Conference on Computer Vision. 213–231.
Chun et al. (2021) Sanghyuk Chun, Seong Joon Oh, Rafael Sampaio de Rezende, Yannis Kalantidis, and Diane Larlus. 2021. Probabilistic Embeddings for Cross-Modal Retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8415–8424.