# Fusing Geometric and Semantic Features via Cosine Similarity Cross-Attention for Remote Sensing Scene Classification

**Authors:** Xuefei Xu, Chengjun Xu

PMC · DOI: 10.3390/s26051613 · Sensors (Basel, Switzerland) · 2026-03-04

## TL;DR

This paper introduces CBCAM-LGM, a framework for remote sensing scene classification that fuses shallow geometric features with high-level semantic features through a bidirectional cross-attention module. It reports 97.81% accuracy on the AID dataset at 1.21 GMACs and 11.237 million parameters.

## Contribution

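The CBCAM-LGM framework uses a cross-level bidirectional complementary attention module (CBCAM) to fuse multi-level features, and parallel dilated convolutions with shared weights to capture multi-scale context. It reports state-of-the-art accuracy on AID with reduced computational complexity.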

## Key findings

- Fusing shallow and high-level features improves classification accuracy in remote sensing scenes.
- The proposed model achieves 97.81% accuracy on the AID dataset, surpassing ViT-B-16 by 1.63%.
- The model reduces computational complexity to 1.21 GMACs while maintaining high performance.
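
The headline figures can be cross-checked with a little arithmetic. The sketch below uses the accuracy and margin listed above and the parameter count reported in the abstract. Two inputs are assumptions rather than values from this summary: the ViT-B-16 parameter count (taken as roughly 86 million, the commonly cited size) and the reading of the 1.63% margin as absolute percentage points.

```python
# Figures reported in this summary.
model_accuracy = 97.81   # accuracy on AID, in percent
margin_over_vit = 1.63   # reported margin over ViT-B-16
model_params_m = 11.237  # parameters, in millions (from the abstract)

# Assumption (not stated in this summary): ViT-B-16 has roughly 86 million parameters.
VIT_B16_PARAMS_M = 86.0

# Assumption: the 1.63% margin is an absolute difference in percentage points.
implied_vit_accuracy = model_accuracy - margin_over_vit

# Relative parameter reduction with respect to the assumed ViT-B-16 size.
param_reduction_pct = 100.0 * (1.0 - model_params_m / VIT_B16_PARAMS_M)

print(f"Implied ViT-B-16 accuracy on AID: {implied_vit_accuracy:.2f}%")
print(f"Parameter reduction vs ViT-B-16: {param_reduction_pct:.1f}%")
```

Under these assumptions the reduction comes out at about 87%, which agrees with the figure quoted in the abstract.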

## Abstract

What are the main findings?

- The extraction and integration of multi-level features, such as shallow and high-level features, significantly enhance the accuracy of remote sensing scene classification.
- A bidirectional cross-attention mechanism effectively fuses shallow and high-level features while suppressing redundant information and improving feature discriminability.
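
To make the idea of bidirectional cross-attention concrete, here is a minimal sketch in plain Python. It is not the authors' CBCAM: the full text is not included in this summary, so every detail is an assumption made for illustration, including the function names, the `temperature` parameter, the absence of learned projections, the residual addition, and the pooling order. Each branch queries the other, with cosine similarity between token vectors used as the attention score.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cosine_cross_attention(queries, keys_values, temperature=0.1):
    """Each query token attends over the other branch's tokens.

    Scores are cosine similarities divided by a hypothetical temperature;
    the same tokens serve as keys and values (no learned projections).
    """
    normed_keys = [l2_normalize(k) for k in keys_values]
    dim = len(keys_values[0])
    attended = []
    for q in queries:
        qn = l2_normalize(q)
        scores = [sum(a * b for a, b in zip(qn, k)) / temperature for k in normed_keys]
        weights = softmax(scores)
        attended.append([sum(w * v[d] for w, v in zip(weights, keys_values)) for d in range(dim)])
    return attended

def mean_pool(tokens):
    """Average a list of token vectors into a single vector."""
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(len(tokens[0]))]

def bidirectional_fuse(shallow, deep, temperature=0.1):
    """Fuse two branches: each queries the other, then both are pooled and concatenated."""
    shallow_ctx = cosine_cross_attention(shallow, deep, temperature)  # shallow queries deep
    deep_ctx = cosine_cross_attention(deep, shallow, temperature)     # deep queries shallow
    shallow_out = [[a + b for a, b in zip(t, c)] for t, c in zip(shallow, shallow_ctx)]
    deep_out = [[a + b for a, b in zip(t, c)] for t, c in zip(deep, deep_ctx)]
    return mean_pool(shallow_out) + mean_pool(deep_out)  # list concatenation

# Toy tokens: two 2-dimensional tokens per branch (made-up values).
fused = bidirectional_fuse([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [1.0, 1.0]])
print(len(fused))
```

Because the scores are cosine similarities, they are bounded in [-1, 1] regardless of feature magnitude, which is one plausible reason to prefer them when fusing features from layers with very different activation scales.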

What are the implications of the main findings?

- Shallow and high-level features capture complementary information: shallow features preserve physical structures and local details (e.g., edges and textures), while high-level features encode rich semantic content.
- The fusion of these heterogeneous features enhances the overall representational capacity of the model, leading to more robust and interpretable scene classification, especially in complex environments with high intra-class variation and inter-class similarity.

High-resolution remote sensing image scene classification (HRRSI-SC) is crucial for obtaining accurate Earth surface information. However, the task remains challenging due to significant background interference, high intra-class variation, and subtle inter-class similarities. Convolutional neural networks (CNNs) are constrained by their local receptive fields, which limits their ability to capture long-range spatial dependencies. Vision Transformers (e.g., ViT-B-16), on the other hand, excel at global feature extraction but often suffer from high computational complexity and may lack the inductive biases for local feature modeling that CNNs possess. To address these limitations, this paper proposes a cross-level feature complementary classification framework based on Lie Group manifold space, termed CBCAM-LGM. Within this framework, multi-granularity features are first distilled via a global average pooling layer to suppress redundant information. The core of the approach, the cross-level bidirectional complementary attention module (CBCAM), then enables the adaptive fusion of features from both branches through a cross-query attention mechanism. Furthermore, parallel dilated convolutions that share a single set of convolutional weights capture multi-scale contextual information, which reduces the computational complexity to 1.21 GMACs while preserving multi-scale representation with minimal parameter overhead. Extensive experiments on challenging benchmarks demonstrate the model’s efficacy: it achieves a state-of-the-art classification accuracy of 97.81% on the AID dataset, surpassing the ViT-B-16 baseline by 1.63%, while containing only 11.237 million parameters (an 87% reduction relative to ViT-B-16). Together, these results indicate that the model is an efficient solution that combines high accuracy with low complexity.
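
The parameter-sharing idea behind the parallel dilated convolutions can be sketched in a few lines of plain Python. This is an illustrative reading of the abstract, not the authors' layer: the single-channel input, the 3x3 kernel, the dilation rates (1, 2, 3), the zero padding, and the averaging of branches are all assumptions, as are the function names. One kernel is applied at several dilation rates, so the receptive field grows while the number of stored weights stays fixed.

```python
def dilated_conv2d(image, kernel, dilation):
    """Same-size 2D cross-correlation with zero padding and the given dilation.

    `image` is a list of rows; `kernel` is a square list of rows with odd size.
    """
    height, width = len(image), len(image[0])
    size = len(kernel)
    half = size // 2
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            acc = 0.0
            for a in range(size):
                for b in range(size):
                    # Kernel taps are spread apart by the dilation rate.
                    y = i + (a - half) * dilation
                    x = j + (b - half) * dilation
                    if 0 <= y < height and 0 <= x < width:  # zero padding outside
                        acc += kernel[a][b] * image[y][x]
            out[i][j] = acc
    return out

def shared_multiscale(image, kernel, dilations=(1, 2, 3)):
    """Run the SAME kernel at several dilation rates and average the branches."""
    branches = [dilated_conv2d(image, kernel, d) for d in dilations]
    height, width = len(image), len(image[0])
    return [[sum(br[i][j] for br in branches) / len(branches) for j in range(width)]
            for i in range(height)]

# Toy input: a 5x5 image with a single bright pixel in the centre.
impulse = [[0.0] * 5 for _ in range(5)]
impulse[2][2] = 1.0
ones = [[1.0] * 3 for _ in range(3)]

multiscale = shared_multiscale(impulse, ones)

# Stored weights with and without sharing, for three branches of one 3x3 kernel.
shared_weights = 3 * 3
unshared_weights = 3 * 3 * 3
print(shared_weights, unshared_weights)
```

With dilation 1 the corner pixel of the toy image cannot see the centre, while with dilation 2 it can; the wider context comes from the spacing of the taps, not from additional weights.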

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12987323/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12987323/full.md

## References

73 references — full list in the complete paper: https://tomesphere.com/paper/PMC12987323/full.md

---
Source: https://tomesphere.com/paper/PMC12987323