TL;DR
This paper introduces a bitemporal image transformer (BIT) for remote sensing change detection. BIT models spatial-temporal context efficiently with a small set of semantic tokens and outperforms convolutional baselines in accuracy at a lower computational cost.
Contribution
The paper proposes a novel transformer-based framework that uses semantic tokens to model spatial-temporal context in remote sensing change detection, improving efficiency and accuracy over existing methods.
Findings
Outperforms convolutional baselines while using about 3x less computation
Surpasses several state-of-the-art attention-based methods in accuracy
Effective with a simple ResNet18 backbone without complex structures
Abstract
Modern change detection (CD) has achieved remarkable success by the powerful discriminative ability of deep convolutions. However, high-resolution remote sensing CD remains challenging due to the complexity of objects in the scene. Objects with the same semantic concept may show distinct spectral characteristics at different times and spatial locations. Most recent CD pipelines using pure convolutions are still struggling to relate long-range concepts in space-time. Non-local self-attention approaches show promising performance via modeling dense relations among pixels, yet are computationally inefficient. Here, we propose a bitemporal image transformer (BIT) to efficiently and effectively model contexts within the spatial-temporal domain. Our intuition is that the high-level concepts of the change of interest can be represented by a few visual words, i.e., semantic tokens. To achieve…
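The idea of summarizing each image with a few semantic tokens can be illustrated with a small sketch. The truncated abstract above does not say how the tokens are computed, so everything here is an assumption made for illustration: each token is taken to be a softmax-weighted average of per-pixel feature vectors, the names `semantic_tokens` and `bitemporal_tokens` are hypothetical, and `weights` stands in for a learned pointwise projection that is filled with random numbers below. It is plain Python, not the authors' implementation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def semantic_tokens(features, weights):
    """Pool N pixel features (each of length C) into L tokens of length C.

    features: list of N vectors, i.e. an H x W feature map flattened to N = H*W.
    weights:  list of L vectors of length C (hypothetical stand-in for a
              learned pointwise projection; one vector per token).
    """
    channels = len(features[0])
    tokens = []
    for w in weights:
        # One score per pixel, then a softmax across all pixels (spatial attention).
        scores = [sum(wi * fi for wi, fi in zip(w, f)) for f in features]
        attn = softmax(scores)
        # The token is the attention-weighted average of the pixel features.
        token = [sum(a * f[c] for a, f in zip(attn, features)) for c in range(channels)]
        tokens.append(token)
    return tokens


def bitemporal_tokens(feat_t1, feat_t2, weights):
    """Tokenize both dates with shared weights and concatenate the token sets."""
    return semantic_tokens(feat_t1, weights) + semantic_tokens(feat_t2, weights)


# Toy example with random numbers in place of real backbone features.
random.seed(0)
C, N, L = 8, 16, 4  # channels, pixels per image, tokens per image (all assumed)
feat_t1 = [[random.gauss(0, 1) for _ in range(C)] for _ in range(N)]
feat_t2 = [[random.gauss(0, 1) for _ in range(C)] for _ in range(N)]
weights = [[random.gauss(0, 1) for _ in range(C)] for _ in range(L)]

tokens = bitemporal_tokens(feat_t1, feat_t2, weights)
print(len(tokens), len(tokens[0]))
```

Under these assumptions, any context modeling that follows operates on 2L tokens rather than 2N pixels. Dense pairwise attention over both images would consider (2N)^2 pixel pairs, while attention among the tokens considers only (2L)^2 pairs, which is one way to read the efficiency argument in the abstract.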
Taxonomy
Methods: Convolution · 1x1 Convolution · Feature Pyramid Network
