Building extraction with vision transformer
Libo Wang, Shenghui Fang, Rui Li, Xiaoliang Meng

TL;DR
This paper introduces BuildFormer, a novel Vision Transformer architecture with a dual-path structure for precise building extraction from remote sensing images, addressing CNN limitations in global dependency modeling and detail preservation.
Contribution
The paper proposes BuildFormer, a dual-path Vision Transformer with window-based attention, improving global dependency modeling and spatial detail preservation for remote sensing building extraction.
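To make "window-based attention" concrete, here is a minimal NumPy sketch of self-attention restricted to non-overlapping windows of a feature map. It is not BuildFormer's implementation: the function names are hypothetical, there is a single head, and queries, keys, and values are all the raw tokens (no learned projections), which is an assumption made to keep the example short.

```python
import numpy as np

def window_partition(x, win):
    """Split an (H, W, C) feature map into non-overlapping win x win windows.

    Returns an array of shape (num_windows, win * win, C).
    """
    H, W, C = x.shape
    assert H % win == 0 and W % win == 0, "H and W must be divisible by win"
    x = x.reshape(H // win, win, W // win, win, C)
    x = x.transpose(0, 2, 1, 3, 4)  # group the two window-index axes together
    return x.reshape(-1, win * win, C)

def window_self_attention(x, win):
    """Single-head self-attention computed independently inside each window.

    Assumption: Q = K = V = the window tokens (no learned projections).
    """
    tokens = window_partition(x, win)                          # (nW, N, C)
    C = tokens.shape[-1]
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(C)   # (nW, N, N)
    scores = scores - scores.max(axis=-1, keepdims=True)       # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ tokens                                    # (nW, N, C)

feature_map = np.random.rand(16, 16, 4)
out = window_self_attention(feature_map, win=8)
print(out.shape)  # (4, 64, 4): 4 windows, 64 tokens each, 4 channels
```

Each token attends only to the tokens in its own window, so the attention matrix per window is 64 x 64 here rather than 256 x 256 for the whole map.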
Findings
Achieved 75.74% IoU on the Massachusetts dataset.
Outperformed existing CNN-based methods.
Reduced computational complexity with window-based attention.
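The two quantities behind these findings can be illustrated with a short sketch: IoU for a binary building mask, and a rough count of attention-matrix entries with and without windows. The helper names are hypothetical, the empty-mask convention is an assumption, and the paper's exact evaluation protocol is not stated in this summary.

```python
import numpy as np

def binary_iou(pred, target):
    """Intersection over union for two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # assumption: two empty masks count as a perfect match
    return float(intersection / union)

def attention_entries(height, width, win=None):
    """Number of pairwise attention scores for a height x width token grid.

    win=None means global attention; otherwise attention is restricted to
    non-overlapping win x win windows (height and width divisible by win).
    """
    n = height * width
    if win is None:
        return n * n
    tokens_per_window = win * win
    return (n // tokens_per_window) * tokens_per_window ** 2

pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(binary_iou(pred, target))         # 0.5 (2 shared pixels / 4 in the union)
print(attention_entries(64, 64))        # 16777216
print(attention_entries(64, 64, win=8)) # 262144
```

For a fixed window size, the windowed count grows linearly with the number of tokens, while the global count grows quadratically; this is the general reason window-based attention lowers cost on large inputs.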
Abstract
As an important carrier of human productive activities, the extraction of buildings is not only essential for urban dynamic monitoring but also necessary for suburban construction inspection. Nowadays, accurate building extraction from remote sensing images remains a challenge due to the complex background and diverse appearances of buildings. The convolutional neural network (CNN) based building extraction methods, although they have increased accuracy significantly, are criticized for their inability to model global dependencies. Thus, this paper applies the Vision Transformer to building extraction. However, the practical use of the Vision Transformer often comes with two limitations. First, the Vision Transformer requires more GPU memory and incurs higher computational costs than CNNs. This limitation is further magnified when encountering large-sized inputs like fine-resolution…
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Vision Transformer · Max Pooling · Batch Normalization · Kaiming Initialization · Layer Normalization · Byte Pair Encoding
