CostFormer: Cost Transformer for Cost Aggregation in Multi-view Stereo
Weitao Chen, Hongbin Xu, Zhipeng Zhou, Yang Liu, Baigui Sun, Wenxiong Kang, Xuansong Xie

TL;DR
CostFormer introduces an efficient Transformer-based approach for cost aggregation in Multi-view Stereo, addressing CNN limitations and computational challenges to improve long-range feature aggregation.
Contribution
The paper proposes CostFormer, a novel Transformer-based cost aggregation network with RDACT and RRT modules, overcoming computational limits and enhancing MVS performance.
Findings
Effective long-range feature aggregation via self-attention.
Reduced memory usage and inference latency.
Universal plug-in for existing MVS methods.
Abstract
The core of Multi-view Stereo (MVS) is the matching process between reference and source pixels. Cost aggregation plays a significant role in this process, and previous methods handle it with CNNs. These methods may inherit a natural limitation of CNNs: with limited local receptive fields, they fail to discriminate repetitive or incorrect matches. To address this issue, we aim to bring Transformers into cost aggregation. However, the computational complexity of a Transformer grows quadratically, which can cause memory overflow and high inference latency. In this paper, we overcome these limits with an efficient Transformer-based cost aggregation network, namely CostFormer. The Residual Depth-Aware Cost Transformer (RDACT) is proposed to aggregate long-range features on the cost volume via self-attention mechanisms along the depth and spatial dimensions.…
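As a rough illustration of what self-attention over a cost volume along the depth and spatial dimensions can look like, the NumPy sketch below flattens a small volume into tokens and applies plain single-head attention with a residual connection. It is not the paper's RDACT module: the function name `cost_volume_self_attention`, the (C, D, H, W) layout, the random projection matrices, and the toy sizes are all assumptions made for illustration. The N x N attention matrix it returns also shows where the quadratic cost mentioned in the abstract comes from.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cost_volume_self_attention(cost, w_q, w_k, w_v):
    """Plain single-head self-attention over a cost volume (illustrative only).

    cost: array of shape (C, D, H, W), an assumed layout with C feature
          channels, D depth hypotheses and an H x W spatial grid.
    w_q, w_k, w_v: hypothetical (C, C) projection matrices.
    Returns the aggregated volume (same shape) and the (N, N) attention
    matrix, where N = D * H * W.
    """
    c, d, h, w = cost.shape
    # One token per (depth, row, col) position, each with C features.
    tokens = cost.reshape(c, d * h * w).T  # (N, C)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    # Every token attends to every other token: N * N scores.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    out = tokens + attn @ v  # residual connection
    return out.T.reshape(c, d, h, w), attn


rng = np.random.default_rng(0)
C, D, H, W = 8, 4, 6, 6  # toy sizes, chosen arbitrarily
cost = rng.standard_normal((C, D, H, W))
w_q, w_k, w_v = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))

out, attn = cost_volume_self_attention(cost, w_q, w_k, w_v)
print(out.shape)   # (8, 4, 6, 6)
print(attn.shape)  # (144, 144), i.e. N x N with N = D * H * W
```

Even at this toy size the attention matrix has 144 x 144 entries; at realistic depth and image resolutions N grows quickly, which is the memory and latency problem the abstract says an efficient design has to avoid.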
Taxonomy
Topics: Advanced Vision and Imaging · Image Processing Techniques and Applications · Advanced Image Processing Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Dense Connections · Adam · Residual Connection · Absolute Position Encodings · Softmax
