AANet: Adaptive Aggregation Network for Efficient Stereo Matching
Haofei Xu, Juyong Zhang

TL;DR
AANet introduces a lightweight, efficient stereo matching architecture that replaces costly 3D convolutions with novel cost aggregation modules, achieving faster inference and competitive accuracy on benchmark datasets.
Contribution
The paper proposes an architecture built on two modules: sparse-points-based intra-scale cost aggregation and a neural network approximation of cross-scale cost aggregation. Together they replace 3D convolutions, reducing computation substantially while keeping accuracy comparable.
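The idea of aggregating costs over a few sampled points, rather than a full window, can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `sparse_aggregate` and its parameters are hypothetical, and the sampling offsets and weights are passed in by hand as fixed integers, whereas a learned module would predict them from the data.

```python
from typing import List, Sequence, Tuple

# Cost volume indexed as cost[d][y][x]: disparity, row, column.
Cost = List[List[List[float]]]


def sparse_aggregate(
    cost: Cost,
    offsets: Sequence[Tuple[int, int]],
    weights: Sequence[float],
) -> Cost:
    """Aggregate every cost entry over a small set of sampled neighbours.

    out[d][y][x] = sum_k weights[k] * cost[d][y + dy_k][x + dx_k]

    Assumption for this sketch: offsets are fixed integers supplied by the
    caller. Samples that fall outside the image are skipped.
    """
    if len(offsets) != len(weights):
        raise ValueError("need exactly one weight per offset")
    num_disp, height, width = len(cost), len(cost[0]), len(cost[0][0])
    out = [[[0.0] * width for _ in range(height)] for _ in range(num_disp)]
    for d in range(num_disp):
        for y in range(height):
            for x in range(width):
                total = 0.0
                for (dy, dx), w in zip(offsets, weights):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < height and 0 <= xx < width:
                        total += w * cost[d][yy][xx]
                out[d][y][x] = total
    return out


# Toy cost volume: 1 disparity level, 2 rows, 3 columns.
cost = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]]
offsets = [(0, 0), (0, 1), (1, 0)]   # centre, right neighbour, lower neighbour
weights = [0.5, 0.25, 0.25]
agg = sparse_aggregate(cost, offsets, weights)
```

Because each output entry touches only `len(offsets)` samples, the work per entry does not grow with a window size. Choosing sample positions that stay on one side of a depth edge is what lets this kind of aggregation avoid mixing foreground and background costs.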
Findings
Speeds up existing models by up to 41 times
Achieves competitive results on the Scene Flow and KITTI datasets
Runs inference in 62 ms
Abstract
Despite the remarkable progress made by learning-based stereo matching algorithms, one key challenge remains unsolved. Current state-of-the-art stereo models are mostly based on costly 3D convolutions, whose cubic computational complexity and high memory consumption make them quite expensive to deploy in real-world applications. In this paper, we aim at completely replacing the commonly used 3D convolutions to achieve fast inference speed while maintaining comparable accuracy. To this end, we first propose a sparse-points-based intra-scale cost aggregation method to alleviate the well-known edge-fattening issue at disparity discontinuities. Further, we approximate the traditional cross-scale cost aggregation algorithm with neural network layers to handle large textureless regions. Both modules are simple, lightweight, and complementary, leading to an effective and efficient architecture for…
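To make the cost of 3D convolutions concrete, the sketch below counts multiply-accumulate operations (MACs) for a single layer. All sizes are illustrative assumptions, not values from the paper, and the helper names `conv3d_macs` and `conv2d_macs` are hypothetical. The comparison assumes a 3D convolution applied to a feature-rich cost volume of shape C x D x H x W, against a 2D convolution applied to a D x H x W cost volume with the disparity axis treated as channels.

```python
def conv3d_macs(d: int, h: int, w: int, c_in: int, c_out: int, k: int = 3) -> int:
    """MACs for one stride-1, same-padded 3D convolution over a d*h*w volume."""
    return d * h * w * c_in * c_out * k ** 3


def conv2d_macs(h: int, w: int, c_in: int, c_out: int, k: int = 3) -> int:
    """MACs for one stride-1, same-padded 2D convolution over an h*w map."""
    return h * w * c_in * c_out * k ** 2


# Illustrative sizes only (assumptions for this sketch).
D, H, W, C = 48, 128, 256, 32

macs_3d = conv3d_macs(D, H, W, c_in=C, c_out=C)   # volume of shape C x D x H x W
macs_2d = conv2d_macs(H, W, c_in=D, c_out=D)      # volume of shape D x H x W
ratio = macs_3d // macs_2d
print(f"3D conv: {macs_3d:,} MACs, 2D conv: {macs_2d:,} MACs, ratio: {ratio}x")
```

The per-layer ratio depends entirely on the assumed sizes and says nothing about end-to-end runtime, so it should not be read as reproducing the speedups reported for AANet. It only shows why the extra kernel dimension and the extra volume dimension make 3D layers so much heavier.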
Taxonomy
Topics: Advanced Vision and Imaging · Image Enhancement Techniques · Advanced Image and Video Retrieval Techniques
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
