TL;DR
This paper introduces a wavelet-based stereo matching framework that processes high- and low-frequency image components separately, improving convergence and accuracy in challenging scenes with fine details.
Contribution
It proposes a novel wavelet-based stereo matching framework with separate frequency processing and an LSTM-based high-frequency preservation update operator, addressing the inconsistent convergence of high- and low-frequency regions in existing iterative methods.
Findings
Outperforms state-of-the-art methods on KITTI benchmarks.
Achieves first place on KITTI 2015 and 2012 leaderboards.
Effectively preserves high-frequency details like edges and thin objects.
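One way to check a detail-preservation claim like the one above is to report end-point error (EPE, the mean absolute disparity error) separately for high- and low-frequency regions. The sketch below is illustrative only: the function name `epe_by_region` and the boolean `high_freq_mask` input are assumptions, and the paper's own procedure for defining frequency regions is not given in this summary.

```python
import numpy as np

def epe_by_region(pred_disp, gt_disp, high_freq_mask):
    """Return (EPE in high-frequency region, EPE in low-frequency region).

    pred_disp, gt_disp: 2D disparity maps of the same shape.
    high_freq_mask: boolean array, True where a pixel is treated as
    high-frequency (e.g. edges, thin objects). How this mask is built
    is an assumption left to the caller.
    """
    pred_disp = np.asarray(pred_disp, dtype=np.float64)
    gt_disp = np.asarray(gt_disp, dtype=np.float64)
    mask = np.asarray(high_freq_mask, dtype=bool)

    # For stereo, the per-pixel end-point error is the absolute disparity error.
    err = np.abs(pred_disp - gt_disp)

    # Guard against an empty region so the mean is well defined.
    high = float(err[mask].mean()) if mask.any() else float("nan")
    low = float(err[~mask].mean()) if (~mask).any() else float("nan")
    return high, low
```

Tracking both numbers after each refinement iteration would show whether the two regions converge at different rates.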
Abstract
We find that the EPE evaluation metrics of RAFT-Stereo converge inconsistently in the low- and high-frequency regions, resulting in high-frequency degradation (e.g., edges and thin objects) during the iterative process. The underlying reason for the limited performance of current iterative methods is that they optimize all frequency components together without distinguishing between high and low frequencies. We propose a wavelet-based stereo matching framework (Wavelet-Stereo) for solving frequency convergence inconsistency. Specifically, we first explicitly decompose an image into high- and low-frequency components using the discrete wavelet transform. Then, the high-frequency and low-frequency components are fed into two different multi-scale frequency feature extractors. Finally, we propose a novel LSTM-based high-frequency preservation update operator containing an iterative frequency adapter…
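The decomposition step described in the abstract can be illustrated with a single-level 2D Haar transform (a reviewer below notes that Haar wavelets are used). This is a minimal NumPy sketch, not the authors' implementation: the function names `haar_dwt2` and `haar_idwt2`, the orthonormal scaling, and the restriction to even-sized grayscale images are all assumptions made for illustration.

```python
import numpy as np

def haar_dwt2(img):
    """Single-level 2D Haar DWT of a grayscale image with even height and width.

    Returns the low-frequency band and a tuple of three high-frequency bands,
    each at half the input resolution.
    """
    img = np.asarray(img, dtype=np.float64)
    h, w = img.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")

    # The four pixels of every non-overlapping 2x2 block.
    a = img[0::2, 0::2]  # top-left
    b = img[0::2, 1::2]  # top-right
    c = img[1::2, 0::2]  # bottom-left
    d = img[1::2, 1::2]  # bottom-right

    low = (a + b + c + d) / 2.0        # block average: smooth content
    row_diff = (a + b - c - d) / 2.0   # change between rows: horizontal edges
    col_diff = (a - b + c - d) / 2.0   # change between columns: vertical edges
    diag = (a - b - c + d) / 2.0       # diagonal detail
    return low, (row_diff, col_diff, diag)

def haar_idwt2(low, highs):
    """Inverse of haar_dwt2; reconstructs the original image exactly."""
    row_diff, col_diff, diag = highs
    h, w = low.shape
    out = np.empty((2 * h, 2 * w), dtype=np.float64)
    out[0::2, 0::2] = (low + row_diff + col_diff + diag) / 2.0
    out[0::2, 1::2] = (low + row_diff - col_diff - diag) / 2.0
    out[1::2, 0::2] = (low - row_diff + col_diff - diag) / 2.0
    out[1::2, 1::2] = (low - row_diff - col_diff + diag) / 2.0
    return out
```

In a two-branch design like the one the abstract describes, `low` would feed the low-frequency feature extractor and the three detail bands the high-frequency one. Because the transform is invertible, the split itself discards no information.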
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Despite some missing baselines, the experimental protocol is well-structured and follows standard practice, covering major benchmarks such as Scene Flow, KITTI, and ETH3D.
2. The paper provides a sufficiently detailed analysis of how different frequency components (high vs. low) behave during iterative optimization, offering an interesting perspective for understanding and improving convergence in stereo matching networks.
1. **Missing Baselines**: The paper would benefit from including comparisons with recent strong baselines such as FoundationStereo (CVPR 2025) and S²M² (ICCV 2025). These models are highly relevant and would help position the proposed method more convincingly within the current state of the art.
2. **Middlebury Benchmark**: Including results on the Middlebury dataset would strengthen the evaluation, as it provides valuable insights into model performance on high-resolution indoor scenes.
3. **Mo
1. The paper introduces the concept of frequency convergence inconsistency, a previously underexplored issue in iterative stereo matching, and provides both theoretical analysis and empirical evidence to support this claim.
2. The proposed Wavelet-Stereo module is lightweight, plug-and-play, and compatible with existing iterative models (e.g., RAFT-Stereo, MonSter). It demonstrates consistent improvements across multiple benchmarks and strong zero-shot generalization ability.
3. The method ach
1. **Lack of Comparison with Recent Methods**: The paper does not compare with recent foundational stereo models such as FoundationStereo[1] and Stereo Anywhere[2], which limits the reader's understanding of how it performs against the most recent and powerful baselines.
2. **Appendix Figure Layout Issues**: The appendix contains noticeable formatting problems, including missing references to Figure 11 and 12, which disrupts the flow and completeness of the qualitative results section.
3. **Li
- Impressive results are achieved on multiple benchmark datasets.
- The proposed method is well motivated.
- The major concern is about the technical novelty of the proposed method. Incorporating frequency-based ideas into neural networks for image processing has been widely investigated across many tasks. In the area of stereo matching, a couple of approaches have been developed (e.g., Waveletstereo). Consequently, the difference
- While the proposed method produces impressive performance, its additional cost is not fully discussed. In Table 5, only runtime is presented. More analyses in te
The paper introduces a clean and intuitive insight—explicitly separating high- and low-frequency components using Haar wavelets—to address the inherent tension in stereo matching between preserving fine details and ensuring smooth, consistent disparities. The proposed HPU operator is a simple yet impactful mechanism that prevents the blurring of texture details during iterative refinement by decoupling high-frequency context from low-frequency propagation.
1. Insufficient Evidence on Computational Efficiency: The paper heavily emphasizes faster convergence (e.g., achieving comparable results in 2 vs. 32 iterations) as a key advantage, but it provides an incomplete analysis of the overall computational cost. The introduced wavelet decomposition, dual-branch feature extraction, and complex HPU operator likely incur significant overhead, and without a comprehensive comparison of total FLOPs, parameter count, or inference time against baselines, the cl
