ChiTransformer: Towards Reliable Stereo from Cues
Qing Su, Shihao Ji

TL;DR
ChiTransformer is a biologically inspired, self-supervised binocular depth estimation method that uses a vision transformer with gated positional cross-attention to rectify monocular depth cues with information retrieved from the second view.
Contribution
The paper proposes a novel ChiTransformer architecture that leverages gated positional cross-attention in vision transformers for reliable stereo depth estimation, inspired by the human visual system.
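To make the idea of a gated cross-attention layer concrete, here is a minimal NumPy sketch. It is not the authors' implementation: it assumes one plausible form of gating, in which a sigmoid gate blends a content-based attention map (queries from one view, keys and values from the other) with a position-based attention map. The names `gated_cross_attention`, `pos_scores`, and `gate_logit` are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gated_cross_attention(master_tokens, reference_tokens,
                          w_q, w_k, w_v, pos_scores, gate_logit):
    """Illustrative gated cross-attention between two views (assumed form).

    master_tokens:    (n, d) tokens of the view being refined
    reference_tokens: (m, d) tokens of the other view
    w_q, w_k, w_v:    (d, d) projection matrices
    pos_scores:       (n, m) positional attention logits (hypothetical input)
    gate_logit:       scalar; sigmoid(gate_logit) weights the positional map
    """
    q = master_tokens @ w_q        # queries come from the master view
    k = reference_tokens @ w_k     # keys come from the reference view
    v = reference_tokens @ w_v     # values come from the reference view

    content_attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    positional_attn = softmax(pos_scores)

    # Convex combination of the two maps, so each row still sums to 1.
    gate = 1.0 / (1.0 + np.exp(-gate_logit))
    attn = (1.0 - gate) * content_attn + gate * positional_attn
    return attn @ v, attn


# Toy usage with random inputs: 4 master tokens, 6 reference tokens, width 8.
rng = np.random.default_rng(0)
n, m, d = 4, 6, 8
master = rng.standard_normal((n, d))
reference = rng.standard_normal((m, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
pos_scores = rng.standard_normal((n, m))

fused, attn = gated_cross_attention(master, reference, w_q, w_k, w_v,
                                    pos_scores, gate_logit=0.0)
print(fused.shape, attn.shape)  # (4, 8) (4, 6)
```

With `gate_logit=0.0` the two attention maps are weighted equally; a large positive value makes the layer rely almost entirely on position, and a large negative value makes it rely on content similarity between the views.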
Findings
Achieves an 11% improvement over state-of-the-art self-supervised stereo approaches.
Effective on both rectilinear and fisheye images.
Demonstrates robustness in dynamic and cluttered environments.
Abstract
Current stereo matching techniques are challenged by restricted searching space, occluded regions and sheer size. While single image depth estimation is spared from these challenges and can achieve satisfactory results with the extracted monocular cues, the lack of stereoscopic relationship renders the monocular prediction less reliable on its own, especially in highly dynamic or cluttered environments. To address these issues in both scenarios, we present an optic-chiasm-inspired self-supervised binocular depth estimation method, wherein a vision transformer (ViT) with gated positional cross-attention (GPCA) layers is designed to enable feature-sensitive pattern retrieval between views while retaining the extensive context information aggregated through self-attentions. Monocular cues from a single view are thereafter conditionally rectified by a blending layer with the retrieved…
Taxonomy
Topics: Advanced Vision and Imaging · Image Processing Techniques and Applications · Advanced Image Processing Techniques
Methods: Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Multi-Head Attention · Residual Connection · Layer Normalization · Vision Transformer
