BINO: Encoder Centric Self Supervised Stereo With Native Pair Input
Haokun Zhou

TL;DR
BINO is an encoder-centric self-supervised stereo method that fuses the rectified stereo pair at the input stage to learn binocular structure. It reports the best frozen-descriptor results among the compared baselines without relying on an explicit linkage module.
Contribution
It demonstrates that strong binocular structure can be learned within a compact encoder by fusing the rectified pair at the input stage and using a row-aware patch-phase positional encoding, reducing reliance on a binocular decoder or other explicit linkage modules during pretraining.
Findings
Under a no-linkage probe, BINO gives the best frozen-descriptor results among the compared baselines on proxy dense stereo and KITTI benchmarks.
It remains competitive with CroCo v2 while using a smaller encoder.
Transfer experiments show consistent qualitative improvements.
Abstract
Stereo needs features that preserve fine cross-view correspondence rather than only semantic similarity. Recent self-supervised vision models transfer well, but they are not built for this goal, and geometry-focused methods often rely on a binocular decoder or another explicit linkage module during pretraining. BINO asks whether strong binocular structure can instead be learned inside a compact encoder. It does this by fusing the rectified pair at the input stage, forming stereo micro-cell tokens, and using a row-aware patch-phase positional encoding. Training uses one-view masked token-only distillation together with occlusion and view-specific appearance mismatch. In a strict low-resource setting with pretraining only on KITTI object, BINO gives the best frozen-descriptor results under a no-linkage probe among all compared baselines on proxy dense stereo, hard-negative retrieval, and…
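To make the idea of input-stage fusion concrete, here is a minimal sketch of one way a rectified pair could be turned into joint tokens before any encoder layer runs. It is not the paper's implementation: the function name `fuse_pair_to_tokens`, the patch size, and the choice to interleave left and right pixel values inside each patch are all assumptions made for illustration, and the actual construction of BINO's stereo micro-cell tokens may differ. The sketch relies only on the fact that in a rectified pair, corresponding points lie on the same image row, so a patch cut at the same position in both views covers the same rows.

```python
from typing import List, Sequence, Tuple

Pixel = Sequence[float]          # channel values of one pixel
Image = Sequence[Sequence[Pixel]]  # image[y][x] -> Pixel


def fuse_pair_to_tokens(
    left: Image, right: Image, patch: int = 2
) -> Tuple[List[List[float]], List[int]]:
    """Cut both views into aligned patches and emit one joint token per patch.

    Hypothetical layout: within a patch, each left pixel is followed
    immediately by the right pixel at the same (y, x) position.
    """
    height, width = len(left), len(left[0])
    if len(right) != height or len(right[0]) != width:
        raise ValueError("rectified views must have the same size")
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of patch")

    tokens: List[List[float]] = []
    row_index: List[int] = []  # patch row of each token
    for top in range(0, height, patch):
        for start in range(0, width, patch):
            token: List[float] = []
            for y in range(top, top + patch):
                for x in range(start, start + patch):
                    token.extend(left[y][x])   # left-view pixel
                    token.extend(right[y][x])  # same position, right view
            tokens.append(token)
            row_index.append(top // patch)
    return tokens, row_index


# Tiny 4x4 single-channel pair: left is all 0.0, right is all 1.0.
left_view = [[(0.0,) for _ in range(4)] for _ in range(4)]
right_view = [[(1.0,) for _ in range(4)] for _ in range(4)]
tokens, row_index = fuse_pair_to_tokens(left_view, right_view, patch=2)
```

Because both views enter every token, the encoder sees left and right evidence together from its first layer, which is the property the abstract contrasts with a separate binocular decoder. The returned `row_index` is the kind of per-token row information that a row-aware positional encoding could consume; how BINO actually encodes it is not specified in the text above.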
