BINO: Encoder Centric Self Supervised Stereo With Native Pair Input

Haokun Zhou

arXiv:2603.27904 · cs.CV · March 31, 2026

TL;DR

BINO is an encoder-centric self-supervised stereo method that fuses the rectified stereo pair at the input stage to learn binocular structure. Without an explicit linkage module, it gives the best frozen descriptor results among the compared baselines.

Contribution

The paper demonstrates that strong binocular structure can be learned within a compact encoder using input-stage fusion and a specialized positional encoding, reducing reliance on explicit linkage modules.

Findings

01. BINO achieves the best frozen descriptor results on proxy dense stereo and KITTI benchmarks.

02. It maintains competitive performance with a smaller encoder compared to CroCo v2.

03. Transfer experiments show consistent qualitative improvements.
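
To make "frozen descriptor results on proxy dense stereo" concrete, here is a minimal sketch of one way such a probe could work. It is an assumption, not the paper's evaluation protocol: descriptors from a frozen encoder are compared by cosine similarity along the same image row (valid for rectified pairs), and the best-scoring horizontal shift is taken as the disparity. The function name `rowwise_disparity` and the `max_disp` parameter are hypothetical.

```python
import numpy as np

def rowwise_disparity(feat_left, feat_right, max_disp):
    """Hypothetical no-linkage probe: nearest-neighbour matching along rows.

    feat_left, feat_right: (H, W, C) descriptors from a frozen encoder.
    Returns an (H, W) integer disparity map for the left view.
    """
    H, W, _ = feat_left.shape
    # L2-normalise so that a dot product equals cosine similarity.
    fl = feat_left / (np.linalg.norm(feat_left, axis=-1, keepdims=True) + 1e-8)
    fr = feat_right / (np.linalg.norm(feat_right, axis=-1, keepdims=True) + 1e-8)

    # scores[y, x, d] = similarity of left pixel (y, x) and right pixel (y, x - d).
    scores = np.full((H, W, max_disp + 1), -np.inf)
    for d in range(max_disp + 1):
        # Columns x < d have no counterpart at shift d and stay at -inf.
        scores[:, d:, d] = np.sum(fl[:, d:] * fr[:, :W - d], axis=-1)

    # No learned cross-view module: just pick the best shift per pixel.
    return np.argmax(scores, axis=-1)
```

Because the matching step has no trainable parameters, any gain in disparity accuracy under a probe like this would have to come from the descriptors themselves, which is the point of evaluating the encoder frozen.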

Abstract

Stereo needs features that preserve fine cross view correspondence rather than only semantic similarity. Recent self supervised vision models transfer well, but they are not built for this goal, and geometry focused methods often rely on a binocular decoder or another explicit linkage module during pretraining. BINO asks whether strong binocular structure can instead be learned inside a compact encoder. It does this by fusing the rectified pair at the input stage, forming stereo micro cell tokens, and using a row aware patch phase positional encoding. Training uses one view masked token only distillation together with occlusion and view specific appearance mismatch. In a strict low resource setting with pretraining only on KITTI object, BINO gives the best frozen descriptor results under a no linkage probe among all compared baselines on proxy dense stereo, hard negative retrieval, and…
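
As an illustration of what fusing the rectified pair at the input stage might look like, the sketch below stacks the two views along the channel axis before patchifying, so every token already carries pixels from both views at the same image rows. This is only one plausible reading of the abstract: the actual stereo micro cell layout and the row aware patch phase positional encoding are not specified here and are not reproduced. The function name `fuse_pair_to_tokens`, the `patch` size, and the returned row index are all hypothetical.

```python
import numpy as np

def fuse_pair_to_tokens(left, right, patch=8):
    """Hypothetical input-stage fusion of a rectified stereo pair.

    left, right: (H, W, C) images with H and W divisible by `patch`.
    Returns (tokens, rows): tokens has shape (N, patch * patch * 2C),
    rows holds the patch-row index of each token.
    """
    assert left.shape == right.shape
    H, W, C = left.shape
    assert H % patch == 0 and W % patch == 0

    # Stack views per pixel: channels [0:C] are left, [C:2C] are right.
    pair = np.concatenate([left, right], axis=-1)          # (H, W, 2C)

    gh, gw = H // patch, W // patch
    cells = pair.reshape(gh, patch, gw, patch, 2 * C)
    cells = cells.transpose(0, 2, 1, 3, 4)                  # (gh, gw, patch, patch, 2C)
    tokens = cells.reshape(gh * gw, patch * patch * 2 * C)  # one token per cell

    # Patch-row index per token; a row-aware positional encoding could use it.
    rows = np.repeat(np.arange(gh), gw)
    return tokens, rows
```

With a 16 x 32 RGB pair and `patch=8`, this yields 8 tokens of length 384, each covering the same 8 x 8 window in both views. Rectification is what makes the shared row index meaningful, since corresponding points lie on the same row.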
