Bootstrapped Self-Supervised Training with Monocular Video for Semantic Segmentation and Depth Estimation
Yihao Zhang, John J. Leonard

TL;DR
This paper introduces a bootstrapped self-supervised learning framework that leverages temporal consistency in monocular videos to improve semantic segmentation and depth estimation beyond initial supervised training.
Contribution
It presents a novel self-supervised training method that enhances pre-trained models using unlabeled monocular video data, improving both segmentation and depth estimation.
Findings
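Further improves a well-trained, state-of-the-art semantic segmentation network beyond its supervised training baseline, using only unlabeled video.
Depth estimation learned with the bootstrapped framework outperforms both purely supervised training and purely self-supervised training.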
Demonstrated effectiveness on real-world monocular video datasets.
Abstract
For a robot deployed in the world, it is desirable to have the ability of autonomous learning to improve its initial pre-set knowledge. We formalize this as a bootstrapped self-supervised learning problem where a system is initially bootstrapped with supervised training on a labeled dataset and we look for a self-supervised training method that can subsequently improve the system over the supervised training baseline using only unlabeled data. In this work, we leverage temporal consistency between frames in monocular video to perform this bootstrapped self-supervised training. We show that a well-trained state-of-the-art semantic segmentation network can be further improved through our method. In addition, we show that the bootstrapped self-supervised training framework can help a network learn depth estimation better than pure supervised training or self-supervised training.
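The abstract does not spell out the training loss, so the following is only a minimal sketch of what a temporal-consistency objective between two video frames could look like, not the authors' formulation. It assumes the prediction for frame t+1 has already been warped into frame t (for example via estimated depth and camera motion), and that a mask marks pixels with a valid correspondence. The function name `temporal_consistency_loss` and all parameter names are hypothetical.

```python
import numpy as np

def temporal_consistency_loss(probs_t, probs_next_warped, valid_mask, eps=1e-8):
    """Cross-entropy between frame-t pseudo-labels and warped frame-(t+1) predictions.

    probs_t, probs_next_warped: (H, W, C) per-pixel class probabilities.
    valid_mask: (H, W) boolean array, True where the warp found a correspondence.
    All names here are illustrative assumptions, not taken from the paper.
    """
    # Treat the most likely class at frame t as a pseudo-label for each pixel.
    pseudo_labels = probs_t.argmax(axis=-1)
    # Probability the warped frame-(t+1) prediction assigns to that pseudo-label.
    picked = np.take_along_axis(
        probs_next_warped, pseudo_labels[..., None], axis=-1
    )[..., 0]
    per_pixel = -np.log(picked + eps)
    # With no valid correspondences there is nothing to penalize.
    if not valid_mask.any():
        return 0.0
    return float(per_pixel[valid_mask].mean())
```

In a bootstrapped setup of the kind the abstract describes, the supervised model would supply `probs_t` and `probs_next_warped` on unlabeled video, and a loss like this would be minimized to fine-tune it; how the paper actually combines segmentation and depth in this step is not stated here.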
Taxonomy
Topics: Advanced Vision and Imaging · Image Processing Techniques and Applications · Domain Adaptation and Few-Shot Learning
