On Moving Object Segmentation from Monocular Video with Transformers
Christian Homeyer, Christoph Schnörr

TL;DR
This paper introduces M3Former, a transformer-based architecture for monocular moving object segmentation that effectively fuses appearance and motion features, and analyzes motion representations and training data diversity for optimal performance.
Contribution
The paper presents a novel transformer-based fusion architecture for monocular motion segmentation and provides a systematic analysis of motion representations and dataset diversity impacts.
Findings
M3Former achieves state-of-the-art performance on KITTI and DAVIS datasets.
Different 2D and 3D motion representations significantly affect segmentation accuracy.
Diverse training datasets are crucial for optimal monocular motion segmentation.
Abstract
Moving object detection and segmentation from a single moving camera is a challenging task, requiring an understanding of recognition, motion and 3D geometry. Combining recognition and reconstruction boils down to a fusion problem, where appearance and motion features need to be combined for classification and segmentation. In this paper, we present M3Former, a novel fusion architecture for monocular motion segmentation that leverages the strong performance of transformers for segmentation and multi-modal fusion. As reconstructing motion from monocular video is ill-posed, we systematically analyze different 2D and 3D motion representations for this problem and their importance for segmentation performance. Finally, we analyze the effect of training data and show that diverse datasets are required to achieve state-of-the-art performance on KITTI and DAVIS.
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection · Video Surveillance and Tracking Methods
