RESfM: Robust Deep Equivariant Structure from Motion
Fadi Khatib, Yoni Kasten, Dror Moran, Meirav Galun, Ronen Basri

TL;DR
RESfM introduces a deep learning architecture for Structure from Motion that effectively handles outliers, improving accuracy in large, realistic image collections by combining equivariant classification and robust optimization.
Contribution
It presents a novel deep equivariant architecture with outlier classification and robust bundle adjustment for more realistic SfM scenarios.
Findings
Achieves state-of-the-art accuracy in large, outlier-prone datasets.
Outperforms existing deep SfM methods in realistic settings.
Matches classical methods' performance in challenging conditions.
Abstract
Multiview Structure from Motion is a fundamental and challenging computer vision problem. A recent deep-based approach utilized matrix equivariant architectures for simultaneous recovery of camera pose and 3D scene structure from large image collections. That work, however, made the unrealistic assumption that the point tracks given as input are almost clean of outliers. Here, we propose an architecture suited to dealing with outliers by adding a multiview inlier/outlier classification module that respects the model equivariance and by utilizing a robust bundle adjustment step. Experiments demonstrate that our method can be applied successfully in realistic settings that include large image collections and point tracks extracted with common heuristics that include many outliers, achieving state-of-the-art accuracies in almost all runs, superior to existing deep-based methods and on-par…
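The idea of discarding outlier observations before bundle adjustment can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`reprojection_errors`, `filter_observations`), the pixel threshold `max_err`, and the minimum track length `min_track_len` are all assumptions chosen for illustration. The sketch assumes cameras are given as 3x4 projection matrices and that each point track is one column of an observation tensor with a visibility mask.

```python
import numpy as np

def reprojection_errors(P, X, x_obs, visible):
    """Pixel reprojection error for every (camera, point) pair.

    P: (m, 3, 4) projection matrices, X: (n, 3) 3D points,
    x_obs: (m, n, 2) observed 2D points, visible: (m, n) bool mask.
    Entries that are not visible, lie at non-positive depth, or are
    otherwise non-finite get an error of +inf.
    """
    Xh = np.hstack([X, np.ones((X.shape[0], 1))])   # homogeneous points, (n, 4)
    proj = np.einsum('mij,nj->mni', P, Xh)          # projected points, (m, n, 3)
    depth = proj[..., 2]
    with np.errstate(divide='ignore', invalid='ignore'):
        x_hat = proj[..., :2] / depth[..., None]
        err = np.linalg.norm(x_hat - x_obs, axis=-1)
    valid = visible & (depth > 0) & np.isfinite(err)
    return np.where(valid, err, np.inf)

def filter_observations(P, X, x_obs, visible, max_err=2.0, min_track_len=3):
    """Hypothetical filter: drop observations with high reprojection error,
    then drop whole tracks that are left with too few observations."""
    err = reprojection_errors(P, X, x_obs, visible)
    keep = visible & (err <= max_err)
    track_len = keep.sum(axis=0)                    # surviving views per point
    keep[:, track_len < min_track_len] = False
    return keep
```

The returned mask can be passed to any bundle adjustment routine so that only the kept observations contribute residuals. Repeating the filter after each refinement round would give a multi-step scheme, though the schedule and thresholds here are assumptions rather than values from the paper.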
Peer Reviews
Decision: ICLR 2025 Poster
1. This paper improves the existing ESfM framework by integrating a new inlier-outlier classification branch, alongside a robust structure from motion (SfM) mechanism. This approach makes sense, given that outliers represent a primary challenge for SfM methods in real-world applications. Quantitative experiments demonstrate the efficacy of these components. 2. This paper conducts extensive experiments over various datasets to support its claims.
1. The main concern about this work is the novelty of the proposed framework. Compared to ESfM, the new designs are just (1) a simple classification branch to identify inliers and outliers, which performs simple binary classification, and (2) robust BA considering high projection error, point track length, and multi-step refinement. Both designs have long been proven effective and are not new techniques introduced by this work. For instance, systems like COLMAP, Theia, and VGGSfM inco
The paper is well written, with the clear objective of robustifying the method proposed in [1]. The overall content is coherent and easy to understand and follow. The proposed inlier-outlier prediction head shows improvement compared to ESfM [1]. Results across many scenes of indoor and outdoor datasets show that the proposed method improves over [1] and achieves results comparable to the best classical SfM methods. The authors use unsupervised reprojection losses. This means t
The authors mention that they did not use MASt3R [2] as one of the baseline methods because it does not work with large numbers of images. However, I do think MASt3R [2] is able to work with large sets of images. At least in the Strecha and BlendedMVS experiments, the authors should be able to use MASt3R for full 3D reconstruction. What is not clear from the paper is whether the other methods use the Robust Bundle Adjustment proposed in this work. I agree robust BA is necessary for accurate reconstruction, but the cont
The paper is well-presented in general and easy to understand, with a comprehensive literature review and extensive experiments.
The paper studies an interesting and practical problem, but the main weakness is its limited contribution. 1. The proposed method is largely built on an existing network architecture, the only difference being that outlier classification output channels are added. 2. The proposed solution for handling outliers incurs much additional overhead in the recursive "finetune" process, lacks theoretical justification, compromises inlier recall, and cannot reject outliers well (according to Table 7). Th
Taxonomy
Topics: Advanced Vision and Imaging · Image and Video Stabilization · Video Surveillance and Tracking Methods
