TL;DR
AlignPose is a multi-view 6D object pose estimation method that leverages feature-metric alignment across views, eliminating the need for object-specific training and improving performance on challenging datasets.
Contribution
The paper introduces AlignPose, a novel multi-view pose estimation approach that aggregates information across views without requiring object-specific training or symmetry annotations.
Findings
Outperforms existing methods on six datasets in the BOP benchmark.
Effectively handles clutter, occlusions, and unseen objects.
Excels particularly on industrial datasets with multiple views.
Abstract
Single-view RGB model-based object pose estimation methods achieve strong generalization but are fundamentally limited by depth ambiguity, clutter, and occlusions. Multi-view pose estimation methods have the potential to solve these issues, but existing works rely on precise single-view pose estimates or lack generalization to unseen objects. We address these challenges via the following three contributions. First, we introduce AlignPose, a 6D object pose estimation method that aggregates information from multiple extrinsically calibrated RGB views and does not require any object-specific training or symmetry annotation. Second, the key component of this approach is a new multi-view feature-metric refinement specifically designed for object pose. It optimizes a single, consistent world-frame object pose by minimizing the feature discrepancy between on-the-fly rendered object features…
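The abstract describes optimizing one world-frame object pose against all views at once, and the reviews below name Levenberg-Marquardt (LM) as the solver. The following is a minimal sketch of that idea under loud assumptions, not the authors' implementation: the pose is reduced to a 3-DoF translation, and 2D reprojection error of a few known object points stands in for the feature discrepancy between rendered and observed features. All names (`project`, `residuals`, `solve3`, `lm_refine`) and all numeric values are hypothetical.

```python
def project(R, t_cw, p_world):
    """Pinhole projection (focal length 1) of a world point into one camera."""
    x, y, z = (sum(R[i][k] * p_world[k] for k in range(3)) + t_cw[i] for i in range(3))
    return x / z, y / z

def residuals(t_wo, views, points, observations):
    """Stack the residuals of ALL views for one shared world-frame translation."""
    r = []
    for (R, t_cw), obs in zip(views, observations):
        for p, (u, v) in zip(points, obs):
            pu, pv = project(R, t_cw, [p[k] + t_wo[k] for k in range(3)])
            r += [pu - u, pv - v]  # stand-in for a per-pixel feature difference
    return r

def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(3):
        piv = max(range(c, 3), key=lambda i: abs(M[i][c]))
        M[c], M[piv] = M[piv], M[c]
        for i in range(c + 1, 3):
            f = M[i][c] / M[c][c]
            for k in range(c, 4):
                M[i][k] -= f * M[c][k]
    x = [0.0, 0.0, 0.0]
    for i in reversed(range(3)):
        x[i] = (M[i][3] - sum(M[i][k] * x[k] for k in range(i + 1, 3))) / M[i][i]
    return x

def lm_refine(t0, views, points, observations, iters=50, lam=1e-3, eps=1e-6):
    """Damped Gauss-Newton (LM) on the translation, with a finite-difference Jacobian."""
    t = list(t0)
    r = residuals(t, views, points, observations)
    cost = sum(v * v for v in r)
    for _ in range(iters):
        cols = []  # Jacobian columns, one per translation component
        for k in range(3):
            tp = t[:]
            tp[k] += eps
            rp = residuals(tp, views, points, observations)
            cols.append([(a - b) / eps for a, b in zip(rp, r)])
        # Normal equations with damping: (J^T J + lam * I) step = -J^T r
        A = [[sum(a * b for a, b in zip(cols[i], cols[j])) + (lam if i == j else 0.0)
              for j in range(3)] for i in range(3)]
        g = [-sum(a * b for a, b in zip(cols[i], r)) for i in range(3)]
        step = solve3(A, g)
        t_new = [a + s for a, s in zip(t, step)]
        r_new = residuals(t_new, views, points, observations)
        cost_new = sum(v * v for v in r_new)
        if cost_new < cost:   # accept the step and trust the model more
            t, r, cost, lam = t_new, r_new, cost_new, lam / 10
        else:                 # reject the step and increase damping
            lam *= 10
    return t, cost

# Two calibrated views (world -> camera): one frontal, one rotated 90 degrees about y.
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
RY90 = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]
views = [(I3, [0.0, 0.0, 2.0]), (RY90, [0.0, 0.0, 2.0])]
points = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1], [-0.1, -0.1, 0.0]]
t_true = [0.05, -0.03, 0.02]  # synthetic ground truth used to generate observations
observations = [[project(R, t, [p[k] + t_true[k] for k in range(3)]) for p in points]
                for R, t in views]
t_est, final_cost = lm_refine([0.0, 0.0, 0.0], views, points, observations)
```

In this toy setup the second view constrains the direction that is only weakly observable from the first (its depth axis), which is the intuition behind aggregating views into a single consistent pose.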
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The method generalizes well to unseen objects, since it does not need to be trained on them.
- The performance is much stronger than baselines such as CosyPose on the BOP benchmarks.
- The paper presentation is good.
- It seems that the paper only leverages features from existing vision foundation models (e.g. DINOv2) to run LM optimization of a feature loss. The paper does not train any new models. This is OK if the method works, but the contribution seems limited.
- It would be interesting to see how the method performs with different vision foundation models.
- There are some other related works that also use LM optimization on visual features to do object pose es
1. The authors perform rigorous evaluation on multiple datasets and find that their method consistently outperforms competing approaches.
2. The method is easy to follow from the equations as written. The approach seems like a reasonable thing to try.
3. Strong results are shown on both seen and unseen object categories, showing that the method generalizes well provided the initial coarse pose estimates are in the ballpark of the right answer.
1. This idea of using DINO feature space metrics for bundle adjustment has already been explored in other contexts like structure from motion (see [1, 2] below). In fact, it seems like the equations in that paper are more or less equivalent to what is proposed here. I don’t think that paper is cited.
2. The contribution seems a bit narrow here. I think this idea has been known for a while now, and is an integral part of standard bundle adjustment pipelines. It’s just something that one would d
This paper introduces AlignPose, a refinement method for unseen object pose estimation. It optimizes a consistent object pose in the world frame, jointly utilizing initial pose estimates from multiple views. AlignPose introduces a multi-view feature-metric alignment loss with non-maximum suppression, which optimizes the object pose by aligning rendered object features with real images. The refined pose is obtained by using a Levenberg-Marquardt optimization algorithm, ensuring the robustness o
The problem formulation is not clear enough. To my current understanding, the authors decompose the object pose $T_{CO}$ into two transformations, $T_{CW}$ and $T_{WO}$. This is a bit confusing since we often assume that the world frame and object frame are aligned in object pose estimation. Otherwise, it is unclear how to define the world frame beyond the object frame. A more detailed explanation would be important to improve clarity and help readers better understand the problem setup. The co
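One way to read the decomposition questioned above, assuming the world frame is simply the frame in which the camera extrinsics are calibrated: every view has its own known $T_{CW}$, all views share a single unknown $T_{WO}$, and each per-view object pose follows as $T_{CO} = T_{CW} T_{WO}$. The sketch below illustrates only this composition; the helper names and numeric values are hypothetical and are not taken from the paper.

```python
import math

def matmul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def make_transform(yaw, t):
    """4x4 rigid transform: rotation by `yaw` radians about z, plus translation t."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, t[0]],
            [s, c, 0.0, t[1]],
            [0.0, 0.0, 1.0, t[2]],
            [0.0, 0.0, 0.0, 1.0]]

# One shared object pose in the world frame (the quantity a refinement would optimize).
T_WO = make_transform(0.0, (0.1, 0.0, 0.0))

# Known world -> camera extrinsics for two calibrated cameras (illustrative values).
T_CW = {
    "cam0": make_transform(0.0, (0.0, 0.0, 1.0)),
    "cam1": make_transform(math.pi / 2, (0.0, 0.0, 2.0)),
}

# Per-view object pose in each camera frame: T_CO = T_CW * T_WO.
T_CO = {name: matmul(T, T_WO) for name, T in T_CW.items()}
```

Under this reading, the world frame coincides with the object frame only in the special case where $T_{WO}$ is the identity, and a single $T_{WO}$ keeps all per-view poses mutually consistent by construction.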
++ The performance gains on YCB-V, T-LESS, and ITODD-MV demonstrate the effectiveness of AlignPose's multi-view refinement strategy over existing methods.
++ This paper introduces a straightforward adaptation of FoundPose's feature-metric refinement to a multi-view setting, integrated with 3D NMS, and demonstrates its effectiveness with promising results.
-- The work lacks an ablation study to justify the choice of the robust cost function, including a comparison with alternatives and an analysis of its hyperparameters.
-- The experimental results lack data on time or speed. Analyzing the runtime is essential for understanding the practicality of this method.
-- The 3D NMS method used in this paper lacks a comparison with the translation-based 3D NMS in FreeZev2 [a].
-- The methodological innovation is somewhat limited. The multi-view feature-
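For readers unfamiliar with the term, a translation-based 3D NMS can be sketched generically as greedy suppression of pose hypotheses whose translations lie too close to a higher-scoring one. This is a textbook-style illustration under assumed inputs (score and world-frame translation per hypothesis, a hypothetical `min_dist` threshold); it is neither the 3D NMS used in AlignPose nor the one in FreeZev2.

```python
import math

def translation_nms(hypotheses, min_dist):
    """Greedy NMS over (score, translation) pairs.

    Hypotheses are visited from highest to lowest score; one is kept only if
    its translation is at least `min_dist` away from every kept hypothesis.
    """
    kept = []
    for score, t in sorted(hypotheses, key=lambda h: h[0], reverse=True):
        if all(math.dist(t, kept_t) >= min_dist for _, kept_t in kept):
            kept.append((score, t))
    return kept

# Illustrative hypotheses: (confidence score, world-frame translation in meters).
hyps = [
    (0.9, (0.00, 0.00, 1.00)),
    (0.8, (0.01, 0.00, 1.00)),  # 1 cm from the best hypothesis, so suppressed
    (0.7, (0.30, 0.00, 1.00)),  # 30 cm away, so kept as a separate instance
]
kept = translation_nms(hyps, min_dist=0.05)
```

A distance threshold on translation alone ignores rotation, which is one reason a comparison between NMS variants, as the reviewer requests, could be informative.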
Taxonomy
Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Robotics and Sensor-Based Localization
