SDT-6D: Fully Sparse Depth-Transformer for Staged End-to-End 6D Pose Estimation in Industrial Multi-View Bin Picking

Nico Leuze; Maximilian Hoh; Samed Doğan; Nicolas R.-Peña; Alfred Schoettl

arXiv:2512.08430 · cs.CV · December 10, 2025

Open Access

TL;DR

This paper presents SDT-6D, a fully sparse depth-transformer framework for precise 6D pose estimation in cluttered industrial environments, leveraging multi-view depth fusion, scene-adaptive attention, and a novel voting strategy.

Contribution

It introduces a staged heatmap mechanism and a density-aware sparse transformer for high-resolution, sparse 3D data processing in 6D pose estimation.

Findings

01. Achieves competitive accuracy on IPD and MV-YCB datasets.

02. Effectively handles occlusions and clutter in industrial bin picking.

03. Operates efficiently with high-resolution sparse volumetric data.

Abstract

Accurately recovering 6D poses in densely packed industrial bin-picking environments remains a serious challenge, owing to occlusions, reflections, and textureless parts. We introduce a holistic depth-only 6D pose estimation approach that fuses multi-view depth maps into either a fine-grained 3D point cloud (the vanilla version) or a sparse Truncated Signed Distance Field (TSDF). At the core of our framework lies a staged heatmap mechanism that yields scene-adaptive attention priors across different resolutions, steering computation toward foreground regions and thus keeping memory requirements feasible at high resolutions. In addition, we propose a density-aware sparse transformer block that dynamically attends to (self-) occlusions and the non-uniform distribution of 3D data. While sparse 3D approaches have proven effective for long-range perception, their potential in close-range robotic…
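
The multi-view fusion step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a pinhole camera model with known intrinsics and camera-to-world poses, and it replaces the TSDF with a plain sparse occupancy count per voxel. The names `backproject` and `fuse_views` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]       # row-major matrix
Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int]
Intrinsics = Tuple[float, float, float, float]  # fx, fy, cx, cy (assumed pinhole model)


def backproject(depth: Matrix, fx: float, fy: float, cx: float, cy: float,
                cam_to_world: Matrix) -> List[Point]:
    """Lift a depth map (0 or negative = invalid) to 3D points in the world frame."""
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0:
                continue  # skip pixels without a valid depth reading
            # Pinhole back-projection into the camera frame (homogeneous).
            p = ((u - cx) * z / fx, (v - cy) * z / fy, z, 1.0)
            # Apply the 4x4 camera-to-world pose; keep the first three rows.
            world = tuple(
                sum(cam_to_world[r][c] * p[c] for c in range(4)) for r in range(3)
            )
            points.append(world)
    return points


def fuse_views(views: Sequence[Tuple[Matrix, Intrinsics, Matrix]],
               voxel_size: float) -> Dict[Voxel, int]:
    """Fuse several depth views into a sparse voxel map.

    Only voxels hit by at least one point are stored, so memory grows with
    the observed surface rather than with the full volume. The value is the
    number of points that fell into the voxel (a stand-in for a TSDF value).
    """
    occupied: Dict[Voxel, int] = {}
    for depth, (fx, fy, cx, cy), pose in views:
        for x, y, z in backproject(depth, fx, fy, cx, cy, pose):
            key = (math.floor(x / voxel_size),
                   math.floor(y / voxel_size),
                   math.floor(z / voxel_size))
            occupied[key] = occupied.get(key, 0) + 1
    return occupied
```

Storing only occupied voxels in a dictionary mirrors the idea of a sparse representation: empty space costs nothing, which is what makes fine voxel sizes affordable in this toy setting.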

Taxonomy

Topics: Robot Manipulation and Learning · Robotics and Sensor-Based Localization · 3D Shape Modeling and Analysis