# Model-Free Transformer Framework for 6-DoF Pose Estimation of Textureless Tableware Objects

**Authors:** Jungwoo Lee, Hyogon Kim, Ji-Wook Kwon, Sung-Jo Yun, Na-Hyun Lee, Young-Ho Choi, Goobong Chung, Jinho Suh

PMC · DOI: 10.3390/s25196167 · 2025-10-05

## TL;DR

This paper introduces a method for estimating the 6-DoF pose (3D position and orientation) of textureless tableware from depth data with a transformer encoder, and validates it on a mobile robot that autonomously recognizes and collects tableware.

## Contribution

A model-free, texture-free 6-DoF pose estimation framework that feeds geometry-based features extracted from depth images into a transformer encoder.

## Key findings

- Across ten types of tableware, the method achieves an average rotational error of 3.53 degrees and an average translational error of 13.56 mm.
- Real-world experiments show successful autonomous recognition and collection of tableware by a mobile robot.
- Geometry-based features such as surface vertices and rim normals provide strong structural priors for pose estimation.
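
The summary does not say how the rotational and translational errors are defined. As a minimal sketch, assuming the common choice of geodesic angle between rotation matrices and Euclidean distance between translation vectors, the two metrics could be computed as follows. The function names and the `rot_z` helper are illustrative and do not come from the paper.

```python
import math

def rotation_error_deg(R_pred, R_true):
    """Geodesic angle (degrees) between two 3x3 rotation matrices.

    Assumed metric: theta = acos((trace(R_pred^T R_true) - 1) / 2).
    """
    # trace(A^T B) equals the sum of element-wise products of A and B.
    trace = sum(R_pred[i][j] * R_true[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))

def translation_error(t_pred, t_true):
    """Euclidean distance between two translation vectors (same unit as input)."""
    return math.dist(t_pred, t_true)

def rot_z(deg):
    """Hypothetical helper: rotation matrix for a rotation of `deg` about z."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
```

With these definitions, comparing `rot_z(10)` against `rot_z(0)` gives an error of about 10 degrees, and translations given in millimetres yield an error in millimetres. Rotationally symmetric objects such as plates and bowls may call for a symmetry-aware metric instead; the summary does not state which one the authors use.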

## Abstract

Tableware objects such as plates, bowls, and cups are usually textureless, uniform in color, and vary widely in shape, making it difficult to apply conventional pose estimation methods that rely on texture cues or object-specific CAD models. These limitations present a significant obstacle to robotic manipulation in restaurant environments, where reliable six-degree-of-freedom (6-DoF) pose estimation is essential for autonomous grasping and collection. To address this problem, we propose a model-free and texture-free 6-DoF pose estimation framework based on a transformer encoder architecture. This method uses only geometry-based features extracted from depth images, including surface vertices and rim normals, which provide strong structural priors. The pipeline begins with object detection and segmentation using a pretrained video foundation model, followed by the generation of uniformly partitioned grids from depth data. For each grid cell, centroid positions and surface normals are computed and processed by a transformer-based model that jointly predicts object rotation and translation. Experiments with ten types of tableware demonstrate that the method achieves an average rotational error of 3.53 degrees and a translational error of 13.56 mm. Real-world deployment on a mobile robot platform with a manipulator further validated its ability to autonomously recognize and collect tableware, highlighting the practicality of the proposed geometry-driven approach for service robotics.
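
The per-cell feature step described in the abstract can be sketched roughly as below. This is an illustration under stated assumptions, not the authors' implementation: the input is assumed to be a list of 3D points already back-projected from the segmented depth region, the grid is assumed to be a uniform `n` by `n` partition of the x-y bounding box, and the normal is estimated by a least-squares plane fit, which the abstract does not specify. The names `grid_features` and `fit_normal` are hypothetical.

```python
import math

def fit_normal(pts, c):
    """Unit normal of the least-squares plane z = a*x + b*y + d through pts.

    Returns None when the cell is degenerate (too few or collinear points).
    """
    sxx = sxy = syy = sxz = syz = 0.0
    for x, y, z in pts:
        dx, dy, dz = x - c[0], y - c[1], z - c[2]
        sxx += dx * dx; sxy += dx * dy; syy += dy * dy
        sxz += dx * dz; syz += dy * dz
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        return None
    # Solve the 2x2 normal equations for the plane slopes (Cramer's rule).
    a = (sxz * syy - sxy * syz) / det
    b = (sxx * syz - sxy * sxz) / det
    norm = math.sqrt(a * a + b * b + 1.0)
    return (-a / norm, -b / norm, 1.0 / norm)

def grid_features(points, n=4):
    """Map each non-empty grid cell (i, j) to its (centroid, normal) pair."""
    if not points:
        return {}
    x0 = min(p[0] for p in points); x1 = max(p[0] for p in points)
    y0 = min(p[1] for p in points); y1 = max(p[1] for p in points)
    # Cell widths; fall back to 1.0 when all points share a coordinate.
    wx = ((x1 - x0) / n) or 1.0
    wy = ((y1 - y0) / n) or 1.0
    cells = {}
    for p in points:
        # Clamp so points on the upper boundary land in the last cell.
        i = min(int((p[0] - x0) / wx), n - 1)
        j = min(int((p[1] - y0) / wy), n - 1)
        cells.setdefault((i, j), []).append(p)
    feats = {}
    for key, pts in cells.items():
        m = len(pts)
        centroid = tuple(sum(p[k] for p in pts) / m for k in range(3))
        feats[key] = (centroid, fit_normal(pts, centroid))
    return feats
```

Each cell's centroid and normal would then form one input token for the transformer encoder. The plane fit used here assumes the surface is not close to vertical in the camera frame; a PCA-based normal would be a more robust substitute for steep cup walls.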

## Figures

20 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12526730/full.md

---
Source: https://tomesphere.com/paper/PMC12526730