QORT-Former: Query-optimized Real-time Transformer for Understanding Two Hands Manipulating Objects

Elkhan Ismayilzada; MD Khalequzzaman Chowdhury Sayem; Yihalem Yimolal Tiruneh; Mubarrat Tajoar Chowdhury; Muhammadjon Boboev; Seungryul Baek

arXiv:2502.19769 · cs.CV · February 28, 2025

TL;DR

QORT-Former is a novel real-time Transformer framework that efficiently estimates 3D poses of two hands and an object, significantly improving accuracy and speed for AR/VR applications.

Contribution

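This paper introduces what the authors describe as the first real-time Transformer-based framework for 3D pose estimation of two hands and an object. It limits the number of queries and decoders for efficiency and optimizes the queries fed to the Transformer decoder for accuracy.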

Findings

01. Achieved 53.5 FPS on an RTX 3090 Ti GPU.

02. Surpassed state-of-the-art accuracy on H2O and FPHA datasets.

03. Set new benchmarks in interaction recognition accuracy.
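
To put the reported throughput in terms of a per-frame time budget, the short Python sketch below converts frames per second into average milliseconds per frame. The 53.5 FPS figure comes from the findings above; the 30 FPS reference rate is an illustrative assumption and is not taken from the paper, and the helper name `frame_time_ms` is hypothetical.

```python
def frame_time_ms(fps: float) -> float:
    """Convert throughput in frames per second to average milliseconds per frame."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


REPORTED_FPS = 53.5          # throughput reported in the findings above
ASSUMED_REFERENCE_FPS = 30.0  # assumption for illustration only, not from the source

latency = frame_time_ms(REPORTED_FPS)
budget = frame_time_ms(ASSUMED_REFERENCE_FPS)

print(f"{latency:.1f} ms per frame at {REPORTED_FPS} FPS")          # 18.7 ms per frame at 53.5 FPS
print(f"{budget - latency:.1f} ms of headroom against {ASSUMED_REFERENCE_FPS:.0f} FPS")  # 14.6 ms
```

Under that assumed reference rate, the reported speed leaves roughly 14.6 ms per frame for other work in a pipeline; a different target rate changes the headroom accordingly.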

Abstract

Significant advancements have been achieved in the realm of understanding poses and interactions of two hands manipulating an object. The emergence of augmented reality (AR) and virtual reality (VR) technologies has heightened the demand for real-time performance in these applications. However, current state-of-the-art models often exhibit promising results at the expense of substantial computational overhead. In this paper, we present a query-optimized real-time Transformer (QORT-Former), the first Transformer-based real-time framework for 3D pose estimation of two hands and an object. We first limit the number of queries and decoders to meet the efficiency requirement. Given a limited number of queries and decoders, we propose to optimize the queries, which are taken as input to the Transformer decoder, to secure better accuracy: (1) we propose to divide queries into three types (a left hand…

Taxonomy

Topics: Human Pose and Action Recognition · Robot Manipulation and Learning · Hand Gesture Recognition Systems

Methods: Absolute Position Encodings · Dense Connections · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Label Smoothing · Attention Is All You Need · Multi-Head Attention