Pose-RFT: Enhancing MLLMs for 3D Pose Generation via Hybrid Action Reinforcement Fine-Tuning

Bao Li; Xiaomei Zhang; Miao Xu; Zhaoxin Fan; Xiangyu Zhu; Zhen Lei

arXiv:2508.07804 · cs.CV · August 12, 2025

Open Access

TL;DR

Pose-RFT introduces a hybrid reinforcement learning framework that significantly improves 3D human pose generation in multimodal large language models by jointly optimizing discrete and continuous outputs with task-specific rewards.

Contribution

It proposes Pose-RFT, a reinforcement fine-tuning approach built on the HyGRPO algorithm, to improve 3D pose generation in multimodal models by addressing the ambiguity and task-alignment limitations of supervised training.

Findings

01. Outperforms existing pose-specific MLLMs on multiple benchmarks.

02. Effectively models spatial and semantic correspondences in 3D pose generation.

03. Demonstrates the benefits of hybrid action reinforcement learning for multimodal tasks.

Abstract

Generating 3D human poses from multimodal inputs such as images or text requires models to capture both rich spatial and semantic correspondences. While pose-specific multimodal large language models (MLLMs) have shown promise in this task, they are typically trained with supervised objectives such as SMPL parameter regression or token-level prediction, which struggle to model the inherent ambiguity and achieve task-specific alignment required for accurate 3D pose generation. To address these limitations, we propose Pose-RFT, a reinforcement fine-tuning framework tailored for 3D human pose generation in MLLMs. We formulate the task as a hybrid action reinforcement learning problem that jointly optimizes discrete language prediction and continuous pose generation. To this end, we introduce HyGRPO, a hybrid reinforcement learning algorithm that performs group-wise reward normalization…
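
The group-wise reward normalization mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common GRPO-style form, in which each sampled response's reward is standardized against the mean and standard deviation of its own group. The abstract is truncated here, so HyGRPO's exact formulation, including how discrete and continuous actions are combined, is not shown; the function name, the `eps` parameter, and the reward values below are hypothetical.

```python
from statistics import fmean, pstdev


def group_normalized_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses.

    Assumption: a GRPO-style normalization, (r - mean) / (std + eps).
    This is an illustration, not the paper's confirmed implementation.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four responses sampled from the same prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_normalized_advantages(rewards)
```

Responses that score above their group's mean receive positive advantages and those below receive negative ones, so the signal depends on relative quality within the group rather than on the absolute reward scale.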

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Human Pose and Action Recognition