Pose-RFT: Enhancing MLLMs for 3D Pose Generation via Hybrid Action Reinforcement Fine-Tuning
Bao Li, Xiaomei Zhang, Miao Xu, Zhaoxin Fan, Xiangyu Zhu, Zhen Lei

TL;DR
Pose-RFT introduces a hybrid reinforcement learning framework that significantly improves 3D human pose generation in multimodal large language models by jointly optimizing discrete and continuous outputs with task-specific rewards.
Contribution
The paper proposes Pose-RFT, a reinforcement fine-tuning approach built on the HyGRPO algorithm, to improve 3D pose generation in multimodal models by addressing the ambiguity and task-alignment problems of supervised training.
Findings
Outperforms existing pose-specific MLLMs on multiple benchmarks.
Effectively models spatial and semantic correspondences in 3D pose generation.
Demonstrates the benefits of hybrid action reinforcement learning for multimodal tasks.
Abstract
Generating 3D human poses from multimodal inputs such as images or text requires models to capture both rich spatial and semantic correspondences. While pose-specific multimodal large language models (MLLMs) have shown promise in this task, they are typically trained with supervised objectives such as SMPL parameter regression or token-level prediction, which struggle to model the inherent ambiguity and achieve task-specific alignment required for accurate 3D pose generation. To address these limitations, we propose Pose-RFT, a reinforcement fine-tuning framework tailored for 3D human pose generation in MLLMs. We formulate the task as a hybrid action reinforcement learning problem that jointly optimizes discrete language prediction and continuous pose generation. To this end, we introduce HyGRPO, a hybrid reinforcement learning algorithm that performs group-wise reward normalization…
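The abstract is truncated before it describes HyGRPO in detail, so the following is only a minimal sketch of the two ideas it names: group-wise reward normalization and a hybrid (discrete plus continuous) action. It assumes the common GRPO-style convention of standardizing each reward against the mean and standard deviation of its sampling group, and it assumes the continuous pose output is scored with a diagonal Gaussian density. The function names, the `eps` parameter, and the Gaussian assumption are hypothetical illustrations, not details confirmed by the paper.

```python
import math
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampling group (GRPO-style assumption).

    rewards: list of scalar rewards, one per response sampled for the same prompt.
    Returns one advantage per response: (r - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def hybrid_log_prob(token_log_probs, pose, pose_mean, pose_std):
    """Log-probability of a hybrid action (hypothetical formulation).

    Discrete part: sum of per-token log-probabilities of the text output.
    Continuous part: diagonal Gaussian log-density of the pose parameters.
    """
    discrete = sum(token_log_probs)
    continuous = sum(
        -0.5 * ((x - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)
        for x, m, s in zip(pose, pose_mean, pose_std)
    )
    return discrete + continuous


# Example: three sampled responses for one prompt, scored by some task reward.
advantages = group_normalized_advantages([1.0, 2.0, 3.0])
```

Under these assumptions, responses that score above their group average receive a positive advantage and those below receive a negative one, so the policy update depends on relative rather than absolute reward within each group.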
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Human Pose and Action Recognition
