TL;DR
AlphaGRPO is a framework that improves multimodal generation in Unified Multimodal Models (UMMs) by applying Group Relative Policy Optimization (GRPO) with decomposed, verifiable reward feedback, which strengthens the models' reasoning, self-reflective refinement, and editing capabilities.
Contribution
The paper introduces AlphaGRPO, which combines Group Relative Policy Optimization (GRPO) with a new Decompositional Verifiable Reward (DVReward) to provide stable, interpretable supervision for multimodal generation.
Findings
Achieves robust improvements on multiple multimodal benchmarks.
Enhances reasoning and self-reflective capabilities in UMMs.
Improves performance on editing tasks without editing-specific training.
Abstract
In this paper, we propose AlphaGRPO, a novel framework that applies Group Relative Policy Optimization (GRPO) to AR-Diffusion Unified Multimodal Models (UMMs) to enhance multimodal generation capabilities without an additional cold-start stage. Our approach unlocks the model's intrinsic potential to perform advanced reasoning tasks: Reasoning Text-to-Image Generation, where the model actively infers implicit user intents, and Self-Reflective Refinement, where it autonomously diagnoses and corrects misalignments in generated outputs. To address the challenge of providing stable supervision for real-world multimodal generation, we introduce the Decompositional Verifiable Reward (DVReward). Unlike holistic scalar rewards, DVReward utilizes an LLM to decompose complex user requests into atomic, verifiable semantic and quality questions, which are then evaluated by a general MLLM to provide…
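The interplay between a decomposed reward and GRPO's group-relative advantage can be illustrated with a minimal Python sketch. It is not the paper's implementation. The abstract is truncated before it says how the per-question verdicts are combined, so the unweighted mean used in `decomposed_reward` is an assumption, as are the function names and the toy verdicts. The advantage step follows the standard GRPO normalization (reward minus group mean, divided by group standard deviation); the MLLM verifier is replaced here by hard-coded 0/1 answers.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def decomposed_reward(verdicts: Sequence[float]) -> float:
    """Collapse per-question verifier verdicts (each in [0, 1]) into one reward.

    ASSUMPTION: an unweighted mean. The source does not state the
    aggregation rule, so this is only a placeholder choice.
    """
    if not verdicts:
        raise ValueError("at least one atomic question is required")
    return mean(verdicts)


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standard GRPO advantage: normalize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group: four sampled outputs for one prompt, each checked against the
# same four atomic yes/no questions (1 = verified, 0 = failed).
# These verdicts are hypothetical stand-ins for MLLM answers.
group_verdicts = [
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
]

rewards = [decomposed_reward(v) for v in group_verdicts]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f}  advantage={a:+.3f}")
```

Because the advantage is computed relative to the group, a sample is reinforced only when it satisfies more of the atomic questions than its siblings; if every sample in a group receives the same reward, all advantages are zero and that group contributes no gradient signal.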
