Latent Action Control for Reasoning-Guided Unified Image Generation

Fuxiang Zhai; Sixiang Chen; Yingjin Li; Shuaibo Li; Jianyu Lai; Tengjun Huang; Lei Zhu

arXiv:2605.16961·cs.CV·May 19, 2026

TL;DR

This paper introduces Latent Action Control (LAC), a method that makes reasoning actionable within unified image generation models by learning latent action trajectories that steer the generator, improving control and fidelity.

Contribution

LAC encodes reasoning as hidden continuous actions inside the generator, so that reasoning cues directly condition image generation rather than remaining unused by it.

Findings

01 LAC improves compositional and knowledge-grounded generation across multiple benchmarks.

02 The gains are significant on prompts involving spatial relations, attribute binding, and world knowledge.

03 Latent interventions show that the learned action trajectory influences the generator.

Abstract

Unified multimodal models can encode visual understanding and image generation within a shared backbone, yet understanding does not automatically translate into control: models may infer objects, relations, or knowledge cues but fail to instantiate them in the generated image. We propose Latent Action Control (LAC), which makes reasoning actionable by representing it as hidden continuous actions inside a unified generator. Given a prompt, LAC rolls out a role-structured latent trajectory for planning, internal visual drafting, diagnosis, and refinement, and injects these actions into the hidden stream that conditions flow-based generation, without producing reasoning tokens or intermediate images. Since such action trajectories are unobserved, LAC learns them through prior-guided variational latent action alignment from training-only rendered semantic priors, draft image features, and…
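The inference-time flow described in the abstract (roll out a role-structured latent trajectory, inject it into the hidden stream, then run flow-based generation) can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the recurrent update in `rollout_actions`, the concatenation in `inject`, the linear velocity field, and every name, shape, and weight matrix (`W`, `V`, `role_embeddings`) are assumptions made purely for illustration. Only the four role names come from the abstract.

```python
import numpy as np

# Role names follow the abstract; everything else here is a hypothetical toy.
ROLES = ("plan", "draft", "diagnose", "refine")


def rollout_actions(prompt_state, role_embeddings, W, steps_per_role=2):
    """Roll out a continuous latent trajectory, one segment per role.

    Assumed update rule (not from the paper): each action is a tanh of a
    linear map applied to the prompt state, the previous action, and the
    embedding of the current role.
    """
    actions = []
    a = np.zeros_like(prompt_state)
    for role in ROLES:
        for _ in range(steps_per_role):
            a = np.tanh(W @ (prompt_state + a + role_embeddings[role]))
            actions.append(a)
    return np.stack(actions)  # shape: (len(ROLES) * steps_per_role, d)


def inject(hidden, actions):
    """Append latent actions to the hidden stream that conditions generation.

    No reasoning tokens or intermediate images are produced; the actions
    stay continuous vectors. Concatenation is an assumed injection scheme.
    """
    return np.concatenate([hidden, actions], axis=0)


def velocity(x, t, cond, V):
    """Toy conditional velocity field pulling x toward a conditioned target.

    The argument t is unused in this toy field.
    """
    target = V @ cond.mean(axis=0)
    return target - x


def generate(cond, V, n_steps=50, seed=0):
    """Euler-integrate dx/dt = velocity(x, t, cond) from noise, t in [0, 1]."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(V.shape[0])
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity(x, i * dt, cond, V)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 8
    prompt_state = rng.standard_normal(d)
    role_embeddings = {r: rng.standard_normal(d) for r in ROLES}
    W = rng.standard_normal((d, d)) / np.sqrt(d)
    V = rng.standard_normal((16, d))
    hidden = rng.standard_normal((5, d))

    actions = rollout_actions(prompt_state, role_embeddings, W)
    with_actions = generate(inject(hidden, actions), V)
    # Simple latent intervention: zero the actions and regenerate.
    ablated = generate(inject(hidden, np.zeros_like(actions)), V)
    print(np.linalg.norm(with_actions - ablated))
```

The final lines mimic, in spirit only, the latent-intervention check listed under Findings: replacing the action trajectory with zeros while holding the noise seed fixed changes the output, which shows that in this toy setup the generator actually depends on the injected actions.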
