PokeFusion Attention: A Lightweight Cross-Attention Mechanism for Style-Conditioned Image Generation
Jingbang Tang

TL;DR
PokeFusion Attention introduces a lightweight, decoder-level cross-attention mechanism that models style as a learned prior, enabling efficient style-conditioned image generation without external references.
Contribution
The paper presents a parameter-efficient, plug-and-play style conditioning method that improves style fidelity and structural consistency in diffusion-based image generation.
Findings
Enhances style fidelity and semantic alignment in stylized character generation.
Maintains low parameter overhead and simple inference.
Outperforms adapter-based baselines in style-conditioned generation.
Abstract
Style-conditioned text-to-image (T2I) generation with diffusion models requires both stable character structure and consistent, fine-grained style expression across diverse prompts. Existing approaches either rely on text-only prompting, which is often insufficient to specify visual style, or introduce reference-based adapters that depend on external images at inference time, increasing system complexity and limiting deployment flexibility. We propose PokeFusion Attention, a lightweight decoder-level cross-attention mechanism that models style as a learned distributional prior rather than instance-level conditioning. The method integrates textual semantics with learned style embeddings directly within the diffusion decoder, enabling effective stylized generation without requiring reference images at inference time. Only the cross-attention layers and a compact style projection module…
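The abstract describes cross-attention in which the decoder attends to both text semantics and learned style embeddings. The following is a minimal sketch of that idea, not the paper's implementation. It assumes the style embeddings are projected into the text-token space and concatenated with the text tokens to form one key/value sequence. The function name, the weight matrices (w_q, w_k, w_v, w_style), and all dimensions are hypothetical and chosen only for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights or features."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def style_text_cross_attention(latents, text_tokens, style_tokens,
                               w_q, w_k, w_v, w_style):
    # Hypothetical style projection: map learned style embeddings
    # into the same space as the text tokens.
    style_ctx = matmul(style_tokens, w_style)
    # Assumption: text and style tokens share one key/value sequence.
    context = text_tokens + style_ctx
    q = matmul(latents, w_q)   # queries come from image latents
    k = matmul(context, w_k)   # keys come from text + style
    v = matmul(context, w_v)   # values come from text + style
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy sizes, chosen arbitrarily for the sketch.
rng = random.Random(0)
n_lat, n_text, n_style = 4, 3, 2
d_lat, d_ctx, d_style, d_attn = 8, 6, 5, 8

latents = rand_matrix(n_lat, d_lat, rng)
text_tokens = rand_matrix(n_text, d_ctx, rng)
style_tokens = rand_matrix(n_style, d_style, rng)  # learned prior, no reference image
w_q = rand_matrix(d_lat, d_attn, rng)
w_k = rand_matrix(d_ctx, d_attn, rng)
w_v = rand_matrix(d_ctx, d_attn, rng)
w_style = rand_matrix(d_style, d_ctx, rng)

out, weights = style_text_cross_attention(
    latents, text_tokens, style_tokens, w_q, w_k, w_v, w_style
)
print(len(out), len(out[0]), len(weights[0]))
```

In this sketch each latent position produces one attention row over the three text tokens plus the two style tokens, so style enters generation through fixed learned embeddings rather than through an image supplied at inference time.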
