Enhanced Multi-Scale Cross-Attention for Person Image Generation
Hao Tang, Ling Shao, Nicu Sebe, Luc Van Gool

TL;DR
This paper introduces XingGAN, a multi-scale cross-attention GAN for person image generation that models a person's appearance and shape in two separate branches. It outperforms existing GAN-based methods in quality and trains and runs faster than diffusion-based methods.
Contribution
The paper presents cross-attention blocks that transfer and update shape and appearance embeddings, multi-scale cross-attention blocks that capture correlations across scales and sub-regions, and a densely connected co-attention module.
Findings
Outperforms current GAN-based methods in quality.
Achieves comparable results to diffusion-based methods.
Significantly faster training and inference than diffusion models.
Abstract
In this paper, we propose a novel cross-attention-based generative adversarial network (GAN) for the challenging person image generation task. Cross-attention is a novel and intuitive multi-modal fusion method in which an attention/correlation matrix is calculated between two feature maps of different modalities. Specifically, we propose the novel XingGAN (or CrossingGAN), which consists of two generation branches that capture the person's appearance and shape, respectively. Moreover, we propose two novel cross-attention blocks to effectively transfer and update the person's shape and appearance embeddings for mutual improvement. This has not been considered by any other existing GAN-based image generation work. To further learn the long-range correlations between different person poses at different scales and sub-regions, we propose two novel multi-scale cross-attention blocks. To…
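The attention/correlation matrix described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `cross_attention`, the scaling by the square root of the channel count, and the absence of learned projections or residual connections are all assumptions made for illustration. Each feature map is assumed to be flattened into a list of per-position channel vectors.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query_feats, context_feats):
    """Hypothetical single cross-attention step (illustrative only).

    query_feats:   N x C list, e.g. flattened shape features.
    context_feats: M x C list, e.g. flattened appearance features.
    Returns (updated N x C features, N x M attention matrix).
    """
    # Assumption: scaled dot-product scores, as in standard attention.
    scale = math.sqrt(len(query_feats[0]))
    attn = []
    for q in query_feats:
        scores = [sum(a * b for a, b in zip(q, k)) / scale
                  for k in context_feats]
        attn.append(softmax(scores))  # each row sums to 1
    channels = len(context_feats[0])
    # Each query position becomes a weighted mix of context features.
    updated = [
        [sum(w * k[c] for w, k in zip(weights, context_feats))
         for c in range(channels)]
        for weights in attn
    ]
    return updated, attn


# Toy example: 3 shape positions attend over 2 appearance positions.
shape_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
appearance_feats = [[1.0, 0.0], [0.0, 1.0]]
updated, attn = cross_attention(shape_feats, appearance_feats)
```

Swapping the two arguments gives attention in the opposite direction, which is one plausible reading of how two branches could update each other; the paper's actual blocks may differ.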
Taxonomy
Topics: Video Surveillance and Tracking Methods · Human Pose and Action Recognition · Face recognition and analysis
Methods: Softmax · Attention Is All You Need
