CountCluster: Training-Free Object Quantity Guidance with Cross-Attention Map Clustering for Text-to-Image Generation
Joohyeon Lee, Jin-Seop Lee, Jee-Hyong Lee

TL;DR
CountCluster is a training-free method that improves object quantity accuracy in text-to-image generation by clustering cross-attention maps at early denoising steps, aligning generated images with specified object counts.
Contribution
It introduces a novel inference-time clustering approach for cross-attention maps that enhances object count control without external modules or training.
Findings
Achieves an 18.5% improvement in object count accuracy.
Outperforms existing methods in quantity control across various prompts.
Does not require additional training or external tools.
Abstract
Diffusion-based text-to-image generation models have demonstrated strong performance in terms of image quality and diversity. However, they still struggle to generate images that accurately reflect the number of objects specified in the input prompt. Several approaches have been proposed that rely on either external counting modules for iterative refinement or quantity representations derived from learned tokens or latent features. Yet these approaches remain limited in accurately reflecting the specified number of objects, and they overlook an important structural characteristic: the number of object instances in the generated image is largely determined in the early timesteps of the denoising process. To correctly reflect the object quantity for image generation, the highly activated regions in the object cross-attention map at the early timesteps should match the input object quantity,…
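As a rough illustration of the idea that highly activated regions of an object's cross-attention map should form as many groups as the requested object count, the sketch below selects the strongest positions of a 2D map and groups them with a plain k-means loop. This is not the authors' implementation: the function name `cluster_attention_peaks`, the `top_fraction` threshold, and the choice of k-means are all assumptions made for illustration, and the sketch does not show how the clusters would be used to guide denoising.

```python
import numpy as np

def cluster_attention_peaks(attn_map, num_objects, top_fraction=0.1, iters=20, seed=0):
    """Group the most activated positions of a 2D attention map into num_objects clusters.

    Hypothetical helper: selection rule and clustering algorithm are assumptions.
    """
    h, w = attn_map.shape
    flat = attn_map.ravel()
    # Keep the top fraction of positions, but at least one point per requested object.
    k = min(flat.size, max(num_objects, int(top_fraction * flat.size)))
    top_idx = np.argpartition(flat, -k)[-k:]
    ys, xs = np.unravel_index(top_idx, (h, w))
    points = np.stack([ys, xs], axis=1).astype(float)

    # Plain k-means on (row, col) coordinates, initialised from distinct points.
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=num_objects, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(num_objects):
            members = points[labels == c]
            if len(members) > 0:  # leave an empty cluster's center where it is
                centers[c] = members.mean(axis=0)

    # Final assignment against the updated centers.
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    return points.astype(int), labels, centers
```

In a real pipeline the map would come from the object token's cross-attention at an early denoising step; here any 2D array of non-negative scores works as input.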
