TL;DR
ReCorD is a training-free method that couples latent diffusion models with visual language models to depict human-object interactions (HOIs) with higher fidelity and efficiency.
Contribution
It introduces a novel reasoning and correcting framework that refines HOI generation without additional training, combining interaction-aware reasoning and correction modules for better accuracy.
Findings
Outperforms existing methods in HOI classification score
Achieves better FID and Verb CLIP-Score than existing methods
Reduces computational requirements
Abstract
Diffusion models revolutionize image generation by leveraging natural language to guide the creation of multimedia content. Despite significant advancements in such generative models, challenges persist in depicting detailed human-object interactions (HOIs), especially regarding pose and object placement accuracy. We introduce a training-free method named Reasoning and Correcting Diffusion (ReCorD) to address these challenges. Our model couples Latent Diffusion Models with Visual Language Models to refine the generation process, ensuring precise depictions of HOIs. We propose an interaction-aware reasoning module to improve the interpretation of the interaction, along with an interaction correcting module to delicately refine the output image for more precise HOI generation. Through a meticulous process of pose selection and object positioning, ReCorD achieves superior fidelity in generated…
Taxonomy
Methods: Diffusion
