VisuCraft: Enhancing Large Vision-Language Models for Complex Visual-Guided Creative Content Generation via Structured Information Extraction
Rongxin Jiang, Robert Long, Chenghao Gu, Mingrui Yan

TL;DR
VisuCraft is a framework that significantly improves large vision-language models' ability to generate complex, creative, and visually grounded long-form texts by integrating structured visual information and dynamic prompt generation.
Contribution
VisuCraft introduces a multimodal structured information extractor and a dynamic prompt generation module to improve the creativity and visual fidelity of LVLM output.
Findings
Outperforms baseline LVLMs in creativity and instruction adherence
Achieves significant improvements in story and poetry generation
Validated on the ImageStoryGen-500K dataset with VisuGen Metrics
Abstract
This paper introduces VisuCraft, a novel framework designed to significantly enhance the capabilities of Large Vision-Language Models (LVLMs) in complex visual-guided creative content generation. Existing LVLMs often exhibit limitations in maintaining high visual fidelity, genuine creativity, and precise adherence to nuanced user instructions when generating long-form texts. VisuCraft addresses these challenges by integrating a multimodal structured information extractor (E) and a dynamic prompt generation module (G). The extractor distills fine-grained visual attributes from input images into a rich, structured representation, which the dynamic prompt module then combines with user instructions to create highly optimized prompts for underlying LVLMs (e.g., LLaVA, InstructBLIP). Evaluated on the self-constructed ImageStoryGen-500K dataset using VisuGen Metrics (Visual Grounding,…
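The two-stage flow described in the abstract (the extractor E produces a structured representation, and the prompt module G merges it with the user instruction) might be sketched as follows. This is only an illustration under stated assumptions: the `VisualInfo` fields, the `build_prompt` function, and the prompt layout are hypothetical and are not taken from the paper, and the extractor itself is replaced by a hand-written example.

```python
from dataclasses import dataclass, field


@dataclass
class VisualInfo:
    """Hypothetical structured output of the extractor E (field names are assumptions)."""
    scene: str
    # Maps each detected object to its list of attributes.
    objects: dict[str, list[str]] = field(default_factory=dict)
    # Each relation is a (subject, predicate, object) triple.
    relations: list[tuple[str, str, str]] = field(default_factory=list)


def build_prompt(info: VisualInfo, instruction: str) -> str:
    """Stand-in for the prompt module G: merge visual structure with the instruction."""
    lines = [f"Scene: {info.scene}"]
    for name, attrs in info.objects.items():
        detail = ", ".join(attrs) if attrs else "no attributes"
        lines.append(f"Object: {name} ({detail})")
    for subj, pred, obj in info.relations:
        lines.append(f"Relation: {subj} {pred} {obj}")
    lines.append(f"Task: {instruction.strip()}")
    return "\n".join(lines)


# Hand-written stand-in for what an extractor might return for one image.
example = VisualInfo(
    scene="harbor at dusk",
    objects={"boat": ["red", "wooden"], "lighthouse": []},
    relations=[("boat", "moored near", "lighthouse")],
)
prompt = build_prompt(example, "Write a short poem grounded in this image. ")
print(prompt)
```

The resulting string would then be passed, together with the image, to the underlying LVLM. Keeping the visual facts in explicit fields rather than free text is one plausible way to make the grounding checkable, though the paper's actual representation may differ.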
