VisuCraft: Enhancing Large Vision-Language Models for Complex Visual-Guided Creative Content Generation via Structured Information Extraction

Rongxin Jiang; Robert Long; Chenghao Gu; Mingrui Yan

arXiv:2508.02890 · cs.CV · August 6, 2025

TL;DR

VisuCraft is a framework that significantly improves large vision-language models' ability to generate complex, creative, and visually grounded long-form texts by integrating structured visual information and dynamic prompt generation.

Contribution

The paper introduces a multimodal structured information extractor and a dynamic prompt generation module to improve the creativity and visual fidelity of LVLM output.

Findings

01. Outperforms baseline LVLMs in creativity and instruction adherence
02. Achieves significant improvements in story and poetry generation
03. Validated on the ImageStoryGen-500K dataset with VisuGen Metrics

Abstract

This paper introduces VisuCraft, a novel framework designed to significantly enhance the capabilities of Large Vision-Language Models (LVLMs) in complex visual-guided creative content generation. Existing LVLMs often exhibit limitations in maintaining high visual fidelity, genuine creativity, and precise adherence to nuanced user instructions when generating long-form texts. VisuCraft addresses these challenges by integrating a multimodal structured information extractor (E) and a dynamic prompt generation module (G). The extractor distills fine-grained visual attributes from input images into a rich, structured representation, which the dynamic prompt module then combines with user instructions to create highly optimized prompts for underlying LVLMs (e.g., LLaVA, InstructBLIP). Evaluated on the self-constructed ImageStoryGen-500K dataset using VisuGen Metrics (Visual Grounding,…
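The two-stage flow described in the abstract (extractor E produces a structured representation, module G merges it with the user instruction into a prompt) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `VisualInfo` fields, the `build_prompt` function, and the prompt wording are hypothetical and are not taken from the paper, which does not specify its schema or prompt template in this summary. The extractor itself is not modeled; its output is hand-written.

```python
from dataclasses import dataclass, field


@dataclass
class VisualInfo:
    """Hypothetical structured output of the extractor E (fields are assumed)."""
    objects: list[str] = field(default_factory=list)
    attributes: dict[str, list[str]] = field(default_factory=dict)
    scene: str = ""


def build_prompt(info: VisualInfo, instruction: str) -> str:
    """Hypothetical stand-in for G: merge visual facts with the user instruction."""
    lines = ["Visual facts extracted from the image:"]
    if info.scene:
        lines.append(f"- scene: {info.scene}")
    for obj in info.objects:
        # Attach attributes to an object only when the extractor supplied some.
        attrs = ", ".join(info.attributes.get(obj, []))
        lines.append(f"- {obj}: {attrs}" if attrs else f"- {obj}")
    lines.append("")
    lines.append(f"Task: {instruction}")
    lines.append("Ground every detail in the facts above.")
    return "\n".join(lines)


# Hand-written example standing in for real extractor output.
info = VisualInfo(
    objects=["lighthouse", "gull"],
    attributes={"lighthouse": ["white", "weathered"]},
    scene="rocky coast at dusk",
)
prompt = build_prompt(info, "Write a short poem about this image.")
print(prompt)
```

The resulting string would then be passed, together with the image, to an underlying LVLM such as LLaVA or InstructBLIP. Keeping the visual facts in a typed structure rather than free text is one plausible way to make the prompt builder deterministic and easy to inspect.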
