StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives
Jinghao Hu, Yuhe Zhang, GuoHua Geng, Kang Li, Han Zhang

TL;DR
StoryTailor is a zero-shot pipeline that generates coherent, action-rich image sequences from a long narrative prompt, per-subject references, and grounding boxes. It preserves subject identity and background continuity without fine-tuning and runs on a single RTX 4090 (24 GB).
Contribution
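The paper introduces a zero-shot method for multi-subject visual narratives built on three modules: Gaussian-Centered Attention (GCA), Action-Boost Singular Value Reweighting (AB-SVR), and Selective Forgetting Cache (SFC). Together they target action faithfulness, subject identity fidelity, and cross-frame background continuity.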
Findings
CLIP-T improves by up to 15% over baselines
Inference is faster than FluxKontext on a 24 GB GPU
Qualitative results show expressive, stable scene generation
Abstract
Generating multi-frame, action-rich visual narratives without fine-tuning faces a threefold tension: action text faithfulness, subject identity fidelity, and cross-frame background continuity. We propose StoryTailor, a zero-shot pipeline that runs on a single RTX 4090 (24 GB) and produces temporally coherent, identity-preserving image sequences from a long narrative prompt, per-subject references, and grounding boxes. Three synergistic modules drive the system: Gaussian-Centered Attention (GCA) to dynamically focus on each subject core and ease grounding-box overlaps; Action-Boost Singular Value Reweighting (AB-SVR) to amplify action-related directions in the text embedding space; and Selective Forgetting Cache (SFC) that retains transferable background cues, forgets nonessential history, and selectively surfaces retained cues to build cross-scene semantic ties. Compared with baseline…
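To make the idea behind Gaussian-Centered Attention more concrete, the sketch below builds a Gaussian weight map centered on each grounding box and uses it to decide which subject "owns" a pixel where two boxes overlap. This is only an illustration under stated assumptions, not the paper's formulation: the function names, the `sigma_scale` parameter, and the hard per-pixel assignment are all hypothetical, and the abstract does not specify how GCA is actually applied inside the attention layers.

```python
import math


def gaussian_box_weights(box, width, height, sigma_scale=0.25):
    """Return a height x width grid of weights that peak at the box center.

    box is (x0, y0, x1, y1) in pixel coordinates; weights are 0 outside it.
    sigma_scale is a hypothetical knob: Gaussian std as a fraction of box size.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    sx = max((x1 - x0) * sigma_scale, 1e-6)
    sy = max((y1 - y0) * sigma_scale, 1e-6)
    grid = []
    for y in range(height):
        row = []
        for x in range(width):
            if x0 <= x < x1 and y0 <= y < y1:
                # Distance is measured from the pixel center (x + 0.5, y + 0.5).
                d2 = ((x + 0.5 - cx) / sx) ** 2 + ((y + 0.5 - cy) / sy) ** 2
                row.append(math.exp(-0.5 * d2))
            else:
                row.append(0.0)
        grid.append(row)
    return grid


def assign_overlap(weights_by_subject):
    """Per pixel, pick the subject with the largest weight (None if all are 0)."""
    names = list(weights_by_subject)
    first = weights_by_subject[names[0]]
    height, width = len(first), len(first[0])
    owner = []
    for y in range(height):
        row = []
        for x in range(width):
            best = max(names, key=lambda n: weights_by_subject[n][y][x])
            row.append(best if weights_by_subject[best][y][x] > 0 else None)
        owner.append(row)
    return owner


# Two hypothetical subjects whose boxes overlap in columns 3 and 4 of an 8x4 grid.
boxes = {"A": (0, 0, 5, 4), "B": (3, 0, 8, 4)}
weights = {name: gaussian_box_weights(b, width=8, height=4) for name, b in boxes.items()}
owner = assign_overlap(weights)
print(owner[0])
```

In this toy setup each overlapping column goes to the subject whose box center is closer, which is one simple way a center-weighted map can reduce ambiguity between overlapping grounding boxes.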
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
