StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives

Jinghao Hu; Yuhe Zhang; GuoHua Geng; Kang Li; Han Zhang

arXiv:2602.21273 · cs.CV · March 9, 2026

TL;DR

StoryTailor is a zero-shot pipeline that generates coherent, action-rich visual narratives from long prompts, preserving subject identity and background continuity without fine-tuning, and runs on a single RTX 4090 (24 GB).

Contribution

The paper introduces a zero-shot method built on three modules (GCA, AB-SVR, and SFC) for producing multi-subject visual narratives, targeting action text faithfulness, subject identity fidelity, and cross-frame background continuity.

Findings

01

CLIP-T improves by up to 15% over baselines

02

Inference is faster than FluxKontext on a 24 GB GPU

03

Qualitative results show expressive, stable scene generation

Abstract

Generating multi-frame, action-rich visual narratives without fine-tuning faces a threefold tension: action text faithfulness, subject identity fidelity, and cross-frame background continuity. We propose StoryTailor, a zero-shot pipeline that runs on a single RTX 4090 (24 GB) and produces temporally coherent, identity-preserving image sequences from a long narrative prompt, per-subject references, and grounding boxes. Three synergistic modules drive the system: Gaussian-Centered Attention (GCA) to dynamically focus on each subject core and ease grounding-box overlaps; Action-Boost Singular Value Reweighting (AB-SVR) to amplify action-related directions in the text embedding space; and Selective Forgetting Cache (SFC) that retains transferable background cues, forgets nonessential history, and selectively surfaces retained cues to build cross-scene semantic ties. Compared with baseline…
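The abstract names the modules but does not give their formulas, so the following is only a rough NumPy sketch of two of the underlying ideas, not the authors' implementation. It assumes that "Gaussian-centered" means weighting each grounding box by a 2D Gaussian peaked at the box centre, and that overlaps are resolved by giving each pixel to the subject with the larger weight. It also assumes that singular value reweighting means scaling selected singular values of the text-embedding matrix; which directions count as action-related is not stated in the abstract, so the top-k choice below is a placeholder. All function names and parameters (`sigma_scale`, `top_k`, `gain`) are hypothetical.

```python
import numpy as np


def gaussian_box_weight(height, width, box, sigma_scale=0.5):
    """Weight map that peaks at the centre of a grounding box.

    box is (x0, y0, x1, y1) in pixel coordinates. sigma_scale is an
    assumed hyperparameter tying the Gaussian width to the box size.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    sx = max((x1 - x0) * sigma_scale, 1e-6)
    sy = max((y1 - y0) * sigma_scale, 1e-6)
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-0.5 * (((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2))


def assign_pixels(height, width, boxes, sigma_scale=0.5):
    """For each pixel, return the index of the subject with the largest weight.

    This is one simple way to ease overlapping boxes: a pixel in the
    overlap goes to the subject whose centre it is closer to (relative
    to that subject's box size).
    """
    weights = np.stack(
        [gaussian_box_weight(height, width, b, sigma_scale) for b in boxes]
    )
    return np.argmax(weights, axis=0)


def boost_top_singular_values(embeddings, top_k=1, gain=1.5):
    """Scale the top_k singular values of a (tokens, dim) embedding matrix.

    Placeholder assumption: the leading singular directions are the ones
    boosted. The paper's criterion for "action-related" is not given here.
    """
    u, s, vt = np.linalg.svd(embeddings, full_matrices=False)
    s = s.copy()
    s[:top_k] *= gain
    return (u * s) @ vt


# Two overlapping grounding boxes on an 11x11 grid.
owner = assign_pixels(11, 11, [(0, 0, 6, 6), (4, 4, 10, 10)])

# Toy embedding matrix standing in for encoded prompt tokens.
toy_embeddings = np.random.default_rng(0).normal(size=(6, 4))
boosted = boost_top_singular_values(toy_embeddings, top_k=1, gain=1.5)
```

In this sketch a pixel inside both boxes is assigned by relative distance to each centre, and the reweighted embedding keeps the same singular vectors while its leading singular value is scaled by `gain`.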

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition