AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation
Lu Qiu, Yizhuo Li, Yuying Ge, Yixiao Ge, Ying Shan, Xihui Liu

TL;DR
AnimeShooter is a comprehensive multi-shot animation dataset with hierarchical annotations and visual consistency, enabling improved reference-guided video generation with a new baseline model.
Contribution
We introduce AnimeShooter, a novel dataset with detailed annotations and visual consistency features, and propose AnimeShooterGen, a baseline model for multi-shot animation generation.
Findings
A model trained on AnimeShooter shows superior visual consistency.
The dataset enables effective reference-guided multi-shot video generation.
AnimeShooter facilitates research in coherent animated content creation.
Abstract
Recent advances in AI-generated content (AIGC) have significantly accelerated animation production. To produce engaging animations, it is essential to generate coherent multi-shot video clips with narrative scripts and character references. However, existing public datasets primarily focus on real-world scenarios with global descriptions, and lack reference images for consistent character guidance. To bridge this gap, we present AnimeShooter, a reference-guided multi-shot animation dataset. AnimeShooter features comprehensive hierarchical annotations and strong visual consistency across shots through an automated pipeline. Story-level annotations provide an overview of the narrative, including the storyline, key scenes, and main character profiles with reference images, while shot-level annotations decompose the story into consecutive shots, each annotated with scene, characters, and…
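The two annotation levels described in the abstract can be pictured as a nested record. The sketch below is only an illustration under stated assumptions: the class names, field names, and sample values are hypothetical and are not the dataset's actual schema. The shot-level fields are limited to the two the excerpt names (scene and characters), because the rest of that list is cut off above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CharacterProfile:
    """Story-level main character entry (hypothetical field names)."""
    name: str
    description: str
    reference_image: str  # assumed here to be a file path to the reference image


@dataclass
class Shot:
    """Shot-level annotation; only the fields named in the excerpt are modelled."""
    index: int
    scene: str
    characters: List[str]  # names that refer to story-level character profiles


@dataclass
class Story:
    """Story-level annotation that owns an ordered list of consecutive shots."""
    storyline: str
    key_scenes: List[str]
    main_characters: List[CharacterProfile]
    shots: List[Shot] = field(default_factory=list)

    def references_for(self, shot: Shot) -> List[str]:
        """Return reference images for the known characters appearing in a shot."""
        by_name = {c.name: c.reference_image for c in self.main_characters}
        return [by_name[n] for n in shot.characters if n in by_name]


# Made-up sample data, purely for illustration.
story = Story(
    storyline="A fox and a crow search for a lost lantern.",
    key_scenes=["forest", "river"],
    main_characters=[
        CharacterProfile("Fox", "small red fox", "refs/fox.png"),
        CharacterProfile("Crow", "talkative crow", "refs/crow.png"),
    ],
    shots=[
        Shot(0, "forest", ["Fox"]),
        Shot(1, "river", ["Fox", "Crow"]),
    ],
)
shot_refs = [story.references_for(s) for s in story.shots]
```

Keeping reference images at the story level and only character names at the shot level means every shot that features a character resolves to the same reference image, which is one simple way to express the cross-shot guidance the abstract describes.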
Peer Reviews
Decision · Submitted to ICLR 2026
AnimeShooter provides structured, story-aware, and reference-guided annotations, filling a significant gap in current video-generation datasets. Clear separation between story-level and shot-level elements enables both global narrative control and local visual coherence. The open release of both dataset and baseline has high potential to become a standard benchmark for multi-shot animation generation.
1. Hierarchical captioning reduces drift but lacks visual grounding, leaving potential hallucination issues. 2. Only basic normalization is used, with no explicit domain alignment across different animation styles; for the object segmentation methods, fine-tuning on ~500 frames offers limited adaptation from real-world to animated content. 3. Keyframe selection applies real-video heuristics, which are not well suited to low-frame-rate animation. 4. No explicit loss or alignment; character consistency reli
1. The paper introduces a multi-shot video dataset in the anime domain, along with a subset containing audio data, laying a foundation for advancing research in animation storytelling. 2. The paper is well-organized and highly readable, with figures concisely illustrating the workflow, data structure, and qualitative comparisons.
1. In the proposed method, visual information such as reference images and different shot contexts is encoded by the MLLM and aligned with the text embeddings of the diffusion model. This can lead to the loss of fine details from the reference images or shot scenes. As shown in Figure 5, without LoRA enhancement, the consistency of details is not satisfactory. However, many real-world workflows prefer models that do not require additional fine-tuning. 2. The baseline methods compared in this pa
1. The paper isolates a genuinely under‑served setting: reference‑guided multi‑shot animation generation with cross‑shot character/style consistency, rather than single‑shot real‑world videos or global captions. 2. The combination of story‑level elements (storyline, scenes, character cards + reference images) and fine‑grained shot‑level captions is valuable for autoregressive modeling and evaluation. 3. Conditioning on a reference image + prior last frames + text through an MLLM + Q‑Former adapt
1. The compared baselines are relatively weak (e.g., IP‑Adapter+I2V and CogVideo‑LoRA). Stronger MLLM-based baselines might also be considered. 2. The CLIP and DreamSim evaluation metrics cannot capture some aspects of video quality, such as motion smoothness. Better automatic metrics for these aspects should be investigated (e.g., as in VBench). 3. More qualitative examples/analysis should be included for generalization to longer shots (e.g., 15 shots) to support
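For context on the embedding-based metrics the reviewers discuss, one generic way to score cross-shot consistency is the mean pairwise cosine similarity between per-shot embeddings. The sketch below assumes the embeddings have already been computed by some image encoder (CLIP would be one choice). The function names are hypothetical, and this is not the paper's evaluation protocol, which the excerpt above does not describe.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cross_shot_consistency(embeddings: Sequence[Sequence[float]]) -> float:
    """Mean cosine similarity over all unordered pairs of shot embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two shot embeddings")
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Toy 2-D vectors standing in for real encoder outputs.
same = cross_shot_consistency([[1.0, 0.0], [2.0, 0.0]])
mixed = cross_shot_consistency([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

A score like this only compares appearance between shots, so it says nothing about motion within a shot, which is the limitation the second review point raises.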
Taxonomy
Topics: Human Motion and Animation · Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis
