Sakuga-42M Dataset: Scaling Up Cartoon Research
Zhenglin Pan

TL;DR
This paper introduces Sakuga-42M, which the authors present as the first large-scale cartoon animation dataset. It contains 42 million keyframes and is meant to improve cartoon understanding and generation by fine-tuning large models on it.
Contribution
The creation of Sakuga-42M, a large-scale cartoon dataset with extensive annotations, and a demonstration that it improves cartoon comprehension and generation tasks.
Findings
Fine-tuning foundation models on Sakuga-42M improves cartoon task performance.
The dataset covers diverse artistic styles, regions, and years.
Large-scale cartoon data benefits model generalization and robustness.
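As a rough illustration of what fine-tuning a CLIP-style model on keyframe and caption pairs optimizes, the sketch below computes the symmetric contrastive loss used in contrastive language-image pre-training. It is a minimal NumPy sketch, not the authors' training code: the function names, the temperature value of 0.07, and the random embeddings standing in for encoder outputs are all assumptions made for illustration.

```python
import numpy as np

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (batch, dim) arrays; row i of each is a matched pair.
    # temperature is a hypothetical fixed value; real models often learn it.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Cosine-similarity logits between every image and every caption.
    logits = image_emb @ text_emb.T / temperature
    idx = np.arange(logits.shape[0])

    # Cross-entropy in both directions, with the diagonal as the targets.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(8, 16))  # stand-in for encoder outputs
    print("aligned pairs:   ", clip_contrastive_loss(emb, emb))
    print("mismatched pairs:", clip_contrastive_loss(emb, np.roll(emb, 1, axis=0)))
```

The loss is low when each keyframe embedding is closest to its own caption embedding and high when the pairs are shuffled, which is the signal fine-tuning on paired cartoon data would follow under this assumed objective.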
Abstract
Hand-drawn cartoon animation employs sketches and flat-color segments to create the illusion of motion. While recent advancements like CLIP, SVD, and Sora show impressive results in understanding and generating natural video by scaling large models with extensive datasets, they are not as effective for cartoons. Through our empirical experiments, we argue that this ineffectiveness stems from a notable bias in hand-drawn cartoons that diverges from the distribution of natural videos. Can we harness the success of the scaling paradigm to benefit cartoon research? Unfortunately, until now, there has not been a sizable cartoon dataset available for exploration. In this research, we propose the Sakuga-42M Dataset, the first large-scale cartoon animation dataset. Sakuga-42M comprises 42 million keyframes covering various artistic styles, regions, and years, with comprehensive semantic…
Taxonomy
Topics: Comics and Graphic Narratives
Methods: Contrastive Language-Image Pre-training
