VSD2M: A Large-scale Vision-language Sticker Dataset for Multi-frame Animated Sticker Generation

Zhiqiang Yuan; Jiapei Zhang; Ying Deng; Yeshuang Zhu; Jie Zhou; Jinchao Zhang

arXiv:2412.08259 · cs.HC · March 27, 2025

TL;DR

This paper introduces VSD2M, the largest dataset for animated sticker generation, and proposes a new Spatial Temporal Interaction (STI) layer that enhances video generation methods for creating animated stickers from text prompts.

Contribution

The paper constructs VSD2M, the first large-scale vision-language sticker dataset, and develops a novel STI layer to improve animated sticker generation performance.

Findings

1. VSD2M contains two million static and animated stickers.
2. The STI layer improves information utilization in video generation.
3. Baseline models trained on VSD2M show promising results.

Abstract

As a common form of communication on social media, stickers are popular with users for their ability to convey emotions in a vivid, cute, and interesting way. People prefer to find an appropriate sticker through retrieval rather than create one, because creating a sticker is time-consuming and relies on rule-based creative tools with limited capabilities. Advanced text-to-video algorithms have spawned numerous general video generation systems that let users create customized, high-quality, photo-realistic videos from simple text prompts alone. However, creating customized animated stickers, which have lower frame rates and more abstract semantics than videos, is greatly hindered by difficulties in data acquisition and incomplete benchmarks. To facilitate research in the animated sticker generation (ASG) field, we first…

Taxonomy

Topics: Interactive and Immersive Displays · Augmented Reality Applications · Human Motion and Animation