MotiF: Making Text Count in Image Animation with Motion Focal Loss

Shijie Wang; Samaneh Azadi; Rohit Girdhar; Saketh Rambhatla; Chen Sun; Xi Yin

arXiv:2412.16153·cs.CV·March 25, 2025

TL;DR

MotiF introduces a motion focal loss that emphasizes motion regions in text-guided image animation, significantly enhancing text alignment and motion accuracy in generated videos.

Contribution

The paper proposes MotiF, a novel motion focal loss using optical flow to improve text-guided video generation, and introduces TI2V Bench for robust evaluation.

Findings

01 MotiF outperforms nine open-source models in human evaluations, with a 72% preference rate.

02 The motion focal loss improves both alignment with the text prompt and motion realism.

03 TI2V Bench provides a new dataset for comprehensive assessment of TI2V models.

Abstract

Text-Image-to-Video (TI2V) generation aims to generate a video from an image following a text description, which is also referred to as text-guided image animation. Most existing methods struggle to generate videos that align well with the text prompts, particularly when motion is specified. To overcome this limitation, we introduce MotiF, a simple yet effective approach that directs the model's learning to the regions with more motion, thereby improving the text alignment and motion generation. We use optical flow to generate a motion heatmap and weight the loss according to the intensity of the motion. This modified objective leads to noticeable improvements and complements existing methods that utilize motion priors as model inputs. Additionally, due to the lack of a diverse benchmark for evaluating TI2V generation, we propose TI2V Bench, a dataset consisting of 320 image-text pairs…
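The weighting idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the per-frame max normalization of the flow magnitude, the `1 + lam * heatmap` weighting form, and the `lam` parameter are all assumptions made for illustration, since the abstract only states that the loss is weighted by motion intensity derived from optical flow.

```python
import numpy as np

def motion_heatmap(flow, eps=1e-8):
    """Turn optical flow into a per-pixel motion heatmap in [0, 1].

    flow: array of shape (T, H, W, 2) holding (dx, dy) per pixel per frame.
    Normalizing by each frame's peak magnitude is an assumption of this sketch.
    """
    magnitude = np.linalg.norm(flow, axis=-1)            # (T, H, W)
    peak = magnitude.max(axis=(1, 2), keepdims=True)      # (T, 1, 1)
    return magnitude / (peak + eps)

def motion_weighted_mse(pred, target, heatmap, lam=1.0):
    """Mean squared error with extra weight on high-motion pixels.

    pred, target: arrays of shape (T, H, W, C).
    heatmap: array of shape (T, H, W) with values in [0, 1].
    lam: hypothetical strength of the motion term; lam=0 gives plain MSE.
    """
    weights = 1.0 + lam * heatmap[..., None]              # broadcast over channels
    return float(np.mean(weights * (pred - target) ** 2))
```

With a zero heatmap (a static clip) the function reduces to ordinary mean squared error, while errors in moving regions are penalized more as `lam` grows. In a real training setup the flow would come from an optical flow estimator and the loss would typically be applied to the model's denoising objective rather than to raw pixels.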

Code & Models

Datasets

wang-sj16/TI2V-Bench
dataset · 35 dl

Taxonomy

Topics: Human Motion and Animation · Handwritten Text Recognition Techniques · Video Analysis and Summarization

Methods: Heatmap · ALIGN