RichSpace: Enriching Text-to-Video Prompt Space via Text Embedding Interpolation
Yuefan Cao, Chengyue Gong, Xiaoyu Li, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song

TL;DR
RichSpace introduces an interpolation-based method in text embedding space to improve text-to-video generation, enabling more accurate and complex video outputs by selecting optimal embeddings.
Contribution
The paper presents a novel interpolation technique in text embedding space and a simple algorithm for selecting optimal embeddings to enhance text-to-video generation.
Findings
Improved video generation with complex features.
Effective selection of embeddings via perpendicular foot and cosine similarity.
Enhanced control over generated video content.
Abstract
Text-to-video generation models have made impressive progress, but they still struggle with generating videos with complex features. This limitation often arises from the inability of the text encoder to produce accurate embeddings, which hinders the video generation model. In this work, we propose a novel approach to overcome this challenge by selecting the optimal text embedding through interpolation in the embedding space. We demonstrate that this method enables the video generation model to produce the desired videos. Additionally, we introduce a simple algorithm using perpendicular foot embeddings and cosine similarity to identify the optimal interpolation embedding. Our findings highlight the importance of accurate text embeddings and offer a pathway for improving text-to-video generation performance.
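The abstract names the ingredients of the selection step (interpolation, a perpendicular foot, cosine similarity) but not the exact procedure, so the following is only a minimal sketch under stated assumptions, not the authors' algorithm. It assumes each prompt embedding has been flattened to a single vector, that two prompt embeddings e_a and e_b define the interpolation line, and that a third "anchor" embedding e_c (for example, the embedding of the full target description) is projected onto that line. The function names and the num_steps parameter are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def interpolate(e_a, e_b, num_steps):
    """Return the weights t and the embeddings (1 - t) * e_a + t * e_b."""
    ts = np.linspace(0.0, 1.0, num_steps)
    candidates = np.stack([(1.0 - t) * e_a + t * e_b for t in ts])
    return ts, candidates


def perpendicular_foot(e_a, e_b, e_c):
    """Orthogonal projection of e_c onto the line through e_a and e_b."""
    direction = e_b - e_a
    t = np.dot(e_c - e_a, direction) / np.dot(direction, direction)
    return e_a + t * direction


def select_embedding(e_a, e_b, e_c, num_steps=11):
    """Pick the interpolated embedding most similar to the perpendicular foot.

    Assumption: "optimal" is taken to mean highest cosine similarity to the
    foot of e_c on the line through e_a and e_b.
    """
    ts, candidates = interpolate(e_a, e_b, num_steps)
    foot = perpendicular_foot(e_a, e_b, e_c)
    scores = [cosine_similarity(c, foot) for c in candidates]
    best = int(np.argmax(scores))
    return ts[best], candidates[best]


if __name__ == "__main__":
    # Toy 3-dimensional vectors standing in for real text embeddings.
    e_a = np.array([1.0, 0.0, 0.0])
    e_b = np.array([0.0, 1.0, 0.0])
    e_c = np.array([0.5, 0.5, 1.0])
    t_best, e_best = select_embedding(e_a, e_b, e_c)
    print(t_best, e_best)
```

In a real pipeline the selected embedding would be passed to the video generation model in place of the text encoder's direct output. Real encoder outputs are typically token-by-dimension matrices, so flattening them or interpolating per token is a further design choice that this sketch leaves open.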
Taxonomy
Topics: Natural Language Processing Techniques · Video Analysis and Summarization · Handwritten Text Recognition Techniques
