Respond Beyond Language: A Benchmark for Video Generation in Response to Realistic User Intents
Shuting Wang, Yunqi Liu, Zixin Yang, Ning Hu, Zhicheng Dou, Chenyan Xiong

TL;DR
This paper introduces RealVideoQuest, a new benchmark for evaluating text-to-video models on real-world queries requiring visual responses, revealing current models' limitations and guiding future research.
Contribution
The paper builds a benchmark from real user queries, together with a multi-angle evaluation system, to assess how well T2V models answer visually grounded questions.
Findings
Current T2V models perform poorly on real user queries.
The benchmark comprises 7.5K real user queries with video response intents and 4.5K high-quality query-video pairs.
The paper identifies key challenges and future directions for multimodal AI.
Abstract
Querying generative AI models, e.g., large language models (LLMs), has become a prevalent method for information acquisition. However, existing query-answer datasets primarily focus on textual responses, making it challenging to address complex user queries that require visual demonstrations or explanations for better understanding. To bridge this gap, we construct a benchmark, RealVideoQuest, designed to evaluate the abilities of text-to-video (T2V) models in answering real-world, visually grounded queries. It identifies 7.5K real user queries with video response intents from Chatbot-Arena and builds 4.5K high-quality query-video pairs through a multistage video retrieval and refinement process. We further develop a multi-angle evaluation system to assess the quality of generated video answers. Experiments indicate that current T2V models struggle with effectively addressing real user…
Taxonomy
Topics: Persona Design and Applications · Digital Storytelling and Education · Subtitles and Audiovisual Media
