Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data
Wufei Ma, Kai Li, Zhongshi Jiang, Moustafa Meshry, Qihao Liu, Huiyu, Wang, Christian H\"ane, and Alan Yuille

TL;DR
This paper introduces a new evaluation task and dataset for video-text understanding, revealing current models' limitations and proposing a large language model-based approach to improve their comprehension of actions in videos.
Contribution
The paper proposes RCAD, a novel evaluation task with the Feint6K dataset, and introduces LLM-teacher, a method leveraging large language models to enhance video-text model understanding.
Findings
Current models are easily fooled by counterfactual data.
Models lag behind human performance on RCAD.
LLM-teacher improves action semantics learning.
Abstract
Recent video-text foundation models have demonstrated strong performance on a wide variety of downstream video understanding tasks. Can these video-text models genuinely understand the contents of natural videos? Standard video-text evaluations could be misleading as many questions can be inferred merely from the objects and contexts in a single frame or biases inherent in the datasets. In this paper, we aim to better assess the capabilities of current video-text models and understand their limitations. We propose a novel evaluation task for video-text understanding, namely retrieval from counterfactually augmented data (RCAD), and a new Feint6K dataset. To succeed on our new evaluation task, models must derive a comprehensive understanding of the video from cross-frame reasoning. Analyses show that previous video-text foundation models can be easily fooled by counterfactually augmented…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsVideo Analysis and Summarization · Multimodal Machine Learning Applications · Natural Language Processing Techniques
