Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data

Wufei Ma, Kai Li, Zhongshi Jiang, Moustafa Meshry, Qihao Liu, Huiyu Wang, Christian Häne, and Alan Yuille

arXiv:2407.13094·cs.CV·July 19, 2024

TL;DR

This paper introduces a new evaluation task and dataset for video-text understanding, revealing current models' limitations and proposing a large language model-based approach to improve their comprehension of actions in videos.

Contribution

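The paper proposes retrieval from counterfactually augmented data (RCAD), a novel evaluation task for video-text understanding, together with the new Feint6K dataset. It also introduces LLM-teacher, a method that leverages large language models to improve how video-text models understand actions.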

Findings

01. Current models are easily fooled by counterfactual data.

02. Models lag behind human performance on RCAD.

03. LLM-teacher improves action semantics learning.

Abstract

Recent video-text foundation models have demonstrated strong performance on a wide variety of downstream video understanding tasks. Can these video-text models genuinely understand the contents of natural videos? Standard video-text evaluations could be misleading as many questions can be inferred merely from the objects and contexts in a single frame or biases inherent in the datasets. In this paper, we aim to better assess the capabilities of current video-text models and understand their limitations. We propose a novel evaluation task for video-text understanding, namely retrieval from counterfactually augmented data (RCAD), and a new Feint6K dataset. To succeed on our new evaluation task, models must derive a comprehensive understanding of the video from cross-frame reasoning. Analyses show that previous video-text foundation models can be easily fooled by counterfactually augmented…
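
The retrieval setup described in the abstract can be illustrated with a minimal sketch. It assumes, hypothetically, that each video and each caption has already been encoded as a vector, that every video comes with one true caption plus counterfactual alternatives, and that a video counts as correct when the true caption has the highest cosine similarity. The function names `cosine`, `retrieve`, and `rcad_accuracy`, the embedding-based scoring, and the accuracy metric are illustrative assumptions, not the paper's protocol.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(video_emb: Vector, caption_embs: List[Vector]) -> int:
    """Return the index of the caption most similar to the video."""
    scores = [cosine(video_emb, c) for c in caption_embs]
    return max(range(len(scores)), key=scores.__getitem__)


def rcad_accuracy(samples: List[Tuple[Vector, List[Vector], int]]) -> float:
    """Fraction of videos whose true caption is ranked first.

    Each sample is (video embedding, candidate caption embeddings,
    index of the true caption). This layout is an assumption made
    for illustration only.
    """
    correct = sum(
        1 for video, captions, true_idx in samples
        if retrieve(video, captions) == true_idx
    )
    return correct / len(samples)


if __name__ == "__main__":
    # Toy data: two videos, each with one true and one counterfactual caption.
    samples = [
        ([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 0),  # true caption wins
        ([0.0, 1.0], [[1.0, 0.0], [0.0, 1.0]], 0),  # counterfactual wins
    ]
    print(rcad_accuracy(samples))
```

In this toy run the model is fooled on the second video, so only one of the two retrievals is correct. A model that relies on objects and context from a single frame would tend to score the true and counterfactual captions similarly, which is the failure mode the task is meant to expose.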

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Natural Language Processing Techniques