Temporal Reasoning Transfer from Text to Video

Lei Li; Yuanxin Liu; Linli Yao; Peiyuan Zhang; Chenxin An; Lean Wang; Xu Sun; Lingpeng Kong; Qi Liu

arXiv:2410.06166·cs.CV·October 10, 2024

TL;DR

This paper identifies the core challenge in video temporal reasoning as stemming from language model limitations and introduces a text-based transfer method that significantly improves video understanding without using video data.

Contribution

The paper proposes T3, a novel text-based training approach that enhances video temporal reasoning by transferring knowledge from textual tasks, bypassing the need for video data.

Findings

01

T3 improves LongVA-7B's accuracy by 5.3 points on TempCompass.

02

The T3-enhanced model outperforms some models trained on extensive video data.

03

Strong correlation found between textual and video temporal reasoning performance.

Abstract

Video Large Language Models (Video LLMs) have shown promising capabilities in video comprehension, yet they struggle with tracking temporal changes and reasoning about temporal relationships. While previous research attributed this limitation to the ineffective temporal encoding of visual inputs, our diagnostic study reveals that video representations contain sufficient information for even small probing classifiers to achieve perfect accuracy. Surprisingly, we find that the key bottleneck in Video LLMs' temporal reasoning capability stems from the underlying LLM's inherent difficulty with temporal concepts, as evidenced by poor performance on textual temporal question-answering tasks. Building on this discovery, we introduce the Textual Temporal reasoning Transfer (T3). T3 synthesizes diverse temporal reasoning tasks in pure text format from existing image-text datasets, addressing the…
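To make the idea of synthesizing a temporal reasoning task in pure text from image-text data more concrete, here is a minimal sketch. It is not the authors' pipeline: the function name `make_order_question`, the question template, and the sample captions are all hypothetical, and it assumes each image caption can stand in for one video frame and that the captions are distinct.

```python
import random


def make_order_question(captions, seed=0):
    """Build one text-only temporal-order question from image captions.

    Hypothetical illustration: each caption plays the role of one frame,
    and the sequence of captions plays the role of a video.
    """
    if len(captions) < 2:
        raise ValueError("need at least two captions")
    rng = random.Random(seed)

    # Arrange the captions in a random order to form a pseudo-video.
    frames = list(captions)
    rng.shuffle(frames)

    # Pick two distinct positions; after sorting, frames[i] occurs first.
    i, j = sorted(rng.sample(range(len(frames)), 2))
    first, second = frames[i], frames[j]

    # Randomize which of the two events is listed as option A.
    options = [first, second]
    rng.shuffle(options)
    answer = "A" if options[0] == first else "B"

    context = " ".join(f"Frame {k + 1}: {c}" for k, c in enumerate(frames))
    question = (
        f"{context}\nWhich event happens first?\n"
        f"A. {options[0]}\nB. {options[1]}"
    )
    return {
        "question": question,
        "answer": answer,
        "frames": frames,
        "options": options,
    }


if __name__ == "__main__":
    sample = [
        "A dog runs on a beach.",
        "A man cooks pasta.",
        "A child opens a gift.",
    ]
    qa = make_order_question(sample, seed=1)
    print(qa["question"])
    print("Answer:", qa["answer"])
```

Because the answer is derived from the frame positions rather than from any visual input, questions like this can be generated at scale from caption data alone; the other temporal aspects would need their own templates.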

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

- Interesting approach to tackle temporal understanding of Video LLMs, that the LLM fails to understand the temporal behavior of the text prompts.
- Utilizing only text data proposes an efficient and scalable framework, although the gains are questionable for larger models.
- The performance gains are quite significant

Weaknesses

- The biggest limitation of this work lies in the scope of the temporal concepts, but not simply because they were not covered in the paper. For example, consider rotation, direction, counting the number of a certain event, and relative velocity of an object relative to another object. There are some temporal concepts that cannot be represented in discrete sentences - for example, how would you describe the number of rotations of a diver in the form of text? How would you describe that one perso

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. Show that visual features contain necessary information for order understanding task.
2. Writing is clear.

Weaknesses

1. Why choose LongVA? LongVA is trained only on images and text, so improving LongVA and then evaluating it on video understanding is not appropriate, because everything is done in a zero-shot manner. Instead, choosing backbones like LLaVA-onevision, phi-3.5-vision, or Qwen2-VL would be appropriate, since those models have videos in their training data. If the authors can still show a clear improvement with their text-only ordering data, then the claim can be supported.
2. Probing visual features is unfair.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1) In terms of significance, this paper addresses the problem of video and language reasoning in multimodal Large Language Models (MLLMs) from a new perspective. It attributes the limitation of existing MLLMs to the LLM component instead of the video or adaptor functions. This is a very interesting perspective since it proposes an approach that improves the LM's ability to understand temporal concepts through text. The proposed Textual Temporal Reasoning Transfer (T3) approach creatively uses

Weaknesses

1) The text-based tasks used in T3 are largely devised based on templates and may not always be similar to the language concepts that are contained in real-world videos. For example, such templated sentences may lack the subtle temporal cues that naturally occur in descriptions of complex events. This might affect the model's ability to generalize to different downstream tasks.
2) In this work, the authors only address four temporal aspects: order, attribute change, temporal referring, and grou

Videos

Temporal Reasoning Transfer from Text to Video · slideslive

Taxonomy

Topics: Natural Language Processing Techniques · Video Analysis and Summarization · Multimodal Machine Learning Applications