FocusChat: Text-guided Long Video Understanding via Spatiotemporal Information Filtering

Zheng Cheng; Rendong Wang; Zhicheng Wang

arXiv:2412.12833 · cs.CV · May 29, 2025

TL;DR

FocusChat is a text-guided multi-modal model for long video understanding that filters visual information to reduce noise and computation, achieving high performance with minimal visual tokens and training data.

Contribution

The paper introduces a spatial-temporal filtering module that aligns visual tokens with the user's prompt, improving efficiency and accuracy in long video analysis.

Findings

01

Outperforms Video-LLaMA in zero-shot tasks with fewer visual tokens

02

Achieves state-of-the-art results in few-shot experiments with minimal pre-training data

03

Reduces computational load by focusing on relevant visual information

Abstract

Recently, multi-modal large language models have made significant progress. However, visual information that is not guided by the user's intention may lead to redundant computation and introduce unnecessary visual noise, especially in long, untrimmed videos. To address this issue, we propose FocusChat, a text-guided multi-modal large language model (LLM) that emphasizes visual information correlated to the user's prompt. In detail, our model first applies a semantic extraction module, which comprises a visual semantic branch and a text semantic branch that extract image and text semantics, respectively. The two branches are combined using the Spatial-Temporal Filtering Module (STFM). STFM enables explicit spatial-level information filtering and implicit temporal-level feature filtering, ensuring that the visual tokens are closely aligned with the user's query. It lowers the essential…
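
To make the idea of text-guided spatial filtering concrete, here is a minimal sketch. It is not the paper's STFM: the abstract does not specify the mechanism, so this assumes a simple scheme in which each frame's visual tokens are scored by cosine similarity to a single text embedding and only the top-scoring tokens are kept. The names `cosine`, `filter_tokens`, and `keep` are hypothetical, the embeddings are toy values, and the implicit temporal-level filtering is not modeled at all.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def filter_tokens(frame_tokens, text_embedding, keep):
    """For each frame, keep the `keep` tokens most similar to the text embedding.

    Assumption for illustration only: relevance is plain cosine similarity.
    Surviving tokens stay in their original spatial order.
    """
    filtered = []
    for tokens in frame_tokens:
        scores = [cosine(t, text_embedding) for t in tokens]
        top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
        filtered.append([tokens[i] for i in sorted(top)])
    return filtered

# Toy example: 2 frames, 3 visual tokens each, 2-dimensional embeddings.
text = [1.0, 0.0]
video = [
    [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7]],
    [[-1.0, 0.0], [0.5, 0.5], [1.0, 0.2]],
]
kept = filter_tokens(video, text, keep=2)
print(kept)
```

With `keep=2`, each frame passes two of its three tokens downstream, so the language model sees fewer visual tokens while keeping those most related to the query, which is the kind of reduction the abstract describes.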

Taxonomy

Topics: Video Analysis and Summarization · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques