FocusChat: Text-guided Long Video Understanding via Spatiotemporal Information Filtering
Zheng Cheng, Rendong Wang, Zhicheng Wang

TL;DR
FocusChat is a text-guided multi-modal model for long video understanding that filters visual information to reduce noise and computation, achieving high performance with minimal visual tokens and training data.
Contribution
It introduces a novel spatial-temporal filtering module that aligns visual tokens with user prompts, improving efficiency and accuracy in long video analysis.
Findings
Outperforms Video-LLaMA in zero-shot tasks with fewer visual tokens
Achieves state-of-the-art results in few-shot experiments with minimal pre-training data
Reduces computational load by focusing on relevant visual information
Abstract
Recently, multi-modal large language models have made significant progress. However, visual information that lacks guidance from the user's intention may lead to redundant computation and introduce unnecessary visual noise, especially in long, untrimmed videos. To address this issue, we propose FocusChat, a text-guided multi-modal large language model (LLM) that emphasizes visual information correlated to the user's prompt. In detail, our model first passes its inputs through a semantic extraction module, which comprises a visual semantic branch and a text semantic branch to extract image and text semantics, respectively. The two branches are combined using the Spatial-Temporal Filtering Module (STFM). STFM enables explicit spatial-level information filtering and implicit temporal-level feature filtering, ensuring that the visual tokens are closely aligned with the user's query. It lowers the essential…
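To make the idea of text-guided spatial filtering concrete, the following is a minimal sketch of one generic way to do it: score each patch token against a text embedding by cosine similarity and keep the top-k patches per frame. This is not the paper's STFM; the function name `filter_tokens_by_text`, the parameter `keep_per_frame`, and the use of a single pooled text vector are all assumptions made for illustration, and the implicit temporal filtering mentioned in the abstract is not modeled here.

```python
import numpy as np


def filter_tokens_by_text(visual_tokens, text_embedding, keep_per_frame):
    """Keep the patch tokens in each frame that best match a text embedding.

    Hypothetical helper for illustration only (not the paper's STFM).

    visual_tokens:  array of shape (T, N, D), T frames with N patch tokens each
    text_embedding: array of shape (D,), one pooled vector for the prompt
    keep_per_frame: number of patch tokens to keep in every frame

    Returns (kept_tokens, kept_indices) with shapes
    (T, keep_per_frame, D) and (T, keep_per_frame).
    """
    eps = 1e-8  # avoids division by zero for all-zero vectors

    # L2-normalize so the dot product below is a cosine similarity.
    v = visual_tokens / (np.linalg.norm(visual_tokens, axis=-1, keepdims=True) + eps)
    t = text_embedding / (np.linalg.norm(text_embedding) + eps)

    # Similarity of every patch token to the prompt: shape (T, N).
    scores = v @ t

    # Indices of the highest-scoring patches in each frame.
    top_idx = np.argsort(-scores, axis=1)[:, :keep_per_frame]

    # Restore the original patch order so spatial layout is preserved.
    top_idx = np.sort(top_idx, axis=1)

    # Gather the selected tokens: shape (T, keep_per_frame, D).
    kept = np.take_along_axis(visual_tokens, top_idx[:, :, None], axis=1)
    return kept, top_idx
```

With, say, 8 frames of 196 patch tokens each and `keep_per_frame=16`, the language model would receive 128 visual tokens instead of 1568. These numbers are illustrative and are not taken from the paper.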
Taxonomy
Topics: Video Analysis and Summarization · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques
