Goldfish: Vision-Language Understanding of Arbitrarily Long Videos

Kirolos Ataallah; Xiaoqian Shen; Eslam Abdelrahman; Essam Sleiman; Mingchen Zhuge; Jian Ding; Deyao Zhu; Jürgen Schmidhuber; Mohamed Elhoseiny

arXiv:2407.12679·cs.CV·July 18, 2024

TL;DR

Goldfish introduces an efficient method for understanding arbitrarily long videos using a retrieval mechanism and a new long-video benchmark, significantly improving accuracy over previous models in both long and short video comprehension.

Contribution

The paper presents Goldfish, a retrieval-based approach to long-video understanding, and introduces the TVQA-long benchmark for evaluating such models.

Findings

01. Achieved 41.78% accuracy on TVQA-long, surpassing previous methods by 14.94%.

02. MiniGPT4-Video exceeds state-of-the-art results on short-video benchmarks.

03. Demonstrated significant improvements in both long and short video understanding tasks.

Abstract

Most current LLM-based models for video understanding can process videos within minutes. However, they struggle with lengthy videos due to challenges such as "noise and redundancy", as well as "memory and computation" constraints. In this paper, we present Goldfish, a methodology tailored for comprehending videos of arbitrary lengths. We also introduce the TVQA-long benchmark, specifically designed to evaluate models' capabilities in understanding long videos with questions about both visual and textual content. Goldfish approaches these challenges with an efficient retrieval mechanism that first gathers the top-k video clips relevant to the instruction before proceeding to provide the desired response. This retrieval design enables Goldfish to efficiently process arbitrarily long video sequences, facilitating its application in contexts such as movies or television…
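The retrieval step described in the abstract (gather the top-k clips relevant to the instruction, then answer from those clips only) can be illustrated with a minimal sketch. The abstract does not say how clips or instructions are represented or scored, so everything here is an assumption made for illustration: clips and the instruction are taken to be already embedded as plain float vectors, relevance is taken to be cosine similarity, and the names `cosine_similarity` and `retrieve_top_k` are hypothetical and not taken from the Goldfish code.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query_embedding: Sequence[float],
    clip_embeddings: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (clip_index, score) for the k clips most similar to the query.

    Assumption: relevance is cosine similarity between precomputed embeddings.
    The paper's actual scoring function may differ.
    """
    if k <= 0:
        return []
    scored = [
        (index, cosine_similarity(query_embedding, embedding))
        for index, embedding in enumerate(clip_embeddings)
    ]
    # Highest score first; Python's sort is stable, so ties keep clip order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data: four "clips" and one "instruction", all hypothetical 2-D embeddings.
clips = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
query = [1.0, 0.1]

top = retrieve_top_k(query, clips, k=2)
top_indices = [index for index, _ in top]
print(top_indices)  # [0, 2]
```

Under these assumptions, only the selected clips would be passed on to the answering model, so the cost of the answering step depends on k rather than on the total length of the video.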

Code & Models

Repositories

Vision-CAIR/MiniGPT4-video
pytorch

Datasets

Vision-CAIR/TVQA-Long
dataset · 101 dl

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques