Goldfish: Vision-Language Understanding of Arbitrarily Long Videos
Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Mingchen Zhuge, Jian Ding, Deyao Zhu, Jürgen Schmidhuber, Mohamed Elhoseiny

TL;DR
Goldfish introduces an efficient method for understanding arbitrarily long videos using a retrieval mechanism and a new long-video benchmark, significantly improving accuracy over previous models in both long and short video comprehension.
Contribution
The paper presents Goldfish, a novel approach with a retrieval-based mechanism for long video understanding and introduces the TVQA-long benchmark for evaluating such models.
Findings
Achieved 41.78% accuracy on TVQA-long, surpassing previous methods by 14.94%.
MiniGPT4-Video, the short-video model that Goldfish builds on, exceeds previous state-of-the-art results on short video benchmarks.
Demonstrated significant improvements in both long and short video understanding tasks.
Abstract
Most current LLM-based models for video understanding can process videos within minutes. However, they struggle with lengthy videos due to challenges such as "noise and redundancy", as well as "memory and computation" constraints. In this paper, we present Goldfish, a methodology tailored for comprehending videos of arbitrary lengths. We also introduce the TVQA-long benchmark, specifically designed to evaluate models' capabilities in understanding long videos with questions in both vision and text content. Goldfish approaches these challenges with an efficient retrieval mechanism that first gathers the top-k video clips relevant to the instruction before providing the desired response. This retrieval design enables Goldfish to efficiently process arbitrarily long video sequences, facilitating its application in contexts such as movies or television…
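The top-k retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes each clip and the instruction have already been mapped to fixed-length embedding vectors and that relevance is scored by cosine similarity; the abstract specifies neither, so the scoring function, the names `cosine_similarity` and `retrieve_top_k`, and the toy vectors are all hypothetical and not the authors' implementation.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query_emb: Sequence[float],
    clip_embs: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (clip_index, score) for the k clips most similar to the query.

    Assumption: similarity in embedding space stands in for "relevance to
    the instruction"; the source does not say how relevance is computed.
    """
    scored = [
        (i, cosine_similarity(query_emb, emb)) for i, emb in enumerate(clip_embs)
    ]
    # Highest similarity first; Python's sort is stable, so ties keep clip order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-D embeddings standing in for four clips of a long video (made up).
clip_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
instruction_embedding = [1.0, 0.1]

top_clips = retrieve_top_k(instruction_embedding, clip_embeddings, k=2)
top_indices = [index for index, _ in top_clips]
print(top_indices)
```

Only the selected clips would then be passed on to the answering model, which is what keeps the cost of the answering step independent of the total video length in this sketch.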
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques
