TL;DR
This paper introduces stacked temporal attention modules within the vision encoder of Video-LLMs, significantly improving their ability to understand complex temporal dynamics in videos and outperforming existing models on key benchmarks.
Contribution
The novel architecture integrates stacked temporal attention into vision encoders, enhancing temporal reasoning in Video-LLMs for better action sequence comprehension.
Findings
Improved performance on the VITATECS, MVBench, and Video-MME benchmarks, by up to +5.5%.
Enhanced temporal reasoning in video question answering tasks.
Addresses a critical gap in video understanding for Video-LLMs.
Abstract
Despite significant advances in Multimodal Large Language Models (MLLMs), understanding complex temporal dynamics in videos remains a major challenge. Our experiments show that current Video Large Language Model (Video-LLM) architectures have critical limitations in temporal understanding, struggling with tasks that require detailed comprehension of action sequences and temporal progression. In this work, we propose a Video-LLM architecture that introduces stacked temporal attention modules directly within the vision encoder. This design incorporates temporal attention in the vision encoder, enabling the model to better capture the progression of actions and the relationships between frames before passing visual tokens to the LLM. Our results show that this approach significantly improves temporal reasoning and outperforms existing models in video question answering tasks, specifically in…
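To make the idea of attention along the time axis concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the module's internals or where it sits in the encoder. The function names, the single-head formulation, the residual connection, the random initialization, and the (frames, patches, channels) tensor layout are all assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_attention(tokens, w_q, w_k, w_v, w_o):
    """Single-head self-attention across frames (hypothetical helper).

    tokens: array of shape (T, P, D) with T frames, P patches per frame,
    and D channels. Each patch position attends over its own T frames.
    """
    # Treat patches as the batch axis so attention mixes frames, not patches.
    x = tokens.transpose(1, 0, 2)                              # (P, T, D)
    q, k, v = x @ w_q, x @ w_k, x @ w_v                        # (P, T, D) each
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(x.shape[-1])   # (P, T, T)
    attn = softmax(scores, axis=-1)
    out = (attn @ v) @ w_o                                     # (P, T, D)
    # Residual connection (an assumption) keeps the original spatial features.
    return tokens + out.transpose(1, 0, 2)                     # (T, P, D)


def init_layers(num_layers, dim, seed=0):
    # Random projection weights, for illustration only.
    rng = np.random.default_rng(seed)
    layers = []
    for _ in range(num_layers):
        layers.append(tuple(rng.normal(0.0, dim ** -0.5, (dim, dim))
                            for _ in range(4)))
    return layers


def stacked_temporal_attention(tokens, layers):
    # "Stacked": apply several temporal attention modules in sequence.
    for w_q, w_k, w_v, w_o in layers:
        tokens = temporal_attention(tokens, w_q, w_k, w_v, w_o)
    return tokens


if __name__ == "__main__":
    frames = np.random.default_rng(1).normal(size=(4, 3, 8))  # T=4, P=3, D=8
    out = stacked_temporal_attention(frames, init_layers(2, 8))
    print(out.shape)
```

In this sketch the output keeps the input shape, so the frame tokens could still be handed to a downstream LLM projector unchanged in layout. Each patch position exchanges information only with the same position in other frames, which is one simple way to let visual tokens carry cross-frame context before they reach the language model.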