Look, Remember and Reason: Grounded reasoning in videos with language models
Apratim Bhattacharyya, Sunny Panchal, Mingu Lee, Reza Pourreza, Pulkit Madan, Roland Memisevic

TL;DR
This paper introduces a grounded reasoning framework for videos using language models trained on low-level visual tasks, enabling better causal and compositional reasoning in videos.
Contribution
It proposes an end-to-end training approach that combines low-level visual skills with language models to improve video reasoning capabilities.
Findings
Outperforms state-of-the-art methods on multiple datasets
Effective in causal and compositional spatiotemporal reasoning
Utilizes a two-stream video encoder with spatiotemporal attention
Abstract
Multi-modal language models (LM) have recently shown promising performance in high-level reasoning tasks on videos. However, existing methods still fall short in tasks like causal or compositional spatiotemporal reasoning over actions, in which model predictions need to be grounded in fine-grained low-level details, such as object motions and object interactions. In this work, we propose training an LM end-to-end on low-level surrogate tasks, including object detection, re-identification, and tracking, to endow the model with the required low-level visual capabilities. We show that a two-stream video encoder with spatiotemporal attention is effective at capturing the required static and motion-based cues in the video. By leveraging the LM's ability to perform the low-level surrogate tasks, we can cast reasoning in videos as the three-step process of Look, Remember, Reason wherein visual…
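To make the encoder idea in the abstract concrete, the following is a minimal sketch of how a two-stream layout with spatiotemporal attention could separate static and motion-based cues. It is an illustration under stated assumptions, not the authors' implementation: the function names (`spatial_stream`, `temporal_stream`, `two_stream_encode`), the single-head attention with identity projections, and the additive fusion of the two streams are all hypothetical choices made for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Attends over the second-to-last axis. Query, key, and value
    projections are the identity here (an assumption for brevity).
    """
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens


def spatial_stream(video_tokens):
    """Static cues: patches attend to other patches in the same frame.

    video_tokens has shape (T, N, D): T frames, N patches, D channels.
    """
    return self_attention(video_tokens)


def temporal_stream(video_tokens):
    """Motion cues: each patch position attends across all frames."""
    per_patch = np.swapaxes(video_tokens, 0, 1)      # (N, T, D)
    attended = self_attention(per_patch)             # (N, T, D)
    return np.swapaxes(attended, 0, 1)               # (T, N, D)


def two_stream_encode(video_tokens):
    """Fuse both streams with a residual sum (hypothetical fusion rule)."""
    return video_tokens + spatial_stream(video_tokens) + temporal_stream(video_tokens)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(4, 6, 8))   # 4 frames, 6 patches, 8 channels
    print(two_stream_encode(video).shape)
```

One property worth noting in this sketch: for a video whose frames are all identical, the temporal stream returns its input unchanged, because attention over identical tokens averages them back to the same token. The temporal stream therefore only contributes new information when something moves between frames.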
Taxonomy
Topics: Speech and dialogue systems · Qualitative Research Methods and Applications · Educational Tools and Methods
