Look, Remember and Reason: Grounded reasoning in videos with language models

Apratim Bhattacharyya; Sunny Panchal; Mingu Lee; Reza Pourreza; Pulkit Madan; Roland Memisevic

arXiv:2306.17778 · cs.CV · January 23, 2024

TL;DR

This paper trains a language model end-to-end on low-level surrogate tasks (object detection, re-identification, and tracking) so that its predictions are grounded in fine-grained visual detail, with the aim of improving causal and compositional spatiotemporal reasoning in videos.

Contribution

It proposes an end-to-end training approach that combines low-level visual skills with language models to improve video reasoning capabilities.

Findings

01. Outperforms state-of-the-art methods on multiple datasets

02. Effective in causal and compositional spatiotemporal reasoning

03. Utilizes a two-stream video encoder with spatiotemporal attention

Abstract

Multi-modal language models (LM) have recently shown promising performance in high-level reasoning tasks on videos. However, existing methods still fall short in tasks like causal or compositional spatiotemporal reasoning over actions, in which model predictions need to be grounded in fine-grained low-level details, such as object motions and object interactions. In this work, we propose training an LM end-to-end on low-level surrogate tasks, including object detection, re-identification, and tracking, to endow the model with the required low-level visual capabilities. We show that a two-stream video encoder with spatiotemporal attention is effective at capturing the required static and motion-based cues in the video. By leveraging the LM's ability to perform the low-level surrogate tasks, we can cast reasoning in videos as the three-step process of Look, Remember, Reason wherein visual…

Videos

Look, Remember and Reason: Grounded Reasoning in Videos with Language Models · SlidesLive

Taxonomy

Topics: Speech and dialogue systems · Qualitative Research Methods and Applications · Educational Tools and Methods