Towards Zero-Shot & Explainable Video Description by Reasoning over Graphs of Events in Space and Time

Mihai Masala; Marius Leordeanu

arXiv:2501.08460 · cs.CV · January 16, 2025

Open Access

TL;DR

This paper introduces an explainable, reasoning-based approach over spatiotemporal event graphs to improve zero-shot video description, bridging vision and language understanding with enhanced interpretability.

Contribution

It proposes a novel graph-based reasoning framework that connects vision and language models for explainable zero-shot video captioning.

Findings

01. Generates coherent and relevant video descriptions across datasets.

02. Achieves competitive results using standard metrics and LLM evaluation.

03. Enhances interpretability of video captioning models.

Abstract

In the current era of Machine Learning, Transformers have become the de facto approach across a variety of domains, such as computer vision and natural language processing. Transformer-based solutions are the backbone of current state-of-the-art methods for language generation, image and video classification, segmentation, and action and object recognition, among many others. While these state-of-the-art methods produce impressive results in their respective domains, the problem of understanding the relationship between vision and language is still beyond our reach. In this work, we propose a common ground between vision and language based on events in space and time, built in an explainable and programmatic way, to connect learning-based state-of-the-art vision and language models and provide a solution to the long-standing problem of describing videos in natural language.…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization