Towards Zero-Shot & Explainable Video Description by Reasoning over Graphs of Events in Space and Time
Mihai Masala, Marius Leordeanu

TL;DR
This paper introduces an explainable, reasoning-based approach that uses spatiotemporal event graphs to bridge vision and language models for zero-shot video description.
Contribution
It proposes a novel graph-based reasoning framework that connects vision and language models for explainable zero-shot video captioning.
Findings
Generates coherent and relevant video descriptions across datasets.
Achieves competitive results using standard metrics and LLM evaluation.
Enhances interpretability of video captioning models.
Abstract
In the current era of Machine Learning, Transformers have become the de facto approach across a variety of domains, such as computer vision and natural language processing. Transformer-based solutions are the backbone of current state-of-the-art methods for language generation, image and video classification, segmentation, action and object recognition, among many others. Interestingly enough, while these state-of-the-art methods produce impressive results in their respective domains, the problem of understanding the relationship between vision and language is still beyond our reach. In this work, we propose a common ground between vision and language based on events in space and time in an explainable and programmatic way, to connect learning-based vision and language state-of-the-art models and provide a solution to the long-standing problem of describing videos in natural language.…
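The idea of a shared, programmatic representation built from events in space and time can be pictured with a small sketch. This is not the authors' implementation: the `Event` class, the edge labels, the helper functions, and the rule that links events sharing an entity are all hypothetical choices made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Event:
    """Hypothetical graph node: who did what (to what), over which frames."""
    subject: str
    action: str
    start: int                    # first frame of the event
    end: int                      # last frame of the event (inclusive)
    target: Optional[str] = None  # object acted upon, if any


def temporal_relation(a: Event, b: Event) -> str:
    """Coarse temporal relation from a to b, used as an edge label."""
    if a.end < b.start:
        return "before"
    if b.end < a.start:
        return "after"
    return "overlaps"


def build_edges(events: List[Event]) -> List[Tuple[int, int, str]]:
    """Link each pair of events that share an entity (an assumed rule)."""
    edges = []
    for i, a in enumerate(events):
        for j in range(i + 1, len(events)):
            b = events[j]
            shared = ({a.subject, a.target} & {b.subject, b.target}) - {None}
            if shared:
                edges.append((i, j, temporal_relation(a, b)))
    return edges


def to_proto_text(events: List[Event]) -> str:
    """Serialize events in temporal order into a plain, inspectable string."""
    ordered = sorted(events, key=lambda e: (e.start, e.end))
    phrases = []
    for e in ordered:
        phrase = f"{e.subject} {e.action}"
        if e.target:
            phrase += f" {e.target}"
        phrases.append(f"{phrase} (frames {e.start}-{e.end})")
    return "; ".join(phrases)


# Toy events, as a vision model might hypothetically emit them.
events = [
    Event("person", "picks up", 0, 40, "cup"),
    Event("person", "drinks from", 35, 90, "cup"),
    Event("dog", "runs", 100, 150),
]
edges = build_edges(events)
proto_text = to_proto_text(events)
print(edges)
print(proto_text)
```

Because every node and edge is an explicit, human-readable record, each phrase in the serialized text can be traced back to the event that produced it. A string like `proto_text` could in principle be handed to a language model for rewording into fluent prose, though that step is outside this sketch.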
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
