From Vision To Language through Graph of Events in Space and Time: An Explainable Self-supervised Approach

Mihai Masala; Marius Leordeanu

arXiv:2507.04815·cs.CV·July 8, 2025

TL;DR

This paper introduces an explainable, graph-based representation of events in space and time to improve video captioning, enabling better understanding and self-supervised training of models for generating detailed natural language descriptions.

Contribution

The paper proposes an explainable, graph-based approach for connecting vision and language, and demonstrates that it is effective as a self-supervised teacher for training video captioning models.

Findings

01. The method generates coherent and relevant descriptions across multiple datasets.

02. The approach is validated with standard metrics, human annotations, and ensemble consensus.

03. The explainable system enables self-supervised training of neural models.

Abstract

The task of describing video content in natural language is commonly referred to as video captioning. Unlike conventional video captions, which are typically brief and widely available, long-form paragraph descriptions in natural language are scarce. This limitation of current datasets is due to the expensive human manual annotation required and to the highly challenging task of explaining the language formation process from the perspective of the underlying story, as a complex system of interconnected events in space and time. Through a thorough analysis of recently published methods and available datasets, we identify a general lack of published resources dedicated to the problem of describing videos in complex language, beyond the level of descriptions in the form of enumerations of simple captions. Furthermore, while state-of-the-art methods produce impressive results on the task of…
