Natural Language Generation from Visual Events: State-of-the-Art and Key Open Questions

Aditya K Surikuchi; Raquel Fernández; Sandro Pezzelle

arXiv:2502.13034 · cs.CL · August 21, 2025

Open Access

TL;DR

This paper reviews the state of the art in natural language generation from visual events, emphasizing the importance of modeling interactions between visual sequences and language, and discusses open challenges and future research directions.

Contribution

It presents a unified perspective on various visual-language tasks, surveys recent approaches, and highlights key open questions in modeling visual events for language generation.

Findings

01. Multiple tasks are unified under the broader problem of modeling visual event-language interactions.

02. Current approaches face common challenges in understanding temporal and multimodal relationships.

03. The paper identifies open research questions and suggests future directions for the field.

Abstract

In recent years, a substantial body of work in visually grounded natural language processing has focused on real-life multimodal scenarios such as describing content depicted in images or videos. However, comparatively less attention has been devoted to studying the nature and degree of interaction between the different modalities in these scenarios. In this paper, we argue that any task dealing with natural language generation from sequences of images or frames is an instance of the broader, more general problem of modeling the intricate relationships between visual events unfolding over time and the features of the language used to interpret, describe, or narrate them. Therefore, solving these tasks requires models to be capable of identifying and managing such intricacies. We consider five seemingly different tasks, which we argue are compelling instances of this broader multimodal…

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training