Where is the multimodal goal post? On the Ability of Foundation Models to Recognize Contextually Important Moments

Aditya K Surikuchi; Raquel Fernández; Sandro Pezzelle

arXiv:2601.16333 · cs.CV · March 6, 2026

TL;DR

This paper evaluates foundation models' ability to identify key moments in football videos, revealing current models' limitations and emphasizing the need for modular architectures and improved training for better multimodal understanding.

Contribution

The study introduces a new dataset based on football highlight reels and systematically assesses models' ability to recognize important sub-events, highlighting their reliance on single modalities.
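
As a rough illustration of how importance labels might be derived from highlight reels without extra annotation, the sketch below marks a sub-event as important when enough of its time span is covered by highlight intervals. This is a minimal sketch under stated assumptions, not the authors' procedure: the interval representation, the `label_sub_events` helper, and the `min_fraction` threshold are all hypothetical, and the sketch assumes each highlight clip has already been aligned to timestamps in the full game video.

```python
from typing import List, Tuple

# Hypothetical representation: (start_sec, end_sec) on the full-game timeline.
Interval = Tuple[float, float]


def overlap(a: Interval, b: Interval) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def label_sub_events(
    sub_events: List[Interval],
    highlights: List[Interval],
    min_fraction: float = 0.5,  # assumed threshold, chosen only for illustration
) -> List[bool]:
    """Mark a sub-event as important if at least `min_fraction` of its duration
    falls inside the highlight intervals.

    Assumes the highlight intervals do not overlap one another; otherwise the
    covered time would be counted more than once.
    """
    labels = []
    for ev in sub_events:
        duration = ev[1] - ev[0]
        covered = sum(overlap(ev, h) for h in highlights)
        labels.append(duration > 0 and covered / duration >= min_fraction)
    return labels


# Toy example: three 10-second sub-events and one highlight clip.
sub_events = [(0.0, 10.0), (10.0, 20.0), (20.0, 30.0)]
highlights = [(12.0, 19.0)]
labels = label_sub_events(sub_events, highlights)
print(labels)
```

In this toy example only the middle sub-event is labeled important, since 7 of its 10 seconds appear in the highlight clip. A real pipeline would also need to handle the alignment step itself, which this sketch takes as given.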

Findings

01 Models perform near chance level in importance recognition.

02 Current models tend to depend on dominant single modalities.

03 Modular architectures and enhanced training are needed for better multimodal understanding.

Abstract

Foundation models are used for many real-world applications involving language generation from temporally-ordered multimodal events. In this work, we study the ability of models to identify the most important sub-events in a video, which is a fundamental prerequisite for narrating or summarizing multimodal events. Specifically, we focus on football games and evaluate models on their ability to distinguish between important and non-important sub-events in a game. To this end, we construct a new dataset by leveraging human preferences for importance implicit in football game highlight reels, without any additional annotation costs. Using our dataset, we compare several state-of-the-art multimodal models and show that they are not far from chance level performance. Analyses of models beyond standard evaluation metrics reveal their tendency to rely on a single dominant modality and their…

Code & Models

Datasets

akskuchi/MOMENTS
dataset· 11 dl

Taxonomy

Topics: Artificial Intelligence in Games · Video Analysis and Summarization · Multimodal Machine Learning Applications