Where is the multimodal goal post? On the Ability of Foundation Models to Recognize Contextually Important Moments
Aditya K Surikuchi, Raquel Fernández, Sandro Pezzelle

TL;DR
This paper evaluates foundation models' ability to identify key moments in football videos. Current models perform close to chance and tend to rely on a single dominant modality, which points to the need for modular architectures and improved training for better multimodal understanding.
Contribution
The study introduces a new dataset based on football highlight reels and systematically assesses models' ability to recognize important sub-events, highlighting their reliance on single modalities.
Findings
Models perform near chance level in importance recognition.
Current models tend to depend on dominant single modalities.
Modular architectures and enhanced training are needed for better multimodal understanding.
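To make the near-chance finding concrete, the following minimal sketch compares a binary importance classifier against a majority-class baseline. It is illustrative only: the function names, the toy labels, and the choice of accuracy as the metric are assumptions, not the paper's evaluation protocol or data.

```python
from collections import Counter


def accuracy(preds, labels):
    """Fraction of sub-events whose predicted importance matches the label."""
    if len(preds) != len(labels) or not labels:
        raise ValueError("preds and labels must be non-empty and equal in length")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    if not labels:
        raise ValueError("labels must be non-empty")
    return Counter(labels).most_common(1)[0][1] / len(labels)


# Hypothetical toy data: 1 = important sub-event, 0 = non-important.
labels = [1, 0, 1, 0, 1, 0, 0, 1]
preds = [1, 1, 0, 0, 1, 0, 1, 1]

model_acc = accuracy(preds, labels)
chance_acc = majority_baseline(labels)
print(f"model={model_acc:.3f} baseline={chance_acc:.3f} margin={model_acc - chance_acc:.3f}")
```

Reporting the margin over the baseline, rather than raw accuracy, keeps the comparison meaningful when important and non-important sub-events are not equally frequent.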
Abstract
Foundation models are used for many real-world applications involving language generation from temporally-ordered multimodal events. In this work, we study the ability of models to identify the most important sub-events in a video, which is a fundamental prerequisite for narrating or summarizing multimodal events. Specifically, we focus on football games and evaluate models on their ability to distinguish between important and non-important sub-events in a game. To this end, we construct a new dataset by leveraging human preferences for importance implicit in football game highlight reels, without any additional annotation costs. Using our dataset, we compare several state-of-the-art multimodal models and show that they are not far from chance level performance. Analyses of models beyond standard evaluation metrics reveal their tendency to rely on a single dominant modality and their…
Taxonomy
Topics: Artificial Intelligence in Games · Video Analysis and Summarization · Multimodal Machine Learning Applications
