Language as the Medium: Multimodal Video Classification through text only
Laura Hanu, Anita L. Verő, James Thewlis

TL;DR
This paper introduces a novel zero-shot multimodal video classification approach that uses detailed textual descriptions generated from visual and audio data, leveraging large language models to interpret complex video context without additional fine-tuning.
Contribution
The paper presents a model-agnostic method that employs large language models to reason about multimodal video descriptions for zero-shot classification, bypassing the need for fine-tuning.
Findings
Effective zero-shot classification on the UCF-101 and Kinetics datasets
Textual descriptions serve as effective proxies for sight and hearing
Demonstrates the potential of combining textual, visual, and auditory models for holistic video understanding
Abstract
Despite an exciting new wave of multimodal machine learning models, current approaches still struggle to interpret the complex contextual relationships between the different modalities present in videos. Going beyond existing methods that emphasize simple activities or objects, we propose a new model-agnostic approach for generating detailed textual descriptions that captures multimodal video information. Our method leverages the extensive knowledge learnt by large language models, such as GPT-3.5 or Llama2, to reason about textual descriptions of the visual and aural modalities, obtained from BLIP-2, Whisper and ImageBind. Without needing additional finetuning of video-text models or datasets, we demonstrate that available LLMs have the ability to use these multimodal textual descriptions as proxies for "sight" or "hearing" and perform zero-shot multimodal classification of videos…
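The pipeline the abstract describes (text descriptions of each modality passed to an LLM, which picks a class) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the prompt wording, and the answer-parsing rule are all hypothetical. The captions, transcript, and audio tags are assumed to have been produced beforehand by models such as BLIP-2, Whisper, and ImageBind, and `llm` stands for any user-supplied callable that maps a prompt string to a response string.

```python
from typing import Callable, Optional, Sequence


def build_prompt(
    frame_captions: Sequence[str],
    transcript: str,
    audio_tags: Sequence[str],
    class_names: Sequence[str],
) -> str:
    """Combine per-modality text descriptions into one classification prompt.

    The wording here is an illustrative assumption, not the paper's prompt.
    """
    lines = [
        "You are given text descriptions of a video.",
        "Visual captions (one per sampled frame):",
    ]
    lines += [f"- {caption}" for caption in frame_captions]
    lines.append(f"Speech transcript: {transcript or '(none)'}")
    lines.append("Audio tags: " + (", ".join(audio_tags) or "(none)"))
    lines.append("Candidate classes: " + ", ".join(class_names))
    lines.append("Answer with exactly one class name from the list.")
    return "\n".join(lines)


def parse_label(response: str, class_names: Sequence[str]) -> Optional[str]:
    """Return the first candidate class mentioned in the LLM response.

    Longer names are checked first so that a short name that is a substring
    of a longer one does not shadow it. Returns None if nothing matches.
    """
    text = response.lower()
    for name in sorted(class_names, key=len, reverse=True):
        if name.lower() in text:
            return name
    return None


def classify_video(
    frame_captions: Sequence[str],
    transcript: str,
    audio_tags: Sequence[str],
    class_names: Sequence[str],
    llm: Callable[[str], str],
) -> Optional[str]:
    """Zero-shot classification: no training, only a prompt and a parse step."""
    prompt = build_prompt(frame_captions, transcript, audio_tags, class_names)
    return parse_label(llm(prompt), class_names)
```

Because the LLM is passed in as a plain callable, the same sketch works with a hosted model, a local model, or a stub for testing, which mirrors the model-agnostic framing of the method.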
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Natural Language Processing Techniques
Methods: Attention Is All You Need · Attention Dropout · Residual Connection · Adam · Linear Layer · Weight Decay · Multi-Head Attention
