Language as the Medium: Multimodal Video Classification through text only

Laura Hanu; Anita L. Verő; James Thewlis

arXiv:2309.10783 · cs.CV · September 20, 2023 · 1 citation

TL;DR

This paper introduces a novel zero-shot multimodal video classification approach that uses detailed textual descriptions generated from visual and audio data, leveraging large language models to interpret complex video context without additional fine-tuning.

Contribution

The paper presents a model-agnostic method that employs large language models to reason about multimodal video descriptions for zero-shot classification, bypassing the need for fine-tuning.

Findings

1. Effective zero-shot classification on the UCF-101 and Kinetics datasets
2. Textual descriptions serve as effective proxies for sight and hearing
3. Demonstrates the potential of combining textual, visual, and auditory models for holistic video understanding

Abstract

Despite an exciting new wave of multimodal machine learning models, current approaches still struggle to interpret the complex contextual relationships between the different modalities present in videos. Going beyond existing methods that emphasize simple activities or objects, we propose a new model-agnostic approach for generating detailed textual descriptions that captures multimodal video information. Our method leverages the extensive knowledge learnt by large language models, such as GPT-3.5 or Llama2, to reason about textual descriptions of the visual and aural modalities, obtained from BLIP-2, Whisper and ImageBind. Without needing additional finetuning of video-text models or datasets, we demonstrate that available LLMs have the ability to use these multimodal textual descriptions as proxies for "sight" or "hearing" and perform zero-shot multimodal classification of videos…
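
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It assumes the per-modality text has already been produced elsewhere (for example frame captions from BLIP-2, a transcript from Whisper, and audio tags derived from ImageBind) and that `llm` is any callable wrapping a model such as GPT-3.5 or Llama2. The function names, the prompt wording, and the label-parsing rule are all hypothetical choices made for illustration; they are not taken from the paper.

```python
from typing import Callable, Optional, Sequence


def build_prompt(
    frame_captions: Sequence[str],
    transcript: str,
    audio_tags: Sequence[str],
    class_names: Sequence[str],
) -> str:
    """Combine per-modality text into one classification prompt.

    The wording is an assumption for illustration, not the paper's prompt.
    """
    lines = [
        "You are given text descriptions of a video.",
        "Visual captions (one per sampled frame):",
    ]
    lines += [f"- {caption}" for caption in frame_captions]
    lines.append(f"Speech transcript: {transcript or '(none)'}")
    lines.append("Audio tags: " + (", ".join(audio_tags) or "(none)"))
    lines.append("Choose exactly one label from: " + ", ".join(class_names))
    lines.append("Answer with the label only.")
    return "\n".join(lines)


def parse_label(reply: str, class_names: Sequence[str]) -> Optional[str]:
    """Return the longest class name mentioned in the reply, or None."""
    lowered = reply.lower()
    matches = [name for name in class_names if name.lower() in lowered]
    return max(matches, key=len) if matches else None


def classify_video(
    llm: Callable[[str], str],  # hypothetical wrapper around an LLM API
    frame_captions: Sequence[str],
    transcript: str,
    audio_tags: Sequence[str],
    class_names: Sequence[str],
) -> Optional[str]:
    """Zero-shot classification: the LLM only ever sees text."""
    prompt = build_prompt(frame_captions, transcript, audio_tags, class_names)
    return parse_label(llm(prompt), class_names)
```

Because the language model receives only text, any captioning, speech, or audio model can be swapped in without changing `classify_video`, which is one way to read the "model-agnostic" claim in the abstract.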

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Natural Language Processing Techniques

Methods: Attention Is All You Need · Attention Dropout · Residual Connection · Adam · Linear Layer · Weight Decay · Multi-Head Attention