TL;DR
This paper evaluates zero-shot video-language models (VLMs) and clustering for analyzing climate change-related social media videos. The VLMs cannot yet detect climate change-specific classes, while clustering surfaces distinct visual patterns; the paper also gives practical guidance for practitioners.
Contribution
It evaluates zero-shot visual classification (VLMs compared against a frame-wise CLIP baseline) and clustering formulated as a minimum cost multicut problem on climate-related social media videos, reporting the strengths and limitations of each approach.
Findings
ConvNeXt V2 and DINOv2 produce meaningful visual clusters.
VLMs currently cannot detect climate change-specific classes.
Clustering reveals distinct visual patterns in social media videos.
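To make the clustering finding concrete, here is a minimal sketch of multicut-style clustering over embedding vectors. It is an illustration under stated assumptions, not the paper's implementation: the edge cost (cosine similarity minus a `threshold`), the greedy contraction heuristic, and the toy 2-D vectors are all hypothetical choices, and real inputs would be embeddings from a backbone such as ConvNeXt V2 or DINOv2.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two plain-Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_multicut(embeddings, threshold=0.5):
    """Greedy edge contraction on a complete graph (a heuristic, not an exact solver).

    Assumed cost model: cost = cosine similarity - threshold.
    Positive cost rewards joining two nodes, negative cost rewards cutting.
    """
    n = len(embeddings)
    cost = {
        frozenset((i, j)): cosine(embeddings[i], embeddings[j]) - threshold
        for i, j in combinations(range(n), 2)
    }
    clusters = {i: [i] for i in range(n)}

    while True:
        best = max(cost.items(), key=lambda kv: kv[1], default=None)
        if best is None or best[1] <= 0:
            break  # no remaining merge improves the objective
        a, b = sorted(best[0])
        # Contract b into a and sum the costs of parallel edges.
        clusters[a].extend(clusters.pop(b))
        del cost[frozenset((a, b))]
        for c in clusters:
            if c == a:
                continue
            merged = cost.pop(frozenset((b, c)), 0.0)
            key = frozenset((a, c))
            cost[key] = cost.get(key, 0.0) + merged

    return sorted(sorted(members) for members in clusters.values())


# Toy embeddings: two items near the x-axis, two near the y-axis.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
result = greedy_multicut(toy, threshold=0.5)
```

Unlike k-means, this formulation does not take the number of clusters as input; it follows from the signs of the edge costs, so the assumed `threshold` controls how fine-grained the result is.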
Abstract
The pervasive growth of digital content, specifically short videos on social media platforms, has significantly altered how topics are discussed and understood in public discourse. In this work, we advance automated visual theme detection by assessing zero-shot and clustering capabilities on social media data. (1) We evaluated the capabilities of notable VLMs such as VideoChatGPT, PandaGPT, and VideoLLava using zero-shot image classification and compared their performance to the baseline provided by frame-wise CLIP image classification. (2) By treating clustering as a minimum cost multicut problem, we aim to uncover insightful patterns in an unsupervised manner. For both analysis strategies, we provide extensive evaluations and practical guidance to practitioners. While VLMs are currently not able to detect climate change-specific classes, the clustering results are distinct visual…
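The frame-wise zero-shot baseline mentioned in the abstract can be sketched as follows. This is a hedged illustration only: the source does not say how per-frame predictions are combined, so averaging per-frame softmax probabilities is an assumption, as are the `scale` value, the class labels, and the toy 2-D vectors that stand in for real CLIP image and text embeddings.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_video(frame_embs, text_embs, labels, scale=100.0):
    """Zero-shot label for a video from per-frame similarity to class prompts.

    Assumption: frame-level probabilities are averaged over all frames.
    """
    text_n = [normalize(t) for t in text_embs]
    mean_probs = [0.0] * len(labels)
    for frame in frame_embs:
        f = normalize(frame)
        # Scaled cosine similarity between this frame and each class prompt.
        logits = [scale * sum(a * b for a, b in zip(f, t)) for t in text_n]
        for k, p in enumerate(softmax(logits)):
            mean_probs[k] += p / len(frame_embs)
    best = max(range(len(labels)), key=mean_probs.__getitem__)
    return labels[best], mean_probs


# Hypothetical labels and stand-in embeddings (not taken from the paper).
labels = ["wildfire", "flood"]
text_embs = [[1.0, 0.0], [0.0, 1.0]]
frame_embs = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.9]]
label, probs = classify_video(frame_embs, text_embs, labels)
```

Averaging probabilities lets a majority of frames decide the video-level label; majority voting over per-frame argmax labels would be an equally plausible aggregation.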
