Exploring Vision Language Models for Multimodal and Multilingual Stance Detection

Jake Vasilakes; Carolina Scarton; Zhixue Zhao

arXiv:2501.17654·cs.CL·January 30, 2025

Open Access

TL;DR

This paper evaluates state-of-the-art Vision-Language Models on a new multilingual, multimodal stance detection dataset, revealing their reliance on text over images and their generally consistent cross-lingual predictions.

Contribution

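The paper introduces a newly extended stance detection dataset covering seven languages and multimodal (image and text) inputs, and evaluates state-of-the-art VLMs on it, examining their use of visual cues, language-specific performance, and cross-modality interactions.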

Findings

01 VLMs rely more on text than images for stance detection.

02 Models tend to use text within images more than other visual cues.

03 Predictions are generally consistent across languages, with some outliers.
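One simple way to picture the cross-lingual consistency in the third finding is to compare the labels a model assigns to the same posts presented in different languages. The sketch below is illustrative only: the function name `pairwise_agreement`, the label set, and the toy predictions are assumptions made for this example, and plain pairwise agreement is not necessarily the measure used in the paper.

```python
from itertools import combinations


def pairwise_agreement(preds_by_lang):
    """Return, for each language pair, the fraction of examples with the same label.

    preds_by_lang maps a language code to a list of predicted stance labels,
    where index i refers to the same underlying post in every language.
    (Hypothetical helper for illustration; not taken from the paper.)
    """
    langs = sorted(preds_by_lang)
    n = len(preds_by_lang[langs[0]])
    if n == 0:
        raise ValueError("no predictions to compare")
    if any(len(preds_by_lang[lang]) != n for lang in langs):
        raise ValueError("all languages must cover the same examples")

    scores = {}
    for a, b in combinations(langs, 2):
        same = sum(x == y for x, y in zip(preds_by_lang[a], preds_by_lang[b]))
        scores[(a, b)] = same / n
    return scores


# Toy predictions (made up) for four posts in three languages.
preds = {
    "en": ["favor", "against", "neutral", "favor"],
    "de": ["favor", "against", "favor", "favor"],
    "es": ["favor", "against", "neutral", "against"],
}
agreement = pairwise_agreement(preds)
for pair, score in agreement.items():
    print(pair, score)
```

A language pair whose agreement is well below the others would be a candidate for the kind of outlier the finding mentions.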

Abstract

Social media's global reach amplifies the spread of information, highlighting the need for robust Natural Language Processing tasks like stance detection across languages and modalities. Prior research predominantly focuses on text-only inputs, leaving multimodal scenarios, such as those involving both images and text, relatively underexplored. Meanwhile, the prevalence of multimodal posts has increased significantly in recent years. Although state-of-the-art Vision-Language Models (VLMs) show promise, their performance on multimodal and multilingual stance detection tasks remains largely unexamined. This paper evaluates state-of-the-art VLMs on a newly extended dataset covering seven languages and multimodal inputs, investigating their use of visual cues, language-specific performance, and cross-modality interactions. Our results show that VLMs generally rely more on text than images…

Taxonomy

Topics: Language, Metaphor, and Cognition