Disturbing Image Detection Using LMM-Elicited Emotion Embeddings

Maria Tzelepi; Vasileios Mezaris

arXiv:2406.12668·cs.CV·June 19, 2024

TL;DR

This paper introduces a novel approach for disturbing image detection by leveraging Large Multimodal Models to extract semantic and emotional features, significantly enhancing classification accuracy.

Contribution

The paper presents a method that encodes LMM-generated semantic descriptions and LMM-elicited emotions with CLIP's text encoder, then uses the resulting text embeddings together with CLIP image embeddings for disturbing image detection.

Findings

01

Achieves state-of-the-art accuracy on the augmented Disturbing Image Detection dataset.

02

Effectively utilizes LMM-elicited emotions for image classification.

03

Significantly outperforms baseline methods.

Abstract

In this paper we address the task of Disturbing Image Detection (DID), exploiting knowledge encoded in Large Multimodal Models (LMMs). Specifically, we propose to exploit LMM knowledge in two ways: first by extracting generic semantic descriptions, and second by extracting elicited emotions. Subsequently, we use CLIP's text encoder to obtain the text embeddings of both the generic semantic descriptions and the LMM-elicited emotions. Finally, we use these text embeddings along with the corresponding CLIP image embeddings to perform the DID task. The proposed method significantly improves the baseline classification accuracy, achieving state-of-the-art performance on the augmented Disturbing Image Detection dataset.
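The final step of the abstract, using the two text embeddings together with the image embedding, might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the embeddings are combined, so concatenation of L2-normalized vectors and a linear sigmoid head are assumptions here. The names `fuse_embeddings` and `predict_disturbing` are hypothetical, and random stand-in vectors replace real CLIP outputs so the block runs without any model.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def fuse_embeddings(image_emb, description_emb, emotion_emb):
    """Concatenate the three normalized embeddings.

    Assumption: fusion by concatenation; the abstract only says the
    embeddings are used "along with" each other.
    """
    fused = []
    for emb in (image_emb, description_emb, emotion_emb):
        fused.extend(l2_normalize(emb))
    return fused


def predict_disturbing(fused, weights, bias=0.0):
    """Hypothetical linear head: sigmoid(w . x + b) as P(disturbing)."""
    logit = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Stand-in vectors. In practice these would come from CLIP's image encoder
# (image_emb) and CLIP's text encoder applied to the LMM outputs
# (description_emb, emotion_emb).
rng = random.Random(0)
dim = 8  # toy dimension, not CLIP's real embedding size
image_emb = [rng.gauss(0, 1) for _ in range(dim)]
description_emb = [rng.gauss(0, 1) for _ in range(dim)]
emotion_emb = [rng.gauss(0, 1) for _ in range(dim)]

fused = fuse_embeddings(image_emb, description_emb, emotion_emb)

# Untrained random weights, for illustration only.
weights = [rng.gauss(0, 0.1) for _ in range(len(fused))]
prob = predict_disturbing(fused, weights)
```

Normalizing each embedding before concatenation keeps the three sources on a comparable scale, so no single one dominates the classifier input; whether the paper does this is not stated in the abstract.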

Taxonomy

Topics: Brain Tumor Detection and Classification · Video Surveillance and Tracking Methods