MObyGaze: a film dataset of multimodal objectification densely annotated by experts

Julie Tores; Elisa Ancarani; Lucile Sassatelli; Hui-Yin Wu; Clement Bergman; Lea Andolfi; Victor Ecrement; Remy Sun; Frederic Precioso; Thierry Devars; Magali Guaresi; Virginie Julliard; Sarah Lecossais

arXiv:2505.22084 · cs.CV · May 29, 2025

Open Access

TL;DR

This paper introduces MObyGaze, a densely annotated multimodal film dataset for analyzing objectification, and explores AI methods to characterize and quantify complex gender representation patterns in movies.

Contribution

It presents a new multimodal dataset with expert annotations on objectification in films and investigates learning approaches for this complex, multi-label task.

Findings

01 Feasibility of modeling objectification using multimodal data

02 Benchmark results for vision, text, and audio models on the dataset

03 Effective methods for learning from diverse, expert-annotated labels

Abstract

Characterizing and quantifying gender representation disparities in audiovisual storytelling contents is necessary to grasp how stereotypes may perpetuate on screen. In this article, we consider the high-level construct of objectification and introduce a new AI task to the ML community: characterize and quantify complex multimodal (visual, speech, audio) temporal patterns producing objectification in films. Building on film studies and psychology, we define the construct of objectification in a structured thesaurus involving 5 sub-constructs manifesting through 11 concepts spanning 3 modalities. We introduce the Multimodal Objectifying Gaze (MObyGaze) dataset, made of 20 movies annotated densely by experts for objectification levels and concepts over freely delimited segments: it amounts to 6072 segments over 43 hours of video with fine-grained localization and categorization. We…
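To give a sense of what "densely annotated" means with the figures quoted above, here is a minimal Python sketch. The `Segment` record, its field names, and the placeholder label values are illustrative assumptions and not the dataset's actual schema; only the counts (20 movies, 6072 segments, 43 hours) and the three modality names come from the abstract.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Figures quoted in the abstract.
NUM_MOVIES = 20
NUM_SEGMENTS = 6072
TOTAL_HOURS = 43
MODALITIES = ("visual", "speech", "audio")


# Hypothetical record layout: field names and label values are
# assumptions for illustration, not the released annotation format.
@dataclass
class Segment:
    movie_id: str
    start_s: float  # segment start, in seconds
    end_s: float    # segment end, in seconds
    level: str      # objectification level (placeholder value)
    concepts: List[str] = field(default_factory=list)  # multi-label

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s


def density(num_segments: int, num_movies: int, hours: float) -> Tuple[float, float]:
    """Return (segments per movie, segments per hour of video)."""
    return num_segments / num_movies, num_segments / hours


per_movie, per_hour = density(NUM_SEGMENTS, NUM_MOVIES, TOTAL_HOURS)
print(f"{per_movie:.1f} segments per movie, {per_hour:.1f} per hour")
# prints "303.6 segments per movie, 141.2 per hour"
```

These are plain averages over the quoted totals. They say nothing about how segments are distributed across films or how long individual segments are, since segments are freely delimited by the annotators.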

Taxonomy

Topics: Multimodal Machine Learning Applications