Demographic User Modeling for Social Robotics with Multimodal Pre-trained Models
Hamed Rahimi, Mouad Abrini, Mahdi Khoramshahi, and Mohamed Chetouani

TL;DR
This paper evaluates the use of multimodal pre-trained models, specifically CLIP, for demographic user profiling in social robotics, introduces new datasets, and proposes a masked image modeling strategy to improve demographic attribute recognition.
Contribution
It introduces two new datasets for demographic profiling and proposes a masked image modeling approach to enhance generalization in multimodal user modeling.
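The summary does not spell out the paper's masking strategy, so the following is only a generic sketch of the input-side step common to masked image modeling: hide a random fraction of image patches so that a model must rely on the visible ones. The function names, the patch-grid representation, and the `mask_ratio` parameter are illustrative assumptions, not taken from the paper.

```python
import random


def random_patch_mask(grid_h, grid_w, mask_ratio, seed=None):
    """Return a grid_h x grid_w grid of booleans; True marks a masked patch.

    Hypothetical helper: the paper's actual masking scheme may differ.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    rng = random.Random(seed)
    n_patches = grid_h * grid_w
    n_masked = int(round(mask_ratio * n_patches))
    # Sample patch indices without replacement so exactly n_masked are hidden.
    masked = set(rng.sample(range(n_patches), n_masked))
    return [
        [(r * grid_w + c) in masked for c in range(grid_w)]
        for r in range(grid_h)
    ]


def apply_mask(patches, mask, fill=0.0):
    """Replace the value of every masked patch with `fill`.

    `patches` is a grid of per-patch values (scalars here, for brevity).
    """
    return [
        [fill if mask[r][c] else value for c, value in enumerate(row)]
        for r, row in enumerate(patches)
    ]


# Toy usage: a 4 x 4 grid of patches, half of them masked.
mask = random_patch_mask(4, 4, mask_ratio=0.5, seed=0)
patches = [[1.0] * 4 for _ in range(4)]
visible = apply_mask(patches, mask)
```

In a real pipeline each patch would be a block of pixels or a patch embedding rather than a scalar, and the masked input would be passed to the image encoder during training.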
Findings
CLIP performs poorly without fine-tuning on demographic tasks.
Fine-tuning significantly improves CLIP's predictive performance, but the model still has trouble generalizing.
Masked image modeling can potentially enhance demographic attribute recognition.
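The zero-shot setting behind the first finding can be sketched as follows: a contrastive model such as CLIP embeds an image and a set of candidate text descriptions into a shared space, then ranks the descriptions by scaled cosine similarity. This is a minimal sketch with toy vectors standing in for real encoder outputs. The prompt strings, the embedding values, and the `logit_scale` default are assumptions for illustration; they do not come from the paper or its datasets.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled image-text cosine similarities.

    `logit_scale` is an assumed temperature; a trained model supplies its own.
    """
    logits = [logit_scale * cosine(image_emb, t) for t in text_embs]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical prompts and toy embeddings (not real CLIP outputs).
prompts = [
    "a photo of a child",
    "a photo of an adult",
    "a photo of an elderly person",
]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.9, 0.1, 0.0]

probs = zero_shot_probs(image_emb, text_embs)
best = max(range(len(prompts)), key=lambda i: probs[i])
print(prompts[best])
```

Fine-tuning leaves this scoring rule unchanged and instead updates the encoders so that matching image-description pairs score higher than mismatched ones.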
Abstract
This paper investigates the performance of multimodal pre-trained models in user profiling tasks based on visual-linguistic demographic data. These models are critical for adapting to the needs and preferences of human users in social robotics, thereby providing personalized responses and enhancing interaction quality. First, we introduce two datasets specifically curated to represent demographic characteristics derived from user facial images. Next, we evaluate the performance of a prominent contrastive multimodal pre-trained model, CLIP, on these datasets, both in its out-of-the-box state and after fine-tuning. Initial results indicate that CLIP performs suboptimally in matching images to demographic descriptions without fine-tuning. Although fine-tuning significantly enhances its predictive capacity, the model continues to exhibit limitations in effectively generalizing subtle…
Taxonomy
Topics: Human Mobility and Location-Based Analysis · Social Robot Interaction and HRI · Context-Aware Activity Recognition Systems
Methods: Contrastive Language-Image Pre-training
