SocialFusion: Addressing Social Degradation in Pre-trained Vision-Language Models
Hamza Tahboub, Weiyan Shi, Gang Hua, Huaizu Jiang

TL;DR
This paper identifies that the general pre-training of current vision-language models (VLMs) impairs their ability to represent social information, a problem it calls "social degradation". It then proposes SocialFusion, a framework that improves social perception across multiple tasks by minimally connecting a frozen visual encoder with a language model.
Contribution
The paper introduces SocialFusion, a novel framework that mitigates social degradation in VLMs by minimally connecting a frozen visual encoder with a language model, enabling positive transfer across social tasks.
Findings
SocialFusion outperforms existing VLMs on five social perception tasks.
It achieves performance comparable to task-specific models on the same benchmarks.
Current pre-training strategies may hinder social competence acquisition.
Abstract
Understanding social interactions from visual cues is a fundamental challenge for a socially competent AI. While powerful pre-trained vision-language models (VLMs) have shown remarkable general capabilities, they surprisingly struggle to unify and learn multiple social perception tasks simultaneously, often exhibiting negative transfer. We identify that this negative transfer stems from a critical issue we term "social degradation," whereby the general visual-linguistic pre-training process of VLMs impairs the visual encoder's ability to represent nuanced social information. We investigate this behavior further under two lenses: decodability through linear representation probing and compatibility through gradient conflict analysis, revealing that both play a role in the degradation, especially the former, which is significantly compromised in the VLM pre-training process. To address…
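The abstract names two diagnostics: decodability, measured by linear probing of frozen features, and compatibility, measured by gradient conflict between tasks. The sketch that follows is a generic illustration of each technique, not the authors' code. It assumes a closed-form ridge-regression probe on one-hot targets (swapped in for the more common logistic-regression probe) and measures conflict as the cosine similarity between two tasks' flattened gradients on shared parameters, where a negative value indicates the tasks pull those parameters in opposing directions. The helper names `linear_probe_accuracy` and `gradient_cosine`, and the synthetic data, are hypothetical.

```python
import numpy as np


def linear_probe_accuracy(train_x, train_y, test_x, test_y, num_classes, ridge=1e-3):
    """Fit a closed-form ridge-regression linear probe on frozen features.

    Hypothetical helper: targets are one-hot labels, predictions are the
    argmax of the linear scores. Held-out accuracy serves as a proxy for how
    linearly decodable the label is from the (frozen) representation.
    """
    def with_bias(x):
        # Append a constant column so the probe is affine, not just linear.
        return np.hstack([x, np.ones((x.shape[0], 1))])

    xtr = with_bias(train_x)
    targets = np.eye(num_classes)[train_y]  # one-hot targets, shape (n, C)
    # Solve the ridge normal equations: (X^T X + ridge * I) W = X^T Y.
    gram = xtr.T @ xtr + ridge * np.eye(xtr.shape[1])
    weights = np.linalg.solve(gram, xtr.T @ targets)
    preds = np.argmax(with_bias(test_x) @ weights, axis=1)
    return float(np.mean(preds == test_y))


def gradient_cosine(grad_a, grad_b, eps=1e-12):
    """Cosine similarity between two tasks' gradients on shared parameters.

    Hypothetical helper: values near -1 indicate strong gradient conflict,
    values near +1 indicate the tasks agree on the update direction.
    """
    a = np.ravel(grad_a)
    b = np.ravel(grad_b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for frozen encoder features: the label is encoded
    # along feature 0, the remaining 15 dimensions are noise.
    labels = rng.integers(0, 2, size=400)
    feats = rng.normal(size=(400, 16))
    feats[:, 0] += 3.0 * (2 * labels - 1)
    acc = linear_probe_accuracy(
        feats[:300], labels[:300], feats[300:], labels[300:], num_classes=2
    )
    print(f"probe accuracy: {acc:.2f}")

    # Toy gradients for two tasks that push shared weights in opposite directions.
    g_task1 = np.array([1.0, 0.5, -0.2])
    g_task2 = np.array([-0.8, -0.4, 0.1])
    print(f"gradient cosine: {gradient_cosine(g_task1, g_task2):.2f}")  # negative => conflict
```

In an actual study the features would come from the visual encoder before and after VLM pre-training, and the gradients from backpropagating each social task's loss to the same shared parameters; a drop in probe accuracy would point to lost decodability, while persistently negative cosines would point to task incompatibility.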
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)
