The Social Gaze of LLMs: A Literature Review of Multimodal Approaches to Human Behavior Understanding
Zihan Liu, Parisa Rabbani, Veda Duddu, Kyle Fan, Madison Lee, Yun Huang

TL;DR
This literature review analyzes how multimodal large language models interpret human behavior, highlighting current practices, limitations, and ethical considerations, and proposes a research agenda for more socially competent systems.
Contribution
The paper systematically reviews 176 studies, identifies gaps in adaptive reasoning, evaluation methods, and ethical focus, and suggests directions for future research in socially aware multimodal systems.
Findings
Predominant use of pattern recognition and information extraction.
Limited support for adaptive, interactive reasoning.
Evaluation mainly relies on static benchmarks, with few human-centered assessments.
Abstract
LLM-powered multimodal systems are increasingly used to interpret human behavior, yet how researchers apply the models' 'social competence' remains poorly understood. This paper presents a systematic literature review of 176 publications across different application domains (e.g., healthcare, education, and entertainment). Using a four-dimensional coding framework (application, technical, evaluative, and ethical), we find (1) frequent use of pattern recognition and information extraction from multimodal sources, but limited support for adaptive, interactive reasoning; (2) a dominant 'modality-to-text' pipeline that privileges language over rich audiovisual cues, stripping away nuanced social cues; (3) evaluation practices reliant on static benchmarks, with socially grounded, human-centered assessments rare; and (4) ethical discussions focused mainly on legal and rights-related risks…
