Cognitive Chain-of-Thought (CoCoT): Structured Multimodal Reasoning about Social Situations

Eunkyu Park; Wesley Hanwen Deng; Gunhee Kim; Motahhare Eslami; Maarten Sap

arXiv:2507.20409·cs.CL·April 21, 2026

TL;DR

The paper introduces CoCoT, a structured reasoning framework for multimodal social tasks, improving model interpretability and performance by mimicking cognitive stages of perception, situation understanding, and norm application.

Contribution

The paper proposes a three-stage reasoning framework for multimodal social tasks, improving model performance and interpretability without explicit prompting at inference.

Findings

01

CoCoT improves task accuracy by 4.6% to 5.9% across multiple social reasoning tasks.

02

Supervised fine-tuning on CoCoT traces yields 5-6% performance gains.

03

Structured reasoning enhances model interpretability and social alignment.

Abstract

Chain-of-Thought (CoT) prompting helps models think step by step. But naive CoT breaks down in visually grounded social tasks, where models must perceive, understand, and judge all at once, bridging perception with norm-grounded reasoning. Recent work has introduced structured reasoning for multi-turn agent planning and visual QA, decomposing tasks into sequential sub-goals. To extend this to single-shot multimodal social reasoning, we introduce Cognitive Chain-of-Thought (CoCoT), a reasoning framework that structures vision-language model (VLM) reasoning through three cognitively inspired stages: Perception (extract grounded facts), Situation (infer situations), and Norm (apply social norms). Evaluation across multiple distinct tasks, such as multimodal intent disambiguation, multimodal theory of mind, social commonsense reasoning, and safety instruction following, shows consistent…
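As a rough illustration of the three stages named in the abstract, the sketch below assembles a single-shot text prompt that asks a model to reason through Perception, Situation, and Norm in order. Only the stage names and their ordering come from the abstract. The function name `build_cocot_prompt`, the instruction wording, and the option formatting are hypothetical and are not the authors' actual prompt.

```python
# Stage names and ordering follow the abstract. The instruction text for each
# stage is a hypothetical illustration, not the prompt used in the paper.
STAGES = [
    ("Perception", "Describe only what is directly observable in the image."),
    ("Situation", "Infer what is happening, based on the observations above."),
    ("Norm", "Apply relevant social norms and decide on the answer."),
]


def build_cocot_prompt(question: str, options: list[str]) -> str:
    """Assemble a text prompt that requests three ordered reasoning stages."""
    lines = [f"Question: {question}"]
    if options:
        lines.append("Options:")
        # Label options (A), (B), (C), ... in the order given.
        lines += [f"  ({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(options)]
    lines.append("Reason in the following stages, in order:")
    for idx, (name, instruction) in enumerate(STAGES, start=1):
        lines.append(f"{idx}. {name}: {instruction}")
    lines.append("Final answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_cocot_prompt("Is the person asking for help?", ["Yes", "No"]))
```

The image itself would be passed to the VLM through whatever interface the model provides; this sketch covers only the text side of the prompt.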
