Toward Cognitive Supersensing in Multimodal Large Language Model

Boyi Li; Yifan Shen; Yuanzhe Liu; Yifan Xu; Jiateng Liu; Xinzhuo Li; Zhengyuan Li; Jingyuan Zhu; Yunhan Zhong; Fangzhou Lan; Jianguo Cao; James M. Rehg; Heng Ji; Ismini Lourentzou; Xu Cao

arXiv:2602.01541 · cs.CV · February 3, 2026

TL;DR

This paper introduces Cognitive Supersensing, a training paradigm that gives multimodal large language models (MLLMs) visual imagery capabilities for complex cognitive reasoning, and reports superior performance on a new VQA benchmark.

Contribution

The paper proposes a training method that integrates visual latent prediction with reinforcement learning to improve cognitive reasoning in MLLMs, together with a comprehensive benchmark for evaluation.

Findings

01 Significant performance improvements on CogSense-Bench.

02 Enhanced generalization on out-of-domain VQA tasks.

03 Demonstrated importance of visual imagery in cognitive reasoning.

Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable success in open-vocabulary perceptual tasks, yet their ability to solve complex cognitive problems remains limited, especially when visual details are abstract and require visual memory. Current approaches primarily scale Chain-of-Thought (CoT) reasoning in the text space, even when language alone is insufficient for clear and structured reasoning, and largely neglect visual reasoning mechanisms analogous to the human visuospatial sketchpad and visual imagery. To mitigate this deficiency, we introduce Cognitive Supersensing, a novel training paradigm that endows MLLMs with human-like visual imagery capabilities by integrating a Latent Visual Imagery Prediction (LVIP) head that jointly learns sequences of visual cognitive latent embeddings and aligns them with the answer, thereby forming vision-based internal reasoning…
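The abstract describes a head that jointly learns a sequence of visual latent embeddings and aligns them with the answer. One way to read "jointly" is as a single objective that sums an answer loss and a latent prediction loss. The sketch below illustrates that reading only: the abstract is truncated and does not state the loss functions, so the mean-squared-error term, the cross-entropy term, the `latent_weight` parameter, and all function names are assumptions, not the authors' implementation.

```python
import math


def latent_mse(pred_latents, target_latents):
    """Mean squared error over a sequence of latent vectors (assumed latent loss)."""
    total, count = 0.0, 0
    for pred_vec, target_vec in zip(pred_latents, target_latents):
        for p, t in zip(pred_vec, target_vec):
            total += (p - t) ** 2
            count += 1
    return total / count


def answer_cross_entropy(logits, label):
    """Cross-entropy of the correct answer index under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def joint_loss(pred_latents, target_latents, answer_logits, answer_label,
               latent_weight=1.0):
    """Hypothetical joint objective: answer loss plus weighted latent loss."""
    return (answer_cross_entropy(answer_logits, answer_label)
            + latent_weight * latent_mse(pred_latents, target_latents))
```

In this reading, `latent_weight` controls how strongly the predicted latent sequence is pulled toward its targets relative to answer accuracy; setting it to zero reduces the objective to ordinary answer supervision.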

Code & Models

Models

PediaMedAI/CogSense-8B (model, 10 downloads, 1 like)

Datasets

PediaMedAI/CogSense-Bench (dataset, 8.6k downloads)

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Visual Attention and Saliency Detection