Unlocking Cognitive Capabilities and Analyzing the Perception-Logic Trade-off

Longyin Zhang; Shuo Sun; Yingxu He; Won Cheng Yi Lewis; Muhammad Huzaifah Bin Md Shahrin; Hardik Bhupendra Sailor; Heng Meng Jeremy Wong; Tarun Kumar Vangani; Yi Ma; Qiongqiong Wang; Minh Duc Pham; Ridong Jiang; Jingtao Li; Jingyi Liao; Zhuohan Liu; Yanfeng Lu; Manas Gupta; Ai Ti Aw

arXiv:2602.23730 · cs.AI · March 2, 2026

TL;DR

This paper introduces MERaLiON2-Omni, a multilingual multimodal model designed for Southeast Asia, which explicitly separates perception and reasoning modules, and analyzes the trade-offs and challenges in integrating sensory grounding with complex reasoning.

Contribution

It presents a novel training pipeline that decouples perception and reasoning, introduces a region-specific multimodal dataset, and provides a detailed analysis of the perception-logic trade-off in multimodal models.

Findings

01. Reasoning enhances performance on abstract tasks but causes instability in sensory processing.

02. The paper identifies temporal drift in audio and over-interpretation of visual input as key issues.

03. The paper proposes a cost-effective data synthesis pipeline using a Super-LLM.

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) pursue omni-perception capabilities, yet integrating robust sensory grounding with complex reasoning remains a challenge, particularly for underrepresented regions. In this report, we introduce the research preview of MERaLiON2-Omni (Alpha), a 10B-parameter multilingual omni-perception model tailored for Southeast Asia (SEA). We present a progressive training pipeline that explicitly decouples and then integrates "System 1" (Perception) and "System 2" (Reasoning) capabilities. First, we establish a robust Perception Backbone by aligning region-specific audio-visual cues (e.g., Singlish code-switching, local cultural landmarks) with a multilingual LLM through orthogonal modality adaptation. Second, to inject cognitive capabilities without large-scale supervision, we propose a cost-effective Generate-Judge-Refine pipeline. By…

Taxonomy

Topics: Multimodal Machine Learning Applications · Multisensory perception and integration · Subtitles and Audiovisual Media