Semantic-Preserving Cross-Style Visual Reasoning for Robust Multi-Modal Understanding in Large Vision-Language Models

Aya Nakayama; Brian Wong; Yuji Nishimura; Kaito Tanaka

arXiv:2510.22838·cs.CV·October 28, 2025

TL;DR

This paper introduces SP-CSVR, a framework that improves multi-modal understanding in large vision-language models by effectively disentangling style from content, enabling robust semantic reasoning across diverse visual styles.

Contribution

The paper presents a novel framework with style-content disentanglement, semantic-aligned decoding, and adaptive semantic consistency, advancing robustness and generalization in multi-style visual reasoning.

Findings

01 Achieves state-of-the-art results on multi-style visual tasks

02 Enhances robustness and generalization across diverse styles

03 Demonstrates effectiveness through extensive ablation studies

Abstract

The "style trap" poses a significant challenge for Large Vision-Language Models (LVLMs): it undermines robust semantic understanding across diverse visual styles, especially in in-context learning (ICL). Existing methods often fail to decouple style from content, which limits generalization. To address this, we propose the Semantic-Preserving Cross-Style Visual Reasoner (SP-CSVR), a novel framework for stable semantic understanding and adaptive cross-style visual reasoning. SP-CSVR integrates a Cross-Style Feature Encoder (CSFE) for style-content disentanglement, a Semantic-Aligned In-Context Decoder (SAICD) for efficient few-shot style adaptation, and an Adaptive Semantic Consistency Module (ASCM) employing multi-task contrastive learning to enforce cross-style semantic invariance. Extensive experiments on a challenging multi-style dataset demonstrate SP-CSVR's state-of-the-art…
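The abstract does not say which contrastive objective the ASCM uses. As an illustration only, the idea of cross-style semantic invariance can be sketched with a generic InfoNCE-style loss, which is an assumption here and not the paper's formulation. In this sketch, embeddings of the same content rendered in two styles are treated as positives and all other pairings as negatives. The function names, the temperature value, and the toy embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_style_info_nce(style_a, style_b, temperature=0.1):
    """Mean InfoNCE loss over a batch of paired embeddings.

    style_a[i] and style_b[i] are assumed to depict the same content in two
    different visual styles (the positive pair); style_b[j] for j != i serve
    as negatives. This is a generic contrastive loss, not the paper's ASCM.
    """
    n = len(style_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(style_a[i], style_b[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the true cross-style partner.
        total += log_denom - logits[i]
    return total / n


# Toy embeddings (made up): two pieces of content, each in two styles.
photo = [[1.0, 0.0], [0.0, 1.0]]
sketch_aligned = [[0.9, 0.1], [0.1, 0.9]]   # same content stays close across styles
sketch_swapped = [[0.1, 0.9], [0.9, 0.1]]   # content mismatched across styles

loss_aligned = cross_style_info_nce(photo, sketch_aligned)
loss_swapped = cross_style_info_nce(photo, sketch_swapped)
print(loss_aligned < loss_swapped)
```

Minimizing a loss of this kind pulls same-content embeddings together regardless of style, which is one common way to encourage style-invariant representations. How the paper combines it with other tasks in its multi-task setup is not stated in the abstract.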
