SEPS: Semantic-enhanced Patch Slimming Framework for fine-grained cross-modal alignment

Xinyu Mao; Junsi Li; Haoji Zhang; Yu Liang; Ming Sun

arXiv:2511.01390·cs.CV·November 4, 2025

Open Access · 3 Reviews

TL;DR

SEPS is a novel framework that enhances fine-grained cross-modal alignment by systematically reducing patch redundancy and ambiguity through semantic integration and relevance-aware patch selection, significantly improving retrieval performance.

Contribution

The paper introduces SEPS, a two-stage semantic-enhanced patch slimming framework that effectively addresses patch redundancy and ambiguity in cross-modal alignment tasks.

Findings

01

SEPS outperforms existing methods by 23-86% in rSum on Flickr30K and MS-COCO datasets.

02

The framework improves text-to-image retrieval accuracy across various model architectures.

03

Experimental results demonstrate the effectiveness of semantic integration and relevance-aware patch selection.
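
For readers unfamiliar with the metric in finding 01: rSum is conventionally the sum of Recall@1, Recall@5, and Recall@10 over both retrieval directions (image-to-text and text-to-image), so its maximum is 600. The sketch below is a simplified illustration, not the paper's evaluation code. It assumes a square similarity matrix in which row i and column i belong to the same image-caption pair, whereas the standard Flickr30K and MS-COCO protocols use five captions per image. The function names `recall_at_k` and `rsum` are hypothetical.

```python
import numpy as np


def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Percentage of queries (rows) whose matching item ranks in the top k.

    Assumes the ground-truth match for row i is column i (one caption per image).
    """
    gt = np.diag(sim)[:, None]          # score of the correct match for each query
    ranks = (sim > gt).sum(axis=1)      # number of candidates scored strictly higher
    return 100.0 * float(np.mean(ranks < k))


def rsum(sim: np.ndarray, ks=(1, 5, 10)) -> float:
    """Sum of Recall@K over both directions: rows as queries, then columns as queries."""
    i2t = sum(recall_at_k(sim, k) for k in ks)
    t2i = sum(recall_at_k(sim.T, k) for k in ks)
    return i2t + t2i


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sim = rng.normal(size=(100, 100)) + 3.0 * np.eye(100)  # synthetic scores
    print(f"rSum = {rsum(sim):.1f}")
```

Because rSum aggregates six recall values, a gain reported in rSum can come from either retrieval direction or from any of the three cutoffs; the per-direction recalls are needed to tell which.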

Abstract

Fine-grained cross-modal alignment aims to establish precise local correspondences between vision and language, forming a cornerstone for visual question answering and related multimodal applications. Current approaches face challenges in addressing patch redundancy and ambiguity, which arise from the inherent information density disparities across modalities. Recently, Multimodal Large Language Models (MLLMs) have emerged as promising solutions to bridge this gap through their robust semantic generation capabilities. However, the dense textual outputs from MLLMs may introduce conflicts with the original sparse captions. Furthermore, accurately quantifying semantic relevance between rich visual patches and concise textual descriptions remains a core challenge. To overcome these limitations, we introduce the Semantic-Enhanced Patch Slimming (SEPS) framework, which systematically…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 6 · Confidence 5

Strengths

SEPS effectively merges MLLM-generated dense descriptions with human captions, offering unified semantic guidance for patch selection. This hybrid approach reduces redundancy and ambiguity, improving visual–textual grounding.

Weaknesses

w1. Dense text is generated offline using LLaVA, but the paper does not confirm whether this model has previously seen the test images or captions.
w2. Besides the potential for information leakage, using dense text at test time raises an efficiency problem. Compared with D2S-VSE, which also uses dense text, the method in this paper appears to require dense text at test time as well, and generating dense text in real time is very inefficient.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

The problem formulation of fine-grained cross-modal alignment is well-motivated, addressing practical issues of patch redundancy and ambiguity in multi-modal retrieval.

Weaknesses

1. The paper fails to benchmark against recent state-of-the-art models (e.g., CLIP variants, SigLIP 2, FG-CLIP, FineCLIP). This makes it difficult to assess SEPS’s competitiveness in the current research landscape. Perhaps the authors could validate the proposed approach by applying their fine-tuning strategy to these baselines.
2. The evaluation is restricted to standard image-text retrieval datasets (Flickr30K, MS-COCO) without testing on more complex or domain-specific scenarios, limiting the

Reviewer 03 · Rating 2 · Confidence 5

Strengths

1. This paper introduces a two-stage mechanism that incorporates unified semantic representations derived from both dense and sparse textual modalities. This mechanism eliminates potential semantic inconsistencies, enabling more accurate identification of visual patches.
2. This paper leverages external MLLM to generate dense textual descriptions. This approach effectively enriches the semantic features available for alignment, directly addressing the key limitation of sparse captions that hind

Weaknesses

1. Lack of Novelty: The proposed SDTPS bears a strong resemblance to the LAPS framework [1], which also performs patch selection guided by textual relevance computed via cross-attention between text and patch embeddings. SEPS claims novelty by incorporating dense captions from MLLMs (e.g., LLaVA) as additional textual supervision. However, beyond the inclusion of dense text, the mechanism of computing patch importance (semantic scoring + differentiable selection) closely parallels that of LAPS.
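
To make the mechanism under discussion concrete, the following is a generic sketch of text-guided patch selection: each patch is scored by its best cosine similarity to any text token, and the highest-scoring patches are kept. It is an illustration under stated assumptions, not the SEPS or LAPS implementation. It uses a hard top-k in place of the differentiable selection the reviewer mentions, plain cosine similarity in place of cross-attention, and a hypothetical function name `select_patches` with a hypothetical `keep_ratio` parameter.

```python
import numpy as np


def select_patches(patches: np.ndarray, tokens: np.ndarray, keep_ratio: float = 0.5):
    """Keep the patches most relevant to the text.

    patches: (num_patches, dim) patch embeddings
    tokens:  (num_tokens, dim) text-token embeddings
    Returns the sorted indices of kept patches and the per-patch scores.
    """
    # L2-normalise so the dot product below is cosine similarity
    p = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)

    sim = p @ t.T                 # (num_patches, num_tokens)
    scores = sim.max(axis=1)      # relevance = best match against any token

    k = max(1, int(round(keep_ratio * len(patches))))
    keep = np.sort(np.argsort(-scores)[:k])   # hard top-k (not differentiable)
    return keep, scores


if __name__ == "__main__":
    patches = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
    tokens = np.array([[1.0, 0.0]])
    keep, scores = select_patches(patches, tokens, keep_ratio=0.5)
    print(keep, scores)
```

In this framing, adding dense MLLM captions changes what `tokens` contains rather than how the scores are computed, which is the distinction the novelty concern turns on.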

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning