Cross-Modal Scene Semantic Alignment for Image Complexity Assessment

Yuqing Luo; Yixiao Li; Jiang Liu; Jun Fu; Hadi Amirpour; Guanghui Yue; Baoquan Zhao; Padraig Corcoran; Hantao Liu; Wei Zhou

arXiv:2510.18377·cs.CV·October 22, 2025

TL;DR

This paper introduces Cross-Modal Scene Semantic Alignment (CM-SSA), an image complexity assessment method that uses scene semantic information from text-image pairs so that its predictions align more closely with human perception, and it reports outperforming existing methods.

Contribution

The paper proposes a cross-modal approach to image complexity assessment (ICA) that aligns scene semantics across text and images, improving prediction accuracy over prior single-modality methods.

Findings

01. Significantly outperforms state-of-the-art ICA methods on multiple datasets.
02. Demonstrates the effectiveness of cross-modal semantic alignment in capturing image complexity.
03. Provides publicly available code for reproducibility.

Abstract

Image complexity assessment (ICA) is a challenging task in perceptual evaluation due to the subjective nature of human perception and the inherent semantic diversity in real-world images. Existing ICA methods predominantly rely on hand-crafted or shallow convolutional neural network-based features of a single visual modality, which are insufficient to fully capture the perceived representations closely related to image complexity. Recently, cross-modal scene semantic information has been shown to play a crucial role in various computer vision tasks, particularly those involving perceptual understanding. However, the exploration of cross-modal scene semantic information in the context of ICA remains unaddressed. Therefore, in this paper, we propose a novel ICA method called Cross-Modal Scene Semantic Alignment (CM-SSA), which leverages scene semantic alignment from a cross-modal…

Taxonomy

Topics: Visual Attention and Saliency Detection · Image and Video Quality Assessment · Advanced Neural Network Applications