Physics-Based Benchmarking Metrics for Multimodal Synthetic Images

Kishor Datta Gupta; Marufa Kamal; Md. Mahfuzur Rahman; Fahad Rahman; Mohd Ariful Haque; Sunzida Siddique

arXiv:2511.15204·cs.CV·May 11, 2026

TL;DR

This paper introduces PCMDE, a physics-constrained multimodal evaluation metric that combines vision-language models and large language models to better assess semantic and structural accuracy in synthetic images.

Contribution

It proposes a novel evaluation framework integrating physics-based reasoning with multimodal feature extraction for improved image assessment.

Findings

01. PCMDE outperforms traditional metrics in domain-specific scenarios.

02. The method effectively enforces structural and relational constraints.

03. It combines object detection, vision-language models, and LLM reasoning for comprehensive evaluation.

Abstract

Current state-of-the-art measures such as BLEU, CIDEr, VQA score, SigLIP-2, and CLIPScore often fail to capture semantic or structural accuracy, especially in domain-specific or context-dependent scenarios. To address this, the paper proposes a Physics-Constrained Multimodal Data Evaluation (PCMDE) metric that combines large language models with reasoning, knowledge-based mapping, and vision-language models. The architecture comprises three main stages: (1) extraction of spatial and semantic information as multimodal features through object detection and vision-language models (VLMs); (2) Confidence-Weighted Component Fusion for adaptive component-level validation; and (3) physics-guided reasoning using large language models to enforce structural and relational constraints (e.g., alignment, position, consistency).
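
The abstract names the fusion and constraint stages without giving their formulas, so the sketch below is only one plausible reading, not the authors' method. It assumes that stage (2) can be approximated by a confidence-weighted mean of per-component scores and that stage (3) can be approximated by a fixed penalty per violated constraint. The names `ComponentScore`, `confidence_weighted_fusion`, `apply_constraint_penalty`, the `penalty` value, and the example components are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ComponentScore:
    """One detected component (hypothetical structure, not from the paper)."""
    name: str
    score: float       # assumed per-component match score in [0, 1]
    confidence: float  # assumed detector/VLM confidence in [0, 1]


def confidence_weighted_fusion(components):
    """Return the confidence-weighted mean of component scores.

    Assumption: low-confidence components should contribute less to the
    fused score. Returns 0.0 when there is no usable confidence mass.
    """
    total_weight = sum(c.confidence for c in components)
    if total_weight == 0:
        return 0.0
    return sum(c.score * c.confidence for c in components) / total_weight


def apply_constraint_penalty(fused, violations, penalty=0.1):
    """Subtract a fixed penalty per violated constraint, floored at 0.

    Assumption: in the paper an LLM judges structural and relational
    constraints; here the number of violations is simply passed in.
    """
    return max(0.0, fused - penalty * violations)


# Illustrative inputs only; these values are made up for the example.
components = [
    ComponentScore("wheel", score=0.9, confidence=0.8),
    ComponentScore("axle", score=0.5, confidence=0.2),
]
fused = confidence_weighted_fusion(components)
final = apply_constraint_penalty(fused, violations=1)
```

With these made-up inputs, the high-confidence component dominates the fused score, and one violated constraint then lowers it by the assumed penalty. A real implementation would take the scores from the detection and VLM stage and the violation count from the LLM reasoning stage.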
