Can a Second-View Image Be a Language? Geometric and Semantic Cross-Modal Reasoning for X-ray Prohibited Item Detection

Chuang Peng; Renshuai Tao; Zhongwei Ren; Xianglong Liu; Yunchao Wei

arXiv:2511.18385·cs.CV·November 25, 2025

Open Access

TL;DR

This paper introduces a new benchmark and a multimodal model that treats second-view X-ray images as a language-like modality, enhancing prohibited item detection through cross-view and cross-modal reasoning.

Contribution

It presents DualXrayBench, a comprehensive benchmark with dual-view images and captions, and proposes GSR, a model that leverages cross-view geometry and semantics as a language-like modality for improved detection.

Findings

01. GSR significantly outperforms existing methods on X-ray detection tasks.

02. DualXrayBench provides a new dataset and evaluation framework for multi-view X-ray analysis.

03. Treating second-view images as a language-like modality improves reasoning capabilities.

Abstract

Automatic X-ray prohibited item detection is vital for security inspection and has been widely studied. Traditional methods rely on the visual modality, often struggling with complex threats. While recent studies incorporate language to guide single-view images, human inspectors typically use dual-view images in practice. This raises the question: can the second view provide constraints similar to a language modality? In this work, we introduce DualXrayBench, the first comprehensive benchmark for X-ray inspection that includes multiple views and modalities. It supports eight tasks designed to test cross-view reasoning. In DualXrayBench, we introduce a caption corpus consisting of 45,613 dual-view image pairs across 12 categories with corresponding captions. Building upon these data, we propose the Geometric (cross-view)-Semantic (cross-modality) Reasoner (GSR), a multimodal model that…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Radiology practices and education