RelayFormer: A Unified Local-Global Attention Framework for Scalable Image and Video Manipulation Localization
Wen Huang, Jiarui Yang, Tao Dai, Jiawei Li, Shaoxiong Zhan, Bin Wang, Shu-Tao Xia

TL;DR
RelayFormer is a unified framework for localizing manipulated regions in images and videos. It adapts to varying resolutions and to both input types through a global-local relay attention mechanism, and the authors report state-of-the-art results.
Contribution
It introduces RelayFormer, a scalable, resolution-agnostic, and modality-unified model for visual manipulation localization using a global-local relay attention framework.
Findings
Achieves state-of-the-art performance on multiple benchmarks.
Efficiently handles arbitrary resolutions and video sequences.
Balances accuracy with computational efficiency.
Abstract
Visual manipulation localization (VML) aims to identify tampered regions in images and videos, a task that has become increasingly challenging with the rise of advanced editing tools. Existing methods face two main issues: resolution diversity, where resizing or padding distorts forensic traces and reduces efficiency, and the modality gap, as images and videos often require separate models. To address these challenges, we propose RelayFormer, a unified framework that adapts to varying resolutions and modalities. RelayFormer partitions inputs into fixed-size sub-images and introduces Global-Local Relay (GLR) tokens, which propagate structured context through a global-local relay attention (GLRA) mechanism. This enables efficient exchange of global cues, such as semantic or temporal consistency, while preserving fine-grained manipulation artifacts. Unlike prior methods that rely on…
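The partition-and-relay idea described in the abstract can be illustrated with a small toy sketch. This is one plausible reading of the description, not the authors' implementation: the function names (`partition`, `attend`, `relay_step`), the one-relay-token-per-tile layout, the ordering of the global and local steps, and the absence of learned projections are all assumptions made for illustration.

```python
import math


def partition(height, width, tile):
    """Return (top, left, bottom, right) boxes covering a height x width input.

    Edge boxes are clipped to the input bounds. How borders are actually
    handled (padding, overlap) is not stated in the abstract; clipping is
    an assumption of this sketch.
    """
    boxes = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            boxes.append((top, left, min(top + tile, height), min(left + tile, width)))
    return boxes


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def relay_step(tiles, relay):
    """One toy relay round (hypothetical layout: one relay token per tile).

    tiles: list of tiles, each a list of token vectors.
    relay: list with one relay vector per tile.
    """
    # Global step: relay tokens attend to each other across all tiles.
    relay = [attend(r, relay, relay) for r in relay]
    # Local step: each tile's tokens attend to that tile's own tokens
    # plus its updated relay token, which carries the cross-tile context.
    new_tiles = []
    for tokens, r in zip(tiles, relay):
        context = tokens + [r]
        new_tiles.append([attend(t, context, context) for t in tokens])
    return new_tiles, relay


# A 300 x 500 input with 256-pixel tiles yields a 2 x 2 grid of boxes.
boxes = partition(300, 500, 256)
print(len(boxes))  # 4
```

In this sketch, attention cost inside a tile depends only on the tile size, and cross-tile communication grows with the number of relay tokens rather than the number of pixels, which is the kind of trade-off the abstract alludes to when it mentions efficient exchange of global cues.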
Peer Reviews
Decision · ICLR 2026 Poster
- The insight of bridging image and video manipulation localization holds novelty. It will offer a benchmark case for future multi-modal VML tasks.
- The proposed method’s resolution strategy offers a significant computational efficiency advantage within the current field of manipulation detection.
- The experiments in this paper are comprehensive and well-organized, effectively demonstrating the proposed claims.
- The proposed method shares some similarities with Visual Prompt Tuning [A], as it introduces additional trainable tokens to transmit information and assist decision-making. It is recommended that the authors discuss this connection in an appropriate section of the paper.
- To the best of my knowledge, RoPE is introduced for the first time in a manipulation localization model. The authors may consider analyzing the advantages of this positional embedding for a pure computer vision task like mani
A new framework for unified image and video forensics. Achieves top average F1 on image benchmarks, and consistently high IoU/F1 across video inpainting methods on MOSE. Robustness curves for Gaussian blur, noise, and JPEG compression.
The proposed approach relies on fusing local and global information, a well-established technique in computer vision whose effectiveness is often assumed. While results from training on both are presented, there is no clear explanation or investigation into how these modalities mutually influence each other. An intuitive hypothesis is that images can be considered single frames of videos, and training on image forgery could enhance video forgery detection (and vice-versa). To truly demonstrate
1. The research problem is well-motivated. The paper correctly identifies that manipulation detection requires simultaneous consideration of both fine-grained local artifacts and coarse-grained global consistency cues (such as illumination mismatches and structural redundancies). This provides a sound foundation for proposing an efficient global information propagation mechanism and offers valuable insights for the forensics community.
2. The input unification strategy (Section 3.1) is practical.
1. The authors emphasize that images and videos belong to different modalities, which is imprecise. Multimodality typically refers to different data types (e.g., vision and text, vision and audio). Both images and videos belong to the visual modality, differing only in the presence or absence of temporal dimension. This conceptual confusion weakens the theoretical positioning of the paper. It would be more appropriate to frame this as a "temporal dimension extension problem" rather than a "modal
Taxonomy
Topics: Digital Media Forensic Detection · Advanced Image Processing Techniques · Generative Adversarial Networks and Image Synthesis
