Align and Surpass Human Camouflaged Perception: Visual Refocus Reinforcement Fine-Tuning

Ruolin Shen; Xiaozhong Ji; Kai WU; Jiangning Zhang; Yijun He; HaiHua Yang; Xiaobin Hu; Xiaoyu Sun

arXiv:2505.19611·cs.CV·May 27, 2025

TL;DR

This paper introduces a visual refocus reinforcement framework that enables multi-modal models to better identify camouflaged objects by mimicking human visual perception, achieving superior reasoning and detection performance.

Contribution

The paper proposes a novel visual refocus reinforcement learning approach that improves multi-modal models' ability to perceive camouflaged objects, surpassing existing fine-tuning methods.

Findings

01 Emergence of refocus visual phenomena with multiple reasoning tokens

02 Significant performance improvements in camouflaged object classification

03 Enhanced detection accuracy over baseline methods

Abstract

Current multi-modal models exhibit a notable misalignment with the human visual system when identifying objects that are visually assimilated into the background. Our observations reveal that these multi-modal models cannot distinguish concealed objects, demonstrating an inability to emulate human cognitive processes, which effectively utilize foreground-background similarity principles for visual analysis. To analyze this hidden human-model visual thinking discrepancy, we build a visual system that mimics human camouflaged visual perception to progressively and iteratively 'refocus' on visually concealed content. The refocus is a progressive guidance mechanism that enables models to logically localize objects in images through stepwise reasoning. The localization process of concealed objects requires hierarchical attention shifting with dynamic adjustment and refinement of prior…

Code & Models

Repositories

huuxiaobin/vrrf
none · Official

Taxonomy

Topics: Visual Attention and Saliency Detection · Visual perception and processing mechanisms · Image and Video Quality Assessment

Methods: Softmax · Attention Is All You Need · ALIGN