Align and Surpass Human Camouflaged Perception: Visual Refocus Reinforcement Fine-Tuning
Ruolin Shen, Xiaozhong Ji, Kai Wu, Jiangning Zhang, Yijun He, HaiHua Yang, Xiaobin Hu, Xiaoyu Sun

TL;DR
This paper introduces a visual refocus reinforcement framework that enables multi-modal models to better identify camouflaged objects by mimicking human visual perception, achieving superior reasoning and detection performance.
Contribution
The paper proposes a novel visual refocus reinforcement learning approach that improves multi-modal models' ability to perceive camouflaged objects, surpassing existing fine-tuning methods.
Findings
Emergence of a visual "refocus" phenomenon, marked by multiple reasoning tokens
Significant performance improvements in camouflaged object classification
Enhanced detection accuracy over baseline methods
Abstract
Current multi-modal models exhibit a notable misalignment with the human visual system when identifying objects that are visually assimilated into the background. Our observations reveal that these multi-modal models cannot distinguish concealed objects, demonstrating an inability to emulate human cognitive processes that effectively utilize foreground-background similarity principles for visual analysis. To analyze this hidden human-model visual thinking discrepancy, we build a visual system that mimics human perception of camouflaged objects to progressively and iteratively "refocus" on visually concealed content. The refocus is a progressive guidance mechanism enabling models to logically localize objects in images through stepwise reasoning. The localization process of concealed objects requires hierarchical attention shifting with dynamic adjustment and refinement of prior…
Taxonomy
Topics: Visual Attention and Saliency Detection · Visual perception and processing mechanisms · Image and Video Quality Assessment
Methods: Softmax · Attention Is All You Need · ALIGN
