# CLIP-RL: Closed-Loop Video Inpainting with Detection-Guided Reinforcement Learning

**Authors:** Meng Wang, Jing Ren, Bing Wang, Xueping Tang

PMC · DOI: 10.3390/s26020447 · Sensors (Basel, Switzerland) · 2026-01-09

## TL;DR

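CLIP-RL reformulates video inpainting as a closed-loop reinforcement learning problem: a policy network adjusts the inpainting strategy using feedback from a pre-trained inpainting detection module, yielding modest PSNR and SSIM gains over ProPainter on YouTube-VOS.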

## Contribution

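The paper makes one of the first attempts to apply reinforcement learning to video inpainting. It frames inpainting as an agent–environment interaction in which the inpainting module executes the agent's actions and a pre-trained inpainting detection module supplies quality feedback, closing the loop for adaptive strategy optimization.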

## Key findings

- Compared to ProPainter on the YouTube-VOS dataset, CLIP-RL improves PSNR from 34.43 to 34.67 and SSIM from 0.974 to 0.986.
- Qualitative analysis shows CLIP-RL excels in detail preservation and artifact suppression.
- The framework uses a policy network and a composite reward function, which incorporates a weighted temporal alignment loss, to adjust the inpainting strategy dynamically and refine results iteratively.
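
To put the PSNR metric mentioned in the findings above in context, here is a minimal sketch of its standard definition. It is not taken from the paper: it assumes 8-bit frames (peak value 255) flattened into plain sequences, and the helper names `psnr` and `mse_ratio_for_psnr_gain` are illustrative only.

```python
import math


def psnr(reference, restored, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(restored) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: PSNR is unbounded
    return 10.0 * math.log10(peak ** 2 / mse)


def mse_ratio_for_psnr_gain(gain_db):
    """Factor by which MSE changes for a given PSNR gain (same peak value)."""
    return 10.0 ** (-gain_db / 10.0)


if __name__ == "__main__":
    # PSNR values reported in the abstract: 34.43 (ProPainter) and 34.67 (CLIP-RL).
    print(mse_ratio_for_psnr_gain(34.67 - 34.43))
```

Under this definition, a 0.24 dB gain on a single frame pair would correspond to roughly a 5% reduction in mean squared error. Dataset-level PSNR is usually averaged over many videos, so this is only a rough reading of the reported numbers.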

## Abstract

Existing video inpainting methods typically combine optical flow propagation with Transformer architectures, achieving promising inpainting results. However, they lack adaptive inpainting strategy optimization in diverse scenarios, and struggle to capture high-level temporal semantics, causing temporal inconsistencies and quality degradation. To address these challenges, we make one of the first attempts to introduce reinforcement learning into the video inpainting domain, establishing a closed-loop framework named CLIP-RL that enables adaptive strategy optimization. Specifically, video inpainting is reformulated as an agent–environment interaction, where the inpainting module functions as the agent’s execution component, and a pre-trained inpainting detection module provides real-time quality feedback. Guided by a policy network and a composite reward function that incorporates a weighted temporal alignment loss, the agent dynamically selects actions to adjust the inpainting strategy and iteratively refines the inpainting results. Compared to ProPainter, CLIP-RL improves PSNR from 34.43 to 34.67 and SSIM from 0.974 to 0.986 on the YouTube-VOS dataset. Qualitative analysis demonstrates that CLIP-RL excels in detail preservation and artifact suppression, validating its superiority in video inpainting tasks.
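
The closed loop described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and not the authors' implementation: the action set, the `refine` step, the `detection_score` stand-in for the pre-trained inpainting detector, the form of `composite_reward`, and the tabular epsilon-greedy selector used in place of a policy network are all hypothetical. Frames are lists of floats in [0, 1] that share one per-pixel mask.

```python
import random

# Hypothetical discrete action set: blending strength of one refinement step.
ACTIONS = (0.25, 0.5, 1.0)


def unmasked_mean(frame, mask):
    vals = [v for v, m in zip(frame, mask) if not m]
    return sum(vals) / len(vals)  # assumes at least one unmasked pixel


def refine(frames, mask, strength):
    """Toy inpainting step: pull masked pixels toward the frame's unmasked mean."""
    out = []
    for frame in frames:
        target = unmasked_mean(frame, mask)
        out.append([(1 - strength) * v + strength * target if m else v
                    for v, m in zip(frame, mask)])
    return out


def detection_score(frames, mask):
    """Toy detector in [0, 1]: how far masked pixels sit from their surroundings."""
    n_masked = sum(mask)  # assumes at least one masked pixel
    total = 0.0
    for frame in frames:
        target = unmasked_mean(frame, mask)
        total += sum(abs(v - target) for v, m in zip(frame, mask) if m) / n_masked
    return min(1.0, total / len(frames))


def temporal_loss(frames, mask):
    """Mean absolute change of masked pixels between consecutive frames."""
    n_masked = sum(mask)
    total = 0.0
    for prev, cur in zip(frames, frames[1:]):
        total += sum(abs(a - b) for a, b, m in zip(prev, cur, mask) if m) / n_masked
    return total / (len(frames) - 1)  # assumes at least two frames


def composite_reward(frames, mask, w_temporal):
    """Assumed reward: low detectability minus a weighted temporal penalty."""
    return (1.0 - detection_score(frames, mask)) - w_temporal * temporal_loss(frames, mask)


def closed_loop(frames, mask, steps=20, epsilon=0.2, w_temporal=0.5, seed=0):
    """Select an action, refine, score the result, update action values, repeat."""
    rng = random.Random(seed)
    values = [0.0] * len(ACTIONS)  # running mean reward per action
    counts = [0] * len(ACTIONS)
    history = []
    for _ in range(steps):
        if rng.random() < epsilon:  # explore
            a = rng.randrange(len(ACTIONS))
        else:  # exploit the best action seen so far
            a = max(range(len(ACTIONS)), key=values.__getitem__)
        frames = refine(frames, mask, ACTIONS[a])           # agent acts
        reward = composite_reward(frames, mask, w_temporal)  # environment feedback
        counts[a] += 1
        values[a] += (reward - values[a]) / counts[a]
        history.append(reward)
    return frames, history
```

Calling `closed_loop(frames, mask)` returns the refined frames and the reward after each step. The sketch only conveys the structure of the loop (act, receive detection-based feedback, update the selector); the paper's policy network, detector, and reward weighting are described in the full text.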

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12845982/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12845982/full.md

## References

34 references — full list in the complete paper: https://tomesphere.com/paper/PMC12845982/full.md

---
Source: https://tomesphere.com/paper/PMC12845982