STR-Match: Matching SpatioTemporal Relevance Score for Training-Free Video Editing
Junsung Lee, Junoh Kang, Bohyung Han

TL;DR
STR-Match is a training-free video editing method that improves temporal coherence and visual fidelity by modeling spatiotemporal pixel relevance, and it outperforms existing approaches.
Contribution
It introduces a novel STR score for spatiotemporal relevance, enabling effective latent optimization in text-to-video diffusion models without 3D attention.
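To give a rough sense of why avoiding 3D attention matters, the sketch below counts pairwise attention scores for full 3D attention versus factorized 2D spatial plus 1D temporal attention. The function name and the clip size (16 frames of 32x32 latent tokens) are illustrative assumptions, not values from the paper.

```python
def attention_pair_counts(frames: int, tokens_per_frame: int) -> tuple[int, int]:
    """Count pairwise attention scores (illustrative arithmetic, not the paper's analysis)."""
    # Full 3D attention: every token in the clip attends to every other token.
    full_3d = (frames * tokens_per_frame) ** 2
    # 2D spatial attention: tokens attend only within their own frame.
    spatial_2d = frames * tokens_per_frame ** 2
    # 1D temporal attention: each spatial location attends across frames.
    temporal_1d = tokens_per_frame * frames ** 2
    return full_3d, spatial_2d + temporal_1d


# Hypothetical clip size chosen only for illustration.
full, factorized = attention_pair_counts(frames=16, tokens_per_frame=32 * 32)
print(full, factorized)  # 268435456 17039360
```

Under these assumed sizes the factorized form needs roughly one sixteenth as many pairwise scores, and the gap widens as the number of frames grows.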
Findings
Outperforms existing methods in visual quality
Maintains temporal consistency across frames
Handles significant domain transformations effectively
Abstract
Previous text-guided video editing methods often suffer from temporal inconsistency, motion distortion, and, most notably, limited domain transformation. We attribute these limitations to insufficient modeling of spatiotemporal pixel relevance during the editing process. To address this, we propose STR-Match, a training-free video editing algorithm that produces visually appealing and spatiotemporally coherent videos through latent optimization guided by our novel STR score. The score captures spatiotemporal pixel relevance across adjacent frames by leveraging 2D spatial attention and 1D temporal modules in text-to-video (T2V) diffusion models, without the overhead of computationally expensive 3D attention mechanisms. Integrated into a latent optimization framework with a latent mask, STR-Match generates temporally consistent and visually faithful videos, maintaining strong performance…
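The abstract does not give the STR score formula, so the following is only a minimal sketch of the general idea: combining per-frame 2D spatial attention with per-location 1D temporal attention to score how relevant a pixel in one frame is to a pixel in the next. The combination rule, the function names (`adjacent_frame_relevance`, `match_loss`), and the array layouts are all assumptions made for illustration, not the authors' definition.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax, used here only to build toy attention maps."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def adjacent_frame_relevance(spatial_attn, temporal_attn):
    """Hypothetical relevance between pixel i in frame f and pixel j in frame f+1.

    spatial_attn:  (F, N, N), attention from pixel i to pixel j within frame f.
    temporal_attn: (N, F, F), attention from frame f to frame g at location n.
    Returns an array of shape (F-1, N, N).
    """
    num_frames = spatial_attn.shape[0]
    f = np.arange(num_frames - 1)
    # Temporal weight of the hop f -> f+1 at every location: shape (F-1, N).
    t_hop = temporal_attn[:, f, f + 1].T
    # Path A (assumed): hop in time at location i, then attend spatially in frame f+1.
    path_a = t_hop[:, :, None] * spatial_attn[1:]
    # Path B (assumed): attend spatially in frame f, then hop in time at location j.
    path_b = spatial_attn[:-1] * t_hop[:, None, :]
    return 0.5 * (path_a + path_b)


def match_loss(source_score, target_score):
    """Mean squared difference between source and target relevance maps (assumed loss)."""
    return float(np.mean((source_score - target_score) ** 2))


# Toy attention maps: 4 frames, 6 spatial tokens per frame.
rng = np.random.default_rng(0)
spatial = softmax(rng.normal(size=(4, 6, 6)))
temporal = softmax(rng.normal(size=(6, 4, 4)))
score = adjacent_frame_relevance(spatial, temporal)
print(score.shape)  # (3, 6, 6)
```

In a latent optimization loop, a loss of this kind would be computed between scores from the source video and the video being edited, and its gradient would update the target latents, possibly restricted by a latent mask. Only the factorized attention inputs are taken from the abstract; everything else here is a stand-in.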
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
Below are the strong points of this paper: 1. The paper is clear and enjoyable to read, providing a straightforward explanation of the motivation and methodology. 2. The proposed STR score matching between source and target is a novel yet simple approach that can be easily applied to any pre-trained text-to-video (T2V) model. 3. The experiments are thoughtfully designed and effectively demonstrate the strength and validity of the proposed method.
1. Although the authors present computational complexity in Appendix B, details regarding inference time and additional FLOPs should be included to enable a more comprehensive comparison. 2. The current STR score combines spatial and temporal attention components. To better demonstrate the complementary effects of these two components, it would be helpful for the authors to include an ablation study, for example comparing the full STR score with spatial-only and temporal-only variants.
- The paper proposes a training-free video-to-video (V2V) editing method. - The paper conducts extensive comparisons with state-of-the-art baselines and shows superiority over them. - The paper conducts comprehensive ablation studies.
- The paper is based on the DDS (Delta Denoising Score, ICCV 2023) framework and extends score-based text-guided editing to the T2V setup. Thus, a naive extension of DDS to T2V models should be ablated. - The paper's philosophy is very close to that of DreamMotion (ECCV 2024), which is also a training-free, score distillation-based video editing method. That work also exploits decomposed spatial and temporal attention maps for its guidance term. There should be a discussion on how STR-Match is different from
1. Strong results – the qualitative examples (in the supplementary HTML and ZIP files) are impressive. Even for non-trivial edits involving large domain shifts, STR-Match maintains high fidelity and coherence. 2. Clarity and presentation – the paper is well organized and easy to follow, with intuitive figures (2, 3) that make the method clearer and easier to read.
1. Dependence on the number of temporal neighbors - the method appears to rely heavily on the number of frames (neighbors) used when computing the spatiotemporal relevance. This parameter directly affects qualitative quality, quantitative scores, and runtime. However, it is not reported or ablated (as far as I could find). Clarifying how many neighbors are used and analyzing the method’s sensitivity to this choice are important. 2. Dataset curation - the paper states that the authors collected the
1. A training-free method for text-guided video editing. 2. A SpatioTemporal Relevance (STR) score is proposed to model pixel relationships across frames.
1. The quality and writing of this paper are inferior. In Fig. 1, Fig. 2, and Fig. 4, the square video samples from the TGVE dataset are distorted into rectangles. The appendix link is also broken. There is no appendix in the paper at all. Why did the authors add it? Therefore, I think this paper is very hasty and cannot meet the submission standards. 2. The core idea of using attention maps from a pre-trained model to guide editing is well-established in image editing (e.g., Prompt-to
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Visual Attention and Saliency Detection
