Decompose and Transfer: CoT-Prompting Enhanced Alignment for Open-Vocabulary Temporal Action Detection

Sa Zhu; Wanqian Zhang; Lin Wang; Xiaohua Chen; Chenxu Cui; Jinchao Zhang; Bo Li

arXiv:2603.24030 · cs.CV · March 26, 2026

Open Access

TL;DR

This paper introduces a novel framework that decomposes action labels into phases using large language models, enabling fine-grained alignment and transfer of action knowledge for improved open-vocabulary temporal action detection.

Contribution

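The paper proposes a phase-wise decomposition and alignment framework that uses large language models to split action labels into phase-level descriptions. This supports finer-grained knowledge transfer in OV-TAD than previous methods, which rely solely on global alignment between label-level semantics and visual features.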
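To make the contrast between global and phase-wise alignment concrete, here is a minimal sketch, not the paper's formulation. It assumes frame features and text embeddings are plain vectors in a shared space; the function names `global_score` and `phase_scores`, mean pooling, and cosine similarity are all illustrative choices rather than details taken from the paper.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def global_score(frames: List[Vector], label_emb: Vector) -> float:
    """Global alignment (assumed form): mean-pool all frames, compare to one label embedding."""
    dim = len(frames[0])
    pooled = [sum(f[i] for f in frames) / len(frames) for i in range(dim)]
    return cosine(pooled, label_emb)


def phase_scores(frames: List[Vector], phase_embs: List[Vector]) -> List[List[float]]:
    """Phase-wise alignment (assumed form): one similarity per (phase, frame) pair.

    Rows index phases, columns index frames.
    """
    return [[cosine(f, p) for f in frames] for p in phase_embs]


# Toy data: two frames and two phase embeddings in a 2-D space (hypothetical values).
frames = [[1.0, 0.0], [0.0, 1.0]]
label_emb = [1.0, 1.0]
phase_embs = [[1.0, 0.0], [0.0, 1.0]]

print(global_score(frames, label_emb))
print(phase_scores(frames, phase_embs))
```

In this toy setup the global score only says that the clip matches the label as a whole, while the phase-by-frame matrix also indicates which frames correspond to which phase. That temporal structure is the kind of signal a phase-wise approach can exploit.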

Findings

01 Outperforms existing methods on OV-TAD benchmarks

02 Enables effective transfer of action patterns to unseen categories

03 Improves phase-level semantic alignment and detection accuracy

Abstract

Open-Vocabulary Temporal Action Detection (OV-TAD) aims to classify and localize action segments in untrimmed videos for unseen categories. Previous methods rely solely on global alignment between label-level semantics and visual features, which is insufficient to transfer temporally consistent visual knowledge from seen to unseen classes. To address this, we propose a Phase-wise Decomposition and Alignment (PDA) framework, which enables fine-grained action pattern learning for effective prior knowledge transfer. Specifically, we first introduce the CoT-Prompting Semantic Decomposition (CSD) module, which leverages the chain-of-thought (CoT) reasoning ability of large language models to automatically decompose action labels into coherent phase-level descriptions, emulating human cognitive processes. Then, the Text-infused Foreground Filtering (TIF) module is introduced to adaptively filter…
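The decomposition step described for the CSD module can be pictured with a small sketch. The prompt wording, the default of three phases, and the helper names `build_cot_prompt` and `parse_phases` are hypothetical and are not taken from the paper. The LLM call itself is omitted, and the sample response is hand-written for illustration.

```python
from typing import List


def build_cot_prompt(action_label: str, num_phases: int = 3) -> str:
    """Build a hypothetical chain-of-thought prompt asking an LLM to split an action into phases."""
    return (
        f"Action: {action_label}\n"
        "Let's think step by step about how a person performs this action.\n"
        f"Describe it as {num_phases} consecutive phases in temporal order.\n"
        "Answer with one numbered line per phase, e.g. '1. <description>'."
    )


def parse_phases(llm_response: str) -> List[str]:
    """Extract phase descriptions from numbered lines such as '1. run toward the bar'."""
    phases = []
    for line in llm_response.splitlines():
        head, sep, tail = line.strip().partition(".")
        # Keep only lines of the form '<digits>. <non-empty text>'.
        if sep and head.isdigit() and tail.strip():
            phases.append(tail.strip())
    return phases


prompt = build_cot_prompt("high jump")

# Hand-written stand-in for an LLM reply (illustrative only, not model output).
sample_response = (
    "Reasoning about the motion first\n"
    "1. run toward the bar\n"
    "2. take off and arch over the bar\n"
    "3. land on the mat"
)
phases = parse_phases(sample_response)
print(phases)
```

Each parsed phase description could then be encoded by a text encoder and aligned with video features, instead of aligning the bare label "high jump" alone.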

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Action Observation and Synchronization