Text-driven Online Action Detection

Manuel Benavent-Lledo; David Mulero-Pérez; David Ortiz-Perez; Jose Garcia-Rodriguez

arXiv:2501.13518·cs.CV·January 30, 2026

TL;DR

This paper introduces TOAD, a text-driven online action detection model built on CLIP textual embeddings. It reaches 82.46% mAP on THUMOS14 and supports zero-shot and few-shot learning without significant computational overhead.

Contribution

The paper presents TOAD, the first architecture to leverage vision-language models for online action detection, enabling efficient zero-shot and few-shot learning.

Findings

1. Achieves 82.46% mAP on the THUMOS14 dataset.

2. Outperforms existing methods in zero-shot and few-shot settings.

3. Sets new benchmarks for online action detection performance.

Abstract

Detecting actions as they occur is essential for applications like video surveillance, autonomous driving, and human-robot interaction. Known as online action detection, this task requires classifying actions in streaming videos, handling background noise, and coping with incomplete actions. Transformer architectures are the current state-of-the-art, yet the potential of recent advancements in computer vision, particularly vision-language models (VLMs), remains largely untapped for this problem, partly due to high computational costs. In this paper, we introduce TOAD: a Text-driven Online Action Detection architecture that supports zero-shot and few-shot learning. TOAD leverages CLIP (Contrastive Language-Image Pretraining) textual embeddings, enabling efficient use of VLMs without significant computational overhead. Our model achieves 82.46% mAP on the THUMOS14 dataset, outperforming…
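To make the text-driven idea concrete, here is a minimal sketch of CLIP-style zero-shot scoring, in which each incoming frame embedding is compared against one text embedding per class. It is not TOAD's architecture. The function names, the temperature value, the class labels, and the random vectors standing in for real CLIP embeddings are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def zero_shot_scores(frame_emb, text_emb, temperature=0.01):
    """Return per-frame class probabilities from cosine similarity to text embeddings.

    frame_emb: (num_frames, dim) visual embeddings, one per streamed frame.
    text_emb:  (num_classes, dim) embeddings of class prompts.
    temperature: assumed softmax temperature (hypothetical value).
    """
    f = l2_normalize(frame_emb)
    t = l2_normalize(text_emb)
    logits = f @ t.T / temperature
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)


# Stand-in data: random vectors replace real CLIP outputs (assumption).
rng = np.random.default_rng(0)
class_names = ["background", "high jump", "pole vault"]  # hypothetical labels
text_emb = rng.normal(size=(len(class_names), 512))

# Simulated stream: each frame is a noisy copy of its class's text embedding.
true_labels = np.array([0, 1, 1, 2])
frame_emb = text_emb[true_labels] + 0.1 * rng.normal(size=(len(true_labels), 512))

probs = zero_shot_scores(frame_emb, text_emb)
pred = probs.argmax(axis=-1)
for step, k in enumerate(pred):
    print(f"frame {step}: {class_names[k]}")
```

Because the class set is defined only by the rows of `text_emb`, adding a new action in this sketch means adding one more prompt embedding rather than retraining a classifier head, which is the general property that makes zero-shot use possible.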

Code & Models

Repositories

3dperceptionlab/toad
PyTorch · Official

Taxonomy

Methods: Attention Is All You Need · Adam · Softmax · Absolute Position Encodings · Residual Connection · Dropout · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer