Efficient Temporal Sentence Grounding in Videos with Multi-Teacher Knowledge Distillation

Renjie Liang; Yiming Yang; Hui Lu; Li Li

arXiv:2308.03725 · cs.CV · July 25, 2024

TL;DR

This paper introduces an efficient multi-teacher knowledge distillation framework for temporal sentence grounding in videos, achieving high accuracy with reduced computational complexity.

Contribution

The paper proposes EMTM, an efficient multi-teacher model that unifies the outputs of heterogeneous teachers into a single form and adds a knowledge aggregation unit and a shared encoder strategy to improve efficiency and performance.

Findings

1. Outperforms existing methods on three benchmarks.
2. Reduces computational load while maintaining accuracy.
3. Effectively integrates diverse teacher knowledge.

Abstract

Temporal Sentence Grounding in Videos (TSGV) aims to detect the timestamps of the event described by a natural language query in untrimmed videos. This paper addresses the challenge of achieving efficient computation in TSGV models while maintaining high performance. Most existing approaches design elaborate architectures that improve accuracy through extra layers and losses, and they become inefficient and heavy as a result. Although some works have noticed this, they only address the feature fusion layers, which yields little speedup when the rest of the network remains cumbersome. To tackle this problem, we propose a novel efficient multi-teacher model (EMTM) based on knowledge distillation to transfer diverse knowledge from both heterogeneous and isomorphic networks. Specifically, we first unify the different outputs of the heterogeneous models into a single form. Next, a Knowledge…
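
The abstract is cut off before it describes the aggregation step, so the following is only a generic sketch of multi-teacher knowledge distillation, not EMTM's actual Knowledge Aggregation Unit. It assumes that each teacher emits logits over candidate clip positions (for example, start boundaries), that the teacher distributions are combined with fixed weights, and that the student is trained with a temperature-softened KL divergence. All function names, the weighting scheme, and the temperature value are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into a probability distribution, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_teachers(teacher_logits, weights, temperature=1.0):
    """Weighted average of the teachers' softened distributions.

    Assumption: fixed, user-supplied weights. The paper's aggregation
    unit may compute them differently.
    """
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    weight_sum = sum(weights)
    size = len(probs[0])
    return [
        sum(w * p[i] for w, p in zip(weights, probs)) / weight_sum
        for i in range(size)
    ]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(student_logits, teacher_logits, weights, temperature=2.0):
    """KL divergence from the aggregated teacher target to the student."""
    target = aggregate_teachers(teacher_logits, weights, temperature)
    student = softmax(student_logits, temperature)
    return kl_divergence(target, student)


# Hypothetical example: two teachers scoring five candidate start positions.
teachers = [[0.1, 2.0, 0.3, 0.0, -1.0], [0.0, 1.5, 1.0, 0.2, -0.5]]
student = [0.0, 0.5, 0.4, 0.1, 0.0]
loss = distillation_loss(student, teachers, weights=[0.5, 0.5])
```

In practice a loss of this kind is usually added to the ordinary supervised grounding loss, so the student learns from both the ground-truth timestamps and the softened teacher predictions.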

Code & Models

Repositories

renjie-liang/emet · tf · Official

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Human Pose and Action Recognition

Methods: Knowledge Distillation · ALIGN