Efficient Temporal Sentence Grounding in Videos with Multi-Teacher Knowledge Distillation
Renjie Liang, Yiming Yang, Hui Lu, Li Li

TL;DR
This paper introduces an efficient multi-teacher knowledge distillation framework for temporal sentence grounding in videos, achieving high accuracy with reduced computational complexity.
Contribution
It proposes EMTM, a novel efficient multi-teacher model, with a unified feature approach, a knowledge aggregation unit, and a shared encoder strategy to improve efficiency and performance.
Findings
Outperforms existing methods on three benchmarks.
Reduces computational load while maintaining accuracy.
Effectively integrates diverse teacher knowledge.
Abstract
Temporal Sentence Grounding in Videos (TSGV) aims to detect the timestamps of the event described by a natural language query in untrimmed videos. This paper addresses the challenge of achieving efficient computation in TSGV models while maintaining high performance. Most existing approaches design elaborate architectures that improve accuracy with extra layers and losses, at the cost of inefficiency and heaviness. Although some works have noticed this, they address only the feature fusion layers, so the speedup is hardly felt across the rest of the bulky network. To tackle this problem, we propose a novel efficient multi-teacher model (EMTM) based on knowledge distillation to transfer diverse knowledge from both heterogeneous and isomorphic networks. Specifically, we first unify the different outputs of the heterogeneous models into a single form. Next, a Knowledge…
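The abstract is truncated before it describes how teacher knowledge is combined, so the following is only a minimal sketch of a generic multi-teacher distillation loss, not the paper's knowledge aggregation unit. It assumes each teacher's output has already been unified into logits over video frames for one boundary (for example, the start timestamp); the function names, the uniform default weights, and the temperature value are all hypothetical choices made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_teachers(teacher_logits, weights=None, temperature=1.0):
    """Weighted average of the teachers' softened distributions.

    Assumption: every teacher emits logits over the same frames.
    Uniform weights are used when none are given.
    """
    count = len(teacher_logits)
    if weights is None:
        weights = [1.0 / count] * count
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    frames = len(probs[0])
    return [sum(w * p[i] for w, p in zip(weights, probs)) for i in range(frames)]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions."""
    return sum(t * math.log((t + eps) / (p + eps)) for t, p in zip(target, predicted))


def distillation_loss(student_logits, teacher_logits, weights=None, temperature=2.0):
    """KL between the aggregated teacher distribution and the student.

    The temperature**2 factor is the usual rescaling that keeps gradient
    magnitudes comparable across temperatures.
    """
    soft_target = aggregate_teachers(teacher_logits, weights, temperature)
    student_probs = softmax(student_logits, temperature)
    return (temperature ** 2) * kl_divergence(soft_target, student_probs)


if __name__ == "__main__":
    teachers = [[2.0, 0.5, -1.0], [1.0, 1.5, 0.0]]  # two teachers, three frames
    student = [0.3, 0.2, 0.1]
    print(distillation_loss(student, teachers))
```

In practice this term would typically be added to the student's ordinary grounding loss, and the teacher weights could be learned rather than fixed; both choices are outside what the truncated abstract specifies.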
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Human Pose and Action Recognition
Methods: Knowledge Distillation · ALIGN
