Knowledge Distillation for Efficient Audio-Visual Video Captioning

Özkan Çaylı; Xubo Liu; Volkan Kılıç; Wenwu Wang

arXiv:2306.09947 · eess.AS · June 19, 2023 · EUSIPCO · 1 citation

Open Access

TL;DR

This paper combines knowledge distillation with a simple pooling front-end and frame down-sampling to build a lightweight audio-visual video captioning model, cutting inference time by 80% with less than a 0.02% drop in captioning accuracy on MSR-VTT.

Contribution

The paper presents a method that shrinks the model and shortens inference for video captioning by distilling knowledge from a teacher model into a student that processes a reduced number of pooled, down-sampled audio-visual frames.

Findings

01. 80% reduction in inference time
02. Less than 0.02% decrease in captioning accuracy
03. Effective model compression for deployment on low-power devices

Abstract

Automatically describing audio-visual content with text, namely video captioning, has received significant attention due to its potential applications across diverse fields. Deep neural networks are the dominant methods, offering state-of-the-art performance. However, these methods often cannot be deployed on low-power devices such as smartphones because of their large number of model parameters. In this paper, we propose to exploit a simple pooling front-end and down-sampling algorithms with knowledge distillation for audio and visual attributes using a reduced number of audio-visual frames. With the help of knowledge distillation from the teacher model, our proposed method greatly reduces the redundant information in audio-visual streams without losing critical context for caption generation. Extensive experimental evaluations on the MSR-VTT dataset demonstrate that our proposed approach…
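The abstract names three ingredients: frame down-sampling, a pooling front-end, and knowledge distillation from a teacher model. The sketch below illustrates each on toy data in plain Python. It is not the authors' implementation. The function names, the uniform stride, the mean-pooling window, and the temperature are all hypothetical, and the abstract does not say which distillation loss is used, so the generic soft-target KL divergence is swapped in purely for illustration.

```python
import math


def downsample(frames, stride):
    """Keep every `stride`-th frame (assumed uniform down-sampling)."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return frames[::stride]


def mean_pool(frames, window):
    """Average non-overlapping groups of `window` feature vectors."""
    if window < 1:
        raise ValueError("window must be >= 1")
    pooled = []
    for start in range(0, len(frames), window):
        group = frames[start:start + window]
        dim = len(group[0])
        pooled.append([sum(vec[d] for vec in group) / len(group)
                       for d in range(dim)])
    return pooled


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumption: generic soft-target distillation, not necessarily the
    loss used in the paper.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Toy stream: 8 frames, each a 2-dimensional feature vector.
frames = [[float(i), float(2 * i)] for i in range(8)]
kept = downsample(frames, stride=2)   # frames 0, 2, 4, 6
pooled = mean_pool(kept, window=2)    # 2 pooled vectors for the student
```

In this toy setup the student would consume 2 pooled vectors where the teacher sees all 8 frames, and `distillation_loss` would be added to the captioning loss during training so the student's predictions stay close to the teacher's despite the reduced input.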

Taxonomy

Topics: Video Analysis and Summarization · Subtitles and Audiovisual Media · Multimodal Machine Learning Applications

Methods: Knowledge Distillation