C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval

Andrew Rouditchenko; Yung-Sung Chuang; Nina Shvetsova; Samuel Thomas; Rogerio Feris; Brian Kingsbury; Leonid Karlinsky; David Harwath; Hilde Kuehne; James Glass

arXiv:2210.03625·cs.CL·May 11, 2023

TL;DR

This paper introduces a cross-lingual, cross-modal knowledge distillation approach to enhance multilingual text-video retrieval, leveraging English-based teacher models and a new multilingual dataset to improve performance across languages.

Contribution

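The paper proposes a knowledge distillation method in which teacher models that take English text as input supervise a student model that takes text in other languages. It also introduces Multi-YouCook2, a multilingual video dataset built by translating the English captions of YouCook2 into 8 other languages.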

Findings

01. Significant performance improvements on Multi-YouCook2 and other datasets.

02. Effective use of English teacher models for multilingual retrieval.

03. Availability of code, models, and dataset for future research.

Abstract

Multilingual text-video retrieval methods have improved significantly in recent years, but the performance for other languages lags behind English. We propose a Cross-Lingual Cross-Modal Knowledge Distillation method to improve multilingual text-video retrieval. Inspired by the fact that English text-video retrieval outperforms other languages, we train a student model using input text in different languages to match the cross-modal predictions from teacher models using input text in English. We propose a cross-entropy-based objective which forces the distribution over the student's text-video similarity scores to be similar to that of the teacher models. We introduce a new multilingual video dataset, Multi-YouCook2, by translating the English captions in the YouCook2 video dataset to 8 other languages. Our method improves multilingual text-video retrieval performance on Multi-YouCook2…
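The distillation objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (the official code is in the repository listed under Code & Models). It only shows the general idea of a cross-entropy between a teacher's and a student's distribution over text-video similarity scores. The function names, the `temperature` parameter, the averaging of several teachers' distributions, and the toy similarity matrices are all assumptions made for illustration.

```python
import math


def softmax(scores, temperature=1.0):
    """Turn one row of text-video similarity scores into a distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_cross_entropy(student_sim, teacher_sims, temperature=1.0):
    """Cross-entropy between teacher and student distributions over videos.

    student_sim:  rows = text queries (non-English), columns = videos.
    teacher_sims: list of matrices of the same shape, computed from the
                  English version of each query.
    Averaging the teachers' distributions is an assumption of this sketch.
    """
    num_queries = len(student_sim)
    loss = 0.0
    for i in range(num_queries):
        teacher_rows = [softmax(t[i], temperature) for t in teacher_sims]
        # Average the teacher distributions column by column.
        target = [sum(col) / len(teacher_rows) for col in zip(*teacher_rows)]
        pred = softmax(student_sim[i], temperature)
        loss -= sum(p * math.log(q) for p, q in zip(target, pred))
    return loss / num_queries


# Toy data (hypothetical): 3 text queries scored against 3 videos.
teacher_en = [[4.0, 1.0, 0.5], [0.8, 3.5, 1.2], [0.3, 1.1, 3.9]]
student_good = [row[:] for row in teacher_en]  # matches the teacher exactly
student_bad = [[1.0, 1.2, 0.9], [1.1, 1.0, 1.3], [1.2, 0.9, 1.0]]

loss_good = distillation_cross_entropy(student_good, [teacher_en])
loss_bad = distillation_cross_entropy(student_bad, [teacher_en])
print(loss_good < loss_bad)
```

With a single teacher, the loss reaches its minimum (the entropy of the teacher distribution) when the student's distribution equals the teacher's, so a student whose scores diverge from the teacher's is penalized more heavily.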

Code & Models

Repositories

roudimit/c2kd (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization

Methods: Knowledge Distillation