Rudder: A Cross Lingual Video and Text Retrieval Dataset
Jayaprakash A, Abhishek, Rishabh Dabral, Ganesh Ramakrishnan, Preethi Jyothi

TL;DR
This paper introduces Rudder, a multilingual video-text retrieval dataset, and proposes a partial order loss to improve joint embeddings, especially in data-scarce multilingual settings, outperforming traditional loss functions.
Contribution
The paper presents Rudder, a new multilingual dataset for video-text retrieval, and introduces a partial order loss that enhances embedding quality in low-data scenarios.
Findings
Partial order loss outperforms max-margin and triplet losses.
Significant improvements in retrieval performance on MSR-VTT and DiDeMo.
Cross-lingual training enhances retrieval accuracy across languages.
Abstract
Video retrieval using natural language queries requires learning semantically meaningful joint embeddings between the text and the audio-visual input. Often, such joint embeddings are learnt using pairwise (or triplet) contrastive loss objectives which cannot give enough attention to 'difficult-to-retrieve' samples during training. This problem is especially pronounced in data-scarce settings where the data is relatively small (10% of the large scale MSR-VTT) to cover the rather complex audio-visual embedding space. In this context, we introduce Rudder - a multilingual video-text retrieval dataset that includes audio and textual captions in Marathi, Hindi, Tamil, Kannada, Malayalam and Telugu. Furthermore, we propose to compensate for data scarcity by using domain knowledge to augment supervision. To this end, in addition to the conventional three samples of a triplet (anchor, positive,…
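The conventional triplet objective that the abstract contrasts against can be sketched as follows. This is a minimal illustration of a standard max-margin triplet loss over cosine similarities, not the paper's partial order loss, whose definition is not given in the text above. The function names, the toy two-dimensional embeddings, and the margin value of 0.2 are all assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Standard max-margin triplet loss (hypothetical helper, margin assumed).

    The loss is zero once the anchor is more similar to the positive than
    to the negative by at least `margin`; otherwise it grows linearly.
    """
    s_pos = cosine_similarity(anchor, positive)
    s_neg = cosine_similarity(anchor, negative)
    return max(0.0, margin + s_neg - s_pos)


# Toy embeddings: a text query (anchor), its matching video (positive),
# and two non-matching videos (an easy and a hard negative).
query = [1.0, 0.0]
matching_video = [1.0, 0.0]
easy_negative = [0.0, 1.0]   # orthogonal to the query
hard_negative = [1.0, 0.1]   # almost identical to the query

easy_loss = triplet_margin_loss(query, matching_video, easy_negative)
hard_loss = triplet_margin_loss(query, matching_video, hard_negative)
```

In this sketch the easy negative already satisfies the margin and contributes no loss, while the hard negative still contributes a positive loss. Each triplet is scored on its own with a single fixed margin, which is the setting the abstract describes as giving too little attention to difficult-to-retrieve samples.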
Taxonomy
Topics: Multimodal Machine Learning Applications · Music and Audio Processing · Human Pose and Action Recognition
