TL;DR
This paper introduces a feature-space multimodal data augmentation method for text-video retrieval. It improves retrieval performance by generating new training samples through semantic mixing, which avoids raw-data transformations and removes the need to share raw data that may be restricted by copyright.
Contribution
The paper presents a novel feature-space augmentation technique for text-video retrieval that improves accuracy without relying on resource-intensive raw data modifications.
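As a rough illustration of what augmenting in feature space can look like, here is a minimal sketch. It assumes a mixup-style convex interpolation between the pre-extracted features of two video-caption pairs that are considered semantically similar. This summary does not specify the paper's actual mixing rule, so the function name `mix_features`, the `alpha` parameter, and the Beta-distributed coefficient are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def mix_features(video_a, text_a, video_b, text_b, alpha=1.0, rng=None):
    """Create one synthetic (video, caption) feature pair from two real pairs.

    Hypothetical helper: assumes a mixup-style rule, which may differ from
    the method in the paper. Inputs are pre-extracted feature vectors, so no
    raw video or text is needed.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Mixing coefficient in [0, 1]; Beta(alpha, alpha) is the usual mixup choice.
    lam = rng.beta(alpha, alpha)
    # Use the same coefficient for both modalities so the new video and
    # the new caption stay aligned with each other.
    video_mix = lam * video_a + (1.0 - lam) * video_b
    text_mix = lam * text_a + (1.0 - lam) * text_b
    return video_mix, text_mix, lam


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy features standing in for the outputs of frozen video/text encoders.
    video_a, video_b = rng.normal(size=8), rng.normal(size=8)
    text_a, text_b = rng.normal(size=6), rng.normal(size=6)
    v_new, t_new, lam = mix_features(video_a, text_a, video_b, text_b, rng=rng)
    print(v_new.shape, t_new.shape, round(float(lam), 3))
```

Because the sketch operates only on feature vectors, it needs neither decoding nor storing raw clips, which is the practical advantage the summary attributes to feature-space augmentation.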
Findings
Significant performance improvements on the EPIC-Kitchens-100 dataset
Achieved state-of-the-art results with the proposed method
Conducted extensive ablation studies confirming effectiveness
Abstract
Every hour, huge amounts of visual content are posted on social media and user-generated content platforms. To find relevant videos by means of a natural language query, text-video retrieval methods have received increased attention over the past few years. Data augmentation techniques were introduced to increase the performance on unseen test examples by creating new training samples with the application of semantics-preserving techniques, such as color space or geometric transformations on images. Yet, these techniques are usually applied to raw data, leading to more resource-demanding solutions and also requiring the shareability of the raw data, which may not always be possible, e.g., because of copyright issues with clips from movies or TV series. To address this shortcoming, we propose a multimodal data augmentation technique which works in the feature space and creates new videos and captions…
