TL;DR
This paper introduces RTime, a new temporal-emphasized video-text retrieval dataset with reversed videos and challenging negatives, to better evaluate and improve models' temporal understanding in cross-modal retrieval tasks.
Contribution
The paper presents RTime, a novel dataset with reversed videos and enhanced negatives, specifically designed to assess and advance temporal understanding in video-text retrieval models.
Findings
RTime is more challenging for existing models than widely used video-text benchmarks.
Models score lower on RTime than on those traditional benchmarks.
The dataset supports the development of more temporally aware retrieval methods.
Abstract
Cross-modal (e.g. image-text, video-text) retrieval is an important task in the fields of information retrieval and multimodal vision-language understanding. Temporal understanding makes video-text retrieval more challenging than image-text retrieval. However, we find that the widely used video-text benchmarks fall short in comprehensively assessing models' abilities, especially temporal understanding, so that large-scale image-text pre-trained models can already achieve zero-shot performance comparable to that of video-text pre-trained models. In this paper, we introduce RTime, a novel temporal-emphasized video-text retrieval dataset. We first obtain videos of actions or events with significant temporality, and then reverse these videos to create harder negative samples. We then recruit annotators to judge the significance and reversibility of candidate videos, and write captions for…
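The reversal step described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the synthetic frames, and the mean-pooling "encoder" are all assumptions made for illustration. It shows why a reversed video is a hard negative: a representation that ignores frame order cannot tell the original from its reversal, even though the two clips differ frame by frame.

```python
import numpy as np


def reverse_video(frames: np.ndarray) -> np.ndarray:
    """Return the frames in reverse temporal order.

    `frames` is assumed to have shape (T, H, W, C), with time on axis 0.
    """
    return frames[::-1].copy()


def mean_pooled_feature(frames: np.ndarray) -> np.ndarray:
    """Hypothetical order-agnostic video feature: average over all frames.

    This stands in for any encoder that pools per-frame features without
    modeling their order; it is not a real model.
    """
    return frames.reshape(frames.shape[0], -1).mean(axis=0)


# Synthetic stand-in for a short clip: 8 random frames of size 4x4x3.
rng = np.random.default_rng(0)
video = rng.random((8, 4, 4, 3))
reversed_video = reverse_video(video)

# The pooled features match (up to floating-point error) ...
same_pooled = np.allclose(
    mean_pooled_feature(video), mean_pooled_feature(reversed_video)
)
# ... although the two clips are not the same sequence of frames.
same_frames = np.array_equal(video, reversed_video)

print("pooled features identical:", same_pooled)
print("frame sequences identical:", same_frames)
```

Under these assumptions, separating a caption for the original clip from a caption for its reversal requires a model that encodes frame order, which is the ability the dataset is designed to test.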
Taxonomy
Methods: Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Linear Layer · Softmax · Dense Connections · Dropout · Residual Connection · Multi-Head Attention · Adam
