mTVR: Multilingual Moment Retrieval in Videos
Jie Lei, Tamara L. Berg, Mohit Bansal

TL;DR
This paper introduces mTVR, a large-scale multilingual video moment retrieval dataset with paired English and Chinese queries, and proposes mXML, a model trained on both languages that outperforms monolingual baselines while using fewer parameters.
Contribution
The paper presents mTVR, a new multilingual dataset for video moment retrieval, and mXML, a multilingual retrieval model that combines encoder parameter sharing with language neighborhood constraints.
Findings
mXML outperforms strong monolingual baselines on mTVR
mTVR is multilingual, larger than existing moment retrieval datasets, and comes with diverse annotations
mXML achieves comparable or better accuracy with fewer parameters
Abstract
We introduce mTVR, a large-scale multilingual video moment retrieval dataset, containing 218K English and Chinese queries from 21.8K TV show video clips. The dataset is collected by extending the popular TVR dataset (in English) with paired Chinese queries and subtitles. Compared to existing moment retrieval datasets, mTVR is multilingual, larger, and comes with diverse annotations. We further propose mXML, a multilingual moment retrieval model that learns and operates on data from both languages, via encoder parameter sharing and language neighborhood constraints. We demonstrate the effectiveness of mXML on the newly collected mTVR dataset, where mXML outperforms strong monolingual baselines while using fewer parameters. We also provide detailed dataset analyses and model ablations. Data and code are publicly available at https://github.com/jayleicn/mTVRetrieval
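To make the two ideas named in the abstract more concrete, here is a minimal sketch of what "encoder parameter sharing" and a "language neighborhood constraint" could look like. It is not the authors' implementation. It assumes a single linear map applied to mean-pooled token vectors for both languages, and it assumes the constraint is a margin-based hinge loss that pulls each English query toward its paired Chinese query and away from unpaired ones. The function names, the pooling step, and the margin value are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def shared_encode(token_vectors, weights):
    """Mean-pool token vectors, then apply one linear map.

    Assumption: the same `weights` matrix is used for English and Chinese
    queries, which is how parameter sharing is illustrated here.
    """
    dim = len(token_vectors[0])
    pooled = [sum(t[i] for t in token_vectors) / len(token_vectors) for i in range(dim)]
    return [sum(w * x for w, x in zip(row, pooled)) for row in weights]


def neighborhood_loss(en_embs, zh_embs, margin=0.2):
    """Hypothetical neighborhood constraint as a hinge loss.

    en_embs[i] and zh_embs[i] are assumed to be embeddings of a paired
    English/Chinese query. Each English query should be more similar to its
    paired Chinese query than to any other Chinese query, by at least `margin`.
    """
    total, count = 0.0, 0
    for i, en in enumerate(en_embs):
        pos = cosine(en, zh_embs[i])
        for j, zh in enumerate(zh_embs):
            if j != i:
                total += max(0.0, margin + cosine(en, zh) - pos)
                count += 1
    return total / count if count else 0.0
```

Under these assumptions, the loss is zero when paired queries already sit closer together than unpaired ones by the margin, and it grows as pairs drift apart. Sharing one encoder across languages is also the kind of design that would reduce parameter count relative to two separate monolingual models.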
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
