Learning Joint Representations of Videos and Sentences with Web Image Search
Mayu Otani, Yuta Nakashima, Esa Rahtu, Janne Heikkilä, Naokazu Yokoya

TL;DR
This paper introduces a joint embedding model for videos, sentences, and images that leverages web image search to improve fine-grained visual concept understanding, enhancing video and sentence retrieval performance.
Contribution
The paper proposes a novel embedding approach that incorporates web image search for disambiguating visual concepts and trains models for video, sentence, and image inputs simultaneously.
Findings
Improved accuracy over prior state-of-the-art methods in video and sentence retrieval tasks.
Description generation performance comparable to the state of the art.
Effective use of web images for fine-grained visual concept disambiguation.
Abstract
Our objective is video retrieval based on natural language queries. In addition, we consider the analogous problem of retrieving sentences or generating descriptions given an input video. Recent work has addressed the problem by embedding visual and textual inputs into a common space where semantic similarities correlate to distances. We also adopt the embedding approach, and make the following contributions: First, we utilize web image search in sentence embedding process to disambiguate fine-grained visual concepts. Second, we propose embedding models for sentence, image, and video inputs whose parameters are learned simultaneously. Finally, we show how the proposed model can be applied to description generation. Overall, we observe a clear improvement over the state-of-the-art methods in the video and sentence retrieval tasks. In description generation, the performance level is…
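The idea of a common space where semantic similarity correlates with distance can be illustrated with a minimal sketch. It is not the authors' model: the vectors are made-up toy values standing in for learned embeddings, and the names rank_videos, l2_normalize, and euclidean_distance are hypothetical helpers introduced only for this illustration. It assumes a sentence query and a set of videos have already been mapped into the same space, and shows retrieval as sorting by distance.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_videos(sentence_vec, video_vecs):
    """Return video ids ordered from nearest to farthest from the query.

    sentence_vec: embedding of the query sentence (assumed precomputed).
    video_vecs: dict mapping a video id to its embedding (assumed precomputed).
    """
    query = l2_normalize(sentence_vec)
    scored = [
        (euclidean_distance(query, l2_normalize(vec)), video_id)
        for video_id, vec in video_vecs.items()
    ]
    return [video_id for _, video_id in sorted(scored)]


# Toy embeddings, invented for illustration only.
video_vecs = {
    "video_a": [0.9, 0.1, 0.0],
    "video_b": [0.0, 1.0, 0.2],
    "video_c": [0.5, 0.5, 0.5],
}
query_vec = [1.0, 0.2, 0.1]

ranking = rank_videos(query_vec, video_vecs)
print(ranking)
```

Swapping the roles (one video embedding as the query, sentence embeddings as the candidates) gives the analogous sentence-retrieval direction mentioned in the abstract.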
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
