The VISIONE Video Search System: Exploiting Off-the-Shelf Text Search Engines for Large-Scale Video Retrieval
Giuseppe Amato, Paolo Bolettieri, Fabio Carrara, Franca Debole, Fabrizio Falchi, Claudio Gennaro, Lucia Vadicamo, Claudio Vairo

TL;DR
VISIONE is a flexible large-scale video search system that leverages off-the-shelf text search engines by encoding visual and metadata features into textual form, enabling complex multi-modal queries.
Contribution
The paper introduces VISIONE, a novel video retrieval system that encodes visual features and metadata into text for efficient search using standard text search engines.
Findings
System achieves effective retrieval performance.
Fine-tuning improves search accuracy.
Supports complex multi-modal queries.
Abstract
In this paper, we describe in detail VISIONE, a video search system that allows users to search for videos using textual keywords, occurrence of objects and their spatial relationships, occurrence of colors and their spatial relationships, and image similarity. These modalities can be combined to express complex queries and satisfy user needs. The peculiarity of our approach is that we encode all the information extracted from the keyframes, such as visual deep features, tags, color and object locations, using a convenient textual encoding indexed in a single text retrieval engine. This offers great flexibility when results corresponding to various parts of the query (visual, text and locations) have to be merged. In addition, we report an extensive analysis of the system retrieval performance, using the query logs generated during the Video Browser Showdown (VBS) 2019…
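The idea of turning object locations into text that a standard search engine can index can be illustrated with a minimal sketch. The abstract does not specify the encoding, so everything here is an assumption made for illustration: the 7x7 grid, the `r{row}c{col}_{label}` token format, and the function name `encode_objects` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# Bounding box as (x_min, y_min, x_max, y_max), normalized to [0, 1].
Box = Tuple[float, float, float, float]


def encode_objects(detections: List[Tuple[str, Box]], grid: int = 7) -> str:
    """Encode detections as space-separated tokens, one per overlapped grid cell.

    Hypothetical scheme: the keyframe is split into grid x grid cells and each
    detected object emits a token "r<row>c<col>_<label>" for every cell its
    bounding box touches. The resulting string can be indexed as plain text.
    """
    tokens = []
    for label, (x0, y0, x1, y1) in detections:
        # Clamp indices so a coordinate of exactly 1.0 maps to the last cell.
        c0 = min(int(x0 * grid), grid - 1)
        c1 = min(int(x1 * grid), grid - 1)
        r0 = min(int(y0 * grid), grid - 1)
        r1 = min(int(y1 * grid), grid - 1)
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                tokens.append(f"r{r}c{c}_{label}")
    return " ".join(tokens)


def overlap_score(query_text: str, keyframe_text: str) -> int:
    """Count distinct query tokens that also occur in the keyframe's text."""
    return len(set(query_text.split()) & set(keyframe_text.split()))
```

Under this assumed scheme, a query such as "a dog in the top-left corner" is encoded with the same function and matched against indexed keyframes by token overlap, which is the kind of matching a text search engine already performs. A real engine would apply its own ranking function rather than the raw count used in `overlap_score`.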
