SpotEM: Efficient Video Search for Episodic Memory
Santhosh Kumar Ramakrishnan, Ziad Al-Halah, Kristen Grauman

TL;DR
SpotEM makes episodic-memory video search efficient by learning to select promising video regions for a given language query and by using low-cost semantic indexing features, which cuts computation while keeping accuracy close to that of the underlying EM model.
Contribution
It presents a novel clip selector, semantic indexing features, and distillation losses to improve efficiency in long video search for episodic memory.
Findings
Reduces clip feature computation to 10-25% of the original cost.
Maintains 84-97% of the accuracy of the original EM model.
Effective across multiple EM models and long videos.
Abstract
The goal in episodic memory (EM) is to search a long egocentric video to answer a natural language query (e.g., "where did I leave my purse?"). Existing EM methods exhaustively extract expensive fixed-length clip features to look everywhere in the video for the answer, which is infeasible for long wearable-camera videos that span hours or even days. We propose SpotEM, an approach to achieve efficiency for a given EM method while maintaining good accuracy. SpotEM consists of three key ideas: 1) a novel clip selector that learns to identify promising video regions to search conditioned on the language query; 2) a set of low-cost semantic indexing features that capture the context of rooms, objects, and interactions that suggest where to look; and 3) distillation losses that address the optimization issues arising from end-to-end joint training of the clip selector and EM model. Our…
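The selection idea described in the abstract can be illustrated with a toy sketch. This is not the authors' learned clip selector: the dot-product scoring, the `budget` parameter, and the function names `select_clips` and `extract_selected` are all assumptions made for illustration. It only shows the general pattern of scoring every clip with cheap features conditioned on the query, then running the expensive feature extractor on a small fraction of the clips.

```python
from typing import Callable, Dict, List, Sequence


def select_clips(cheap_feats: Sequence[Sequence[float]],
                 query_vec: Sequence[float],
                 budget: float) -> List[int]:
    """Pick the clips worth searching, returned in temporal order.

    cheap_feats: one low-cost feature vector per clip (hypothetical stand-in
                 for semantic indexing features).
    query_vec:   an embedding of the language query (hypothetical).
    budget:      fraction of clips allowed to receive expensive features.
    """
    if not 0.0 < budget <= 1.0:
        raise ValueError("budget must be in (0, 1]")
    # Assumed relevance score: dot product between cheap feature and query.
    scores = [sum(f * q for f, q in zip(feat, query_vec)) for feat in cheap_feats]
    # Keep at least one clip so the downstream model always has input.
    k = max(1, round(budget * len(cheap_feats)))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def extract_selected(clips: Sequence[object],
                     selected: Sequence[int],
                     expensive_extractor: Callable[[object], object]) -> Dict[int, object]:
    """Run the costly clip-feature extractor only on the selected clips."""
    return {i: expensive_extractor(clips[i]) for i in selected}
```

With a budget of 0.25 and eight clips, only two clips reach the expensive extractor; the rest are skipped. In the paper the selector is learned jointly with the EM model, which a fixed scoring rule like this does not capture.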
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
Methods: Contrastive Language-Image Pre-training
