Weakly-Supervised Video Object Grounding from Text by Loss Weighting and Object Interaction
Luowei Zhou, Nathan Louis, Jason J. Corso

TL;DR
This paper introduces a weakly-supervised method for video object grounding that propagates supervision across frames and leverages object interactions, improving localization accuracy without bounding box annotations.
Contribution
It proposes a novel loss weighting strategy and utilizes object interactions to enhance weakly-supervised video object grounding performance.
Findings
Improves grounding accuracy on the YouCook2-BoundingBox benchmark.
Propagates supervision from the segment level to individual frames, which helps when an object appears in only a few frames.
Leverages object interactions as textual cues for better grounding.
Abstract
We study weakly-supervised video object grounding: given a video segment and a corresponding descriptive sentence, the goal is to localize objects that are mentioned from the sentence in the video. During training, no object bounding boxes are available, but the set of possible objects to be grounded is known beforehand. Existing approaches in the image domain use Multiple Instance Learning (MIL) to ground objects by enforcing matches between visual and semantic features. A naive extension of this approach to the video domain is to treat the entire segment as a bag of spatial object proposals. However, an object existing sparsely across multiple frames might not be detected completely since successfully spotting it from one single frame would trigger a satisfactory match. To this end, we propagate the weak supervisory signal from the segment level to frames that likely contain the…
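The limitation of the naive segment-level bag can be illustrated with a small sketch. It is not the authors' model or loss. It only contrasts two ways of pooling proposal scores, and the similarity values, function names, and array shape (frames by proposals) are assumptions made for illustration.

```python
import numpy as np


def segment_level_score(sim):
    """Naive video MIL: treat the whole segment as one bag of proposals.

    sim has shape (num_frames, num_proposals) and holds hypothetical
    visual-semantic similarities for a single object query.
    """
    return float(sim.max())


def frame_level_scores(sim):
    """Frame-wise bags: keep the best-matching proposal in each frame."""
    return sim.max(axis=1)


# Hypothetical similarities: 3 frames x 4 proposals per frame.
sim = np.array([
    [0.10, 0.20, 0.05, 0.15],
    [0.90, 0.30, 0.10, 0.20],
    [0.25, 0.40, 0.35, 0.10],
])

seg = segment_level_score(sim)
per_frame = frame_level_scores(sim)

# The segment-level maximum is decided by a single (frame, proposal) pair,
# so a loss built on it alone says nothing about the other frames.
best_frame, best_proposal = np.unravel_index(sim.argmax(), sim.shape)

print("segment-level score:", seg)
print("frame-level scores:", per_frame.tolist())
print("pair that determines the segment score:", int(best_frame), int(best_proposal))
```

In this toy case the segment-level score is set entirely by one proposal in the second frame, while the frame-level scores show that the other two frames match only weakly. A segment-level objective would not register that gap, which is the motivation the abstract gives for pushing supervision down to frames.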
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
